Skip to content

Handle concurrently deleted local storage dataset#10480

Open
jmpesp wants to merge 1 commit into
oxidecomputer:mainfrom
jmpesp:handle_gone_local_storage_dataset
Open

Handle concurrently deleted local storage dataset#10480
jmpesp wants to merge 1 commit into
oxidecomputer:mainfrom
jmpesp:handle_gone_local_storage_dataset

Conversation

@jmpesp
Copy link
Copy Markdown
Contributor

@jmpesp jmpesp commented May 21, 2026

Surface if a dataset is gone, and instead of returning a 500, match on the new DatasetNotFound variant and return Ok() from delete_local_storage_dataset.

Fixes #10477

Surface if a dataset is gone, and instead of returning a 500, match on
the new DatasetNotFound variant and return Ok() from
`delete_local_storage_dataset`.

Fixes oxidecomputer#10477
@jmpesp jmpesp requested a review from leftwo May 21, 2026 19:37
Comment thread illumos-utils/src/zfs.rs
},
}

impl From<SetValueErrorRaw> for EnsureDatasetErrorRaw {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of me kind of wonders if we might want to implement DatasetNotFound as a generic wrapper around any other error that's either the inner error or DatasetNotFound, and if that might reduce some boilerplate here? Something like:

#[derive(thiserror::Error)]
enum DatasetError<T> {
    #[error("Dataset not found")]
    NotFound,
    #[error(transparent)]
    Other(#[from] T)
}

although, it's equally possible that this is actually more annoying than just doing it the current way

Comment thread illumos-utils/src/zfs.rs
Comment on lines +144 to 154
#[error("Dataset not found")]
DatasetNotFound,

#[error("ZFS execution error")]
Execution(#[from] crate::ExecutionError),

#[error("Unexpected output from ZFS commands: {0}")]
Output(String),

#[error("Dataset does not exist")]
DoesNotExist,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between DatasetNotFound and the DoesNotExist error which was already here? should that one be removed and should we be erturning DatasetNotFound instead?

if they have different semantics, it would be nice to have a comment explaining that (and maybe naming that makes the distinction clearer?), because this sure looks confusing.

Comment thread illumos-utils/src/zfs.rs
Comment on lines +1623 to +1625
crate::ExecutionError::CommandFailure(info)
if info.stderr.contains("does not exist") =>
{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i kinda wonder if we ought to have an execute_async wrapper for zfs commands, or something, that does this detection and returns either an ExecutionError or a DatasetNotFound, rather than forcing every caller to detect it individually? the caller-side detection seems sketchier to me, because now you have to remember to check for this contents in info.stderr every time you call a zfs command. meanwhile, having the function that runs the command do it gives the caller an enum which they are forced to handle.

take it or leave it, just a thought.

Comment thread illumos-utils/src/zfs.rs
Comment on lines +1635 to +1638
values: name_values
.iter()
.map(|(k, v)| format!("{k}={v}"))
.join(","),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are now doing this twice, btw --- you have a

let values =
    name_values.iter().map(|(k, v)| format!("{k}={v}")).join(",");

at the top of the map_err closure...

Suggested change
values: name_values
.iter()
.map(|(k, v)| format!("{k}={v}"))
.join(","),
values,

Comment on lines +1553 to +1557
RemoveReservationErrorInner::DatasetNotFound => {
// If the parent dataset is no longer found, then a
// concurrent deletion occurred. Return Ok
return Ok(());
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, is this really the only place we might encounter a dataset that was already removed by a concurrent deletion? should we also handle this error at the point where we actually delete the child volume dataset, too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unable to delete faulted disk from console during test_local_storage_scale

3 participants