logging: do not panic in the event of a failure to log a message by hawkw · Pull Request #1610 · oxidecomputer/dropshot

hawkw · 2026-05-22T20:55:43Z

This is a quite sad situation. Currently, the logger configuration produced by dropshot::ConfigLogging will panic whenever an error writing a log line occurs. This crashes the entire process, which is often very much not what you want to have happen. Therefore, this branch changes the behavior to ignore errors returned here, resulting in the log line being silently eaten. This is also a behavior that I will be the first to admit is quite sad, but I believe it to be a less sad default than crashing the entire process.

As an example of why, consider the Oxide rack's sled-agent. The sled-agent will currently crash when it attempts to log something and the root filesystem to which it attempts to log is full, as described in oxidecomputer/omicron#4354. This is quite bad, as if we were to implement some way of freeing up additional storage space in such a situation, it is likely the sled-agent that would be responsible for doing so!

In an ideal world, we might wish for different behavior based on the particular error that occurred. We may wish to panic on errors to serialize a log line (as that would be indicative of a programmer error), while we may wish to not panic on I/O errors. Sadly, slog's current interfaces make this challenging, as all errors from the loggers we use are presently coerced to std::io::Error, making it difficult to distinguish between I/O and serialization errors in a non-flaky way.

In the future, we might want to change the behavior from silently dropping log lines to instead recording when we have done so, perhaps by maintaining a count of logging errors. This could be broken down by error kind, and could also track the timestamp of the last time a log message was lost. Reporting these error counts through other means, such as timeseries metrics, would allow the operator of the service to at least know that their log is incomplete. We could even imagine having a thingy that tracks if we have lost log messages, and once a log message is written successfully, tries to log a "hey, by the way, we also dropped however many log lines over the last however long!".
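As a rough illustration of the "count the dropped lines" idea, here is a minimal std-only sketch. It is not part of this PR and does not use slog: `DropStats`, `CountingWriter`, and `FailingWriter` are hypothetical names invented for this example, and it counts failed write calls on an `io::Write` rather than failed log records, which is an assumption about where such a hook might live.

```rust
use std::io::{self, Write};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::time::SystemTime;

/// Hypothetical shared record of log writes that were lost.
#[derive(Debug, Default)]
pub struct DropStats {
    dropped: AtomicU64,
    /// Kind and time of the most recent failure, if any.
    last: Mutex<Option<(io::ErrorKind, SystemTime)>>,
}

impl DropStats {
    pub fn dropped(&self) -> u64 {
        self.dropped.load(Ordering::Relaxed)
    }

    pub fn last_error_kind(&self) -> Option<io::ErrorKind> {
        self.last.lock().unwrap().map(|(kind, _)| kind)
    }
}

/// Hypothetical wrapper: forwards writes to `inner`. When a write fails,
/// it records the failure and reports success, so the caller never sees
/// the error (and therefore has nothing to panic on).
pub struct CountingWriter<W> {
    inner: W,
    stats: Arc<DropStats>,
}

impl<W: Write> CountingWriter<W> {
    pub fn new(inner: W, stats: Arc<DropStats>) -> Self {
        CountingWriter { inner, stats }
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if let Err(e) = self.inner.write_all(buf) {
            self.stats.dropped.fetch_add(1, Ordering::Relaxed);
            *self.stats.last.lock().unwrap() = Some((e.kind(), SystemTime::now()));
        }
        // Claim the whole buffer was consumed either way.
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        // Flush failures are swallowed too; they are not counted here.
        let _ = self.inner.flush();
        Ok(())
    }
}

/// Stand-in for a full filesystem: every write fails.
pub struct FailingWriter;

impl Write for FailingWriter {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "no space left on device"))
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

Something else in the process (a metrics endpoint, say) could hold a clone of the `Arc<DropStats>` and report the count, so that an operator can at least tell the log is incomplete.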

We might also want to make the panicking behavior configurable. This is sadly somewhat more annoying than it ought to be, as slog::Drain::fuse() and slog::Drain::ignore_res() change the type of the drain, requiring duplicate code paths that construct almost-but-not-entirely identical loggers. But, this would let us have a config which, say, panics in tests but does not panic in production, which seems like a reasonable thing to want.
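To make the "changes the type of the drain" problem concrete, here is a toy model. These are not slog's real definitions: the `Drain` trait below is a simplified stand-in (slog's trait also has an `Ok` associated type and takes a record rather than a string), and `OnError`, `build`, and `Broken` are hypothetical names. The sketch shows one possible way to unify the two wrappers behind a single type by boxing, at the cost of dynamic dispatch.

```rust
use std::fmt::Debug;
use std::io;

/// Simplified stand-in for a logging drain (NOT slog's actual trait).
pub trait Drain {
    type Err;
    fn log(&self, line: &str) -> Result<(), Self::Err>;
}

/// Uninhabited error type: a drain with `Err = Never` cannot fail.
pub enum Never {}

/// Panics if the wrapped drain returns an error.
pub struct Fuse<D>(pub D);

/// Discards any error from the wrapped drain.
pub struct IgnoreResult<D>(pub D);

impl<D: Drain> Drain for Fuse<D>
where
    D::Err: Debug,
{
    type Err = Never;
    fn log(&self, line: &str) -> Result<(), Never> {
        self.0.log(line).expect("logging failed");
        Ok(())
    }
}

impl<D: Drain> Drain for IgnoreResult<D> {
    type Err = Never;
    fn log(&self, line: &str) -> Result<(), Never> {
        let _ = self.0.log(line);
        Ok(())
    }
}

/// Hypothetical configuration knob.
pub enum OnError {
    Panic,
    Ignore,
}

/// `Fuse<D>` and `IgnoreResult<D>` are different types, so the two match
/// arms can only share a return type through a trait object.
pub fn build<D>(drain: D, on_error: OnError) -> Box<dyn Drain<Err = Never>>
where
    D: Drain + 'static,
    D::Err: Debug,
{
    match on_error {
        OnError::Panic => Box::new(Fuse(drain)),
        OnError::Ignore => Box::new(IgnoreResult(drain)),
    }
}

/// Stand-in for a drain whose backing store is full.
pub struct Broken;

impl Drain for Broken {
    type Err = io::Error;
    fn log(&self, _line: &str) -> Result<(), io::Error> {
        Err(io::Error::new(io::ErrorKind::Other, "no space left on device"))
    }
}
```

With this shape, `build(Broken, OnError::Ignore).log("hello")` returns `Ok(())`, while the `OnError::Panic` variant panics on the same call. Whether boxing is acceptable for the real loggers is a separate question that this sketch does not answer.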

Nonetheless, this commit takes the shortest path to not panicking, and just turns all the fuse()s into ignore_res()es. I think the present behavior is bad enough in production that the quick fix feels fairly justified. We should consider making this nicer in the future though.

Copilot

Pull request overview

Adjusts Dropshot’s default slog-based logging configuration to avoid panicking the entire process when a log write fails (e.g., transient I/O errors like ENOSPC), trading hard-fail behavior for dropping the affected log line(s).

Changes:

Replaces fuse() with ignore_res() for terminal and file drains to prevent panics on log write errors.
Applies ignore_res() to the async drain so async logging failures don’t crash the process.
Updates the file-drain helper return type to reflect the new drain wrapper (IgnoreResult).


hawkw · 2026-05-22T21:03:03Z

(as an aside, I feel like the fancier potential solutions, such as tracking logger errors, are probably best saved until the ConfigLogging stuff finds a new home in its own repo, as @davepacheco proposes in #1607 (comment))

hawkw requested review from ahl, Copilot, davepacheco and jmcarp May 22, 2026 20:55

Copilot started reviewing on behalf of hawkw May 22, 2026 20:55

Copilot AI reviewed May 22, 2026

Comment thread dropshot/src/logging.rs Outdated

Comment thread dropshot/src/logging.rs Outdated

hawkw force-pushed the eliza/fuse-considered-harmful branch from 770ccc0 to a68c7e1 Compare May 22, 2026 20:59

hawkw and others added 2 commits May 22, 2026 14:00

I DIDN'T MEAN TO CLICK THE GODDAMN COPILOT THING BUT IT FOUDN SOME TYPOS

a64ff7d

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

rustfmt

ee5793b

hawkw wants to merge 3 commits into main from eliza/fuse-considered-harmful
