Skip to content

logging: do not panic in the event of a failure to log a message#1610

Open
hawkw wants to merge 3 commits into
mainfrom
eliza/fuse-considered-harmful
Open

logging: do not panic in the event of a failure to log a message#1610
hawkw wants to merge 3 commits into
mainfrom
eliza/fuse-considered-harmful

Conversation

@hawkw
Copy link
Copy Markdown
Member

@hawkw hawkw commented May 22, 2026

This is a quite sad situation. Currently, the logger configuration produced by dropshot::ConfigLogging will panic whenever an error writing a log line occurs. This crashes the entire process, which is often very much not what you want to have happen. Therefore, this branch changes the behavior to ignore errors returned here, resulting in the log line being silently eaten. This is also a behavior that I will be the first to admit is quite sad, but I believe it to be a less sad default than crashing the entire process.

As an example of why, consider the Oxide rack's sled-agent. The sled-agent will currently crash when it attempts to log something and the root filesystem to which it attempts to log is full, as described in oxidecomputer/omicron#4354. This is quite bad, as if we were to implement some way of freeing up additional storage space in such a situation, it is likely the sled-agent that would be responsible for doing so!

In an ideal world, we might wish for different behavior based on the particular error that occurred. We may wish to panic on errors to serialize a log line (as that would be indicative of a programmer error), while we may wish to not panic on I/O errors. Sadly, slog's current interfaces make this challenging, as all errors from the loggers we use, are presently coerced to std::io::Error, making it difficult to distinguish between I/O and serialization errors in a non-flaky way.

In the future, we might want to change the behavior from silently dropping log lines to instead recording when we have done so, perhaps by maintaining a count of logging errors. This could be broken down by error kinds, and track the timestamp of the last time a log message was lost. Reporting could report these error counts through other means, such as timeseries metrics, would allow the operator of the service to at least know that their log is incomplete. We could even imagine having a thingy that tracks if we have lost log messages, and once a log message is written successfully, tries to log a "hey, by the way, we also dropped however many log lines over the last however long!".

We might also want to make the panciking behavior configurable. This is sadly somewhat more annoying than it ought to be, as slog::Drain::fuse() and slog::Drain::ignore_res() change the type of the drain, requiring duplicate code paths that construct almost-but-not-entirely identical loggers. But, this would let us have a config which, say, panics in tests but does not panic in production, which seems like a reasonable thing to want.

Nonetheless, this commit takes the shortest path to not panicking, and just turns all the fuse()s into ignore_res()es. I think the present behavior is bad enough in production that the quick fix feels fairly justified. We should consider making this nicer in the future though.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts Dropshot’s default slog-based logging configuration to avoid panicking the entire process when a log write fails (e.g., transient I/O errors like ENOSPC), trading hard-fail behavior for dropping the affected log line(s).

Changes:

  • Replaces fuse() with ignore_res() for terminal and file drains to prevent panics on log write errors.
  • Applies ignore_res() to the async drain so async logging failures don’t crash the process.
  • Updates the file-drain helper return type to reflect the new drain wrapper (IgnoreResult).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dropshot/src/logging.rs Outdated
Comment thread dropshot/src/logging.rs Outdated
This is a quite sad situation. Currently, the logger configuration
produced by `dropshot::ConfigLogging` will panic whenever an error
writing a log line occurs. This crashes the entire process, which is
often very much not what you want to have happen. Therefore, this branch
changes the behavior to ignore errors returned here, resulting in the
log line being silently eaten. This is *also* a behavior that I will be
the first to admit is quite sad, but I believe it to be a less sad
default than crashing the entire process.

As an example of why, consider the Oxide rack's sled-agent. The
sled-agent will currently crash when it attempts to log something and
the root filesystem to which it attempts to log is full, as described in
oxidecomputer/omicron#4354. This is quite bad, as if we were to
implement some way of freeing up additional storage space in such a
situation, it is likely the sled-agent that would be responsible for
doing so!

In an ideal world, we might wish for different behavior based on the
particular error that occurred. We may wish to panic on errors to
serialize a log line (as that would be indicative of a programmer
error), while we may wish to not panic on I/O errors. Sadly, slog's
current interfaces make this challenging, as all errors from the loggers
we use, are presently coerced to `std::io::Error`, making it difficult
to distinguish between I/O and serialization errors in a non-flaky way.

In the future, we might want to change the behavior from *silently*
dropping log lines to instead recording when we have done so, perhaps by
maintaining a count of logging errors. This could be broken down by
error kinds, and track the timestamp of the last time a log message was
lost. Reporting could report these error counts through other means,
such as timeseries metrics, would allow the operator of the service to
at least *know* that their log is incomplete. We could even imagine
having a thingy that tracks if we have lost log messages, and once a log
message is written successfully, tries to log a "hey, by the way, we
also dropped however many log lines over the last however long!".

We might also want to make the panciking behavior configurable. This is
sadly somewhat more annoying than it ought to be, as
`slog::Drain::fuse()` and `slog::Drain::ignore_res()` change the _type_
of the drain, requiring duplicate code paths that construct
almost-but-not-entirely identical loggers. But, this would let us have a
config which, say, panics in tests but does not panic in production,
which seems like a reasonable thing to want.

Nonetheless, this commit takes the shortest path to not panicking, and
just turns all the `fuse()`s into `ignore_res()`es. I think the present
behavior is bad enough in production that the quick fix feels fairly
justified. We should consider making this nicer in the future though.
@hawkw hawkw force-pushed the eliza/fuse-considered-harmful branch from 770ccc0 to a68c7e1 Compare May 22, 2026 20:59
hawkw and others added 2 commits May 22, 2026 14:00
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@hawkw
Copy link
Copy Markdown
Member Author

hawkw commented May 22, 2026

(as an aside, I feel like the fancier potential solutions, such as tracking logger errors, are probably best saved until the ConfigLogging stuff finds a new home in its own repo, as @davepacheco proposes in #1607 (comment))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants