logging: do not panic in the event of a failure to log a message#1610
Open
hawkw wants to merge 3 commits into
Open
logging: do not panic in the event of a failure to log a message#1610hawkw wants to merge 3 commits into
hawkw wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adjusts Dropshot’s default slog-based logging configuration to avoid panicking the entire process when a log write fails (e.g., transient I/O errors like ENOSPC), trading hard-fail behavior for dropping the affected log line(s).
Changes:
- Replaces
fuse()withignore_res()for terminal and file drains to prevent panics on log write errors. - Applies
ignore_res()to the async drain so async logging failures don’t crash the process. - Updates the file-drain helper return type to reflect the new drain wrapper (
IgnoreResult).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This is a quite sad situation. Currently, the logger configuration produced by `dropshot::ConfigLogging` will panic whenever an error writing a log line occurs. This crashes the entire process, which is often very much not what you want to have happen. Therefore, this branch changes the behavior to ignore errors returned here, resulting in the log line being silently eaten. This is *also* a behavior that I will be the first to admit is quite sad, but I believe it to be a less sad default than crashing the entire process. As an example of why, consider the Oxide rack's sled-agent. The sled-agent will currently crash when it attempts to log something and the root filesystem to which it attempts to log is full, as described in oxidecomputer/omicron#4354. This is quite bad, as if we were to implement some way of freeing up additional storage space in such a situation, it is likely the sled-agent that would be responsible for doing so! In an ideal world, we might wish for different behavior based on the particular error that occurred. We may wish to panic on errors to serialize a log line (as that would be indicative of a programmer error), while we may wish to not panic on I/O errors. Sadly, slog's current interfaces make this challenging, as all errors from the loggers we use, are presently coerced to `std::io::Error`, making it difficult to distinguish between I/O and serialization errors in a non-flaky way. In the future, we might want to change the behavior from *silently* dropping log lines to instead recording when we have done so, perhaps by maintaining a count of logging errors. This could be broken down by error kinds, and track the timestamp of the last time a log message was lost. Reporting could report these error counts through other means, such as timeseries metrics, would allow the operator of the service to at least *know* that their log is incomplete. We could even imagine having a thingy that tracks if we have lost log messages, and once a log message is written successfully, tries to log a "hey, by the way, we also dropped however many log lines over the last however long!". We might also want to make the panciking behavior configurable. This is sadly somewhat more annoying than it ought to be, as `slog::Drain::fuse()` and `slog::Drain::ignore_res()` change the _type_ of the drain, requiring duplicate code paths that construct almost-but-not-entirely identical loggers. But, this would let us have a config which, say, panics in tests but does not panic in production, which seems like a reasonable thing to want. Nonetheless, this commit takes the shortest path to not panicking, and just turns all the `fuse()`s into `ignore_res()`es. I think the present behavior is bad enough in production that the quick fix feels fairly justified. We should consider making this nicer in the future though.
770ccc0 to
a68c7e1
Compare
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Member
Author
|
(as an aside, I feel like the fancier potential solutions, such as tracking logger errors, are probably best saved until the |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a quite sad situation. Currently, the logger configuration produced by
dropshot::ConfigLoggingwill panic whenever an error writing a log line occurs. This crashes the entire process, which is often very much not what you want to have happen. Therefore, this branch changes the behavior to ignore errors returned here, resulting in the log line being silently eaten. This is also a behavior that I will be the first to admit is quite sad, but I believe it to be a less sad default than crashing the entire process.As an example of why, consider the Oxide rack's sled-agent. The sled-agent will currently crash when it attempts to log something and the root filesystem to which it attempts to log is full, as described in oxidecomputer/omicron#4354. This is quite bad, as if we were to implement some way of freeing up additional storage space in such a situation, it is likely the sled-agent that would be responsible for doing so!
In an ideal world, we might wish for different behavior based on the particular error that occurred. We may wish to panic on errors to serialize a log line (as that would be indicative of a programmer error), while we may wish to not panic on I/O errors. Sadly, slog's current interfaces make this challenging, as all errors from the loggers we use, are presently coerced to
std::io::Error, making it difficult to distinguish between I/O and serialization errors in a non-flaky way.In the future, we might want to change the behavior from silently dropping log lines to instead recording when we have done so, perhaps by maintaining a count of logging errors. This could be broken down by error kinds, and track the timestamp of the last time a log message was lost. Reporting could report these error counts through other means, such as timeseries metrics, would allow the operator of the service to at least know that their log is incomplete. We could even imagine having a thingy that tracks if we have lost log messages, and once a log message is written successfully, tries to log a "hey, by the way, we also dropped however many log lines over the last however long!".
We might also want to make the panciking behavior configurable. This is sadly somewhat more annoying than it ought to be, as
slog::Drain::fuse()andslog::Drain::ignore_res()change the type of the drain, requiring duplicate code paths that construct almost-but-not-entirely identical loggers. But, this would let us have a config which, say, panics in tests but does not panic in production, which seems like a reasonable thing to want.Nonetheless, this commit takes the shortest path to not panicking, and just turns all the
fuse()s intoignore_res()es. I think the present behavior is bad enough in production that the quick fix feels fairly justified. We should consider making this nicer in the future though.