[inventory] Add stale saga information #10027

Open

karencfv wants to merge 20 commits into oxidecomputer:main from karencfv:stale-saga-info-in-inventory

Conversation

@karencfv
Contributor

@karencfv karencfv commented Mar 11, 2026

As per #9876, this PR adds information about stale sagas to inventory: any saga that has been in a running or unwinding state for longer than 15 minutes.

The 15-minute cutoff is arbitrary: the update status endpoint will only consider a saga older than 1 hour a health concern, but it would be useful to store data on sagas that have been in one of these two states for a shorter time, for debugging purposes. I can change this threshold if people don't agree, though.

A stale saga is present:

$ ./target/debug/omdb db inventory collections show latest
<...>
STALE SAGAS
    Found 1 running or unwinding for longer than 15 minutes
        NAME     ID                                       STATE       CREATOR                                  CURRENT_SEC                              TIME_CREATED                   
        demo     457c4832-1a87-4bd3-a6d1-0e98105b1a5c     running     a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f     a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f     2026-04-13 07:39:27.357491 UTC 

Everything is happy:

$ ./target/debug/omdb db inventory collections show latest

<...>
STALE SAGAS
    No sagas have been running or unwinding for longer than 15 minutes

Closes: #10287

@karencfv karencfv marked this pull request as ready for review April 14, 2026 07:29
@karencfv
Contributor Author

karencfv commented Apr 14, 2026

From the build-and-test (helios) job https://buildomat.eng.oxide.computer/wg/0/details/01KP5DY5F65D03T0G7EHK3D0VX/3HTS7wQUErwzoUUVG6juJt1SdvKCVqdRoL8vOvWb0bSdknVd/01KP5DYFQWBTSMENHD0BFYC31X#S6677

6677	2026-04-14T07:40:41.181Z	error: creating test list failed
6678	2026-04-14T07:40:41.181Z	
6679	2026-04-14T07:40:41.181Z	Caused by:
6680	2026-04-14T07:40:41.181Z	  for `nexus-fm`, command `/work/oxidecomputer/omicron/target/debug/deps/nexus_fm-618b2e4b6bedc71d --list --format terse` aborted with signal 9 (SIGKILL)
6681	2026-04-14T07:40:41.181Z	  --- stdout:
6682	2026-04-14T07:40:41.181Z	
6683	2026-04-14T07:40:41.181Z	  --- stderr:
6684	2026-04-14T07:40:41.181Z	  ld.so.1: nexus_fm-618b2e4b6bedc71d: fatal: libpq.so.5: open failed: No such file or directory
6685	2026-04-14T07:40:41.181Z	
6686	2026-04-14T07:40:41.181Z	  ---

I am legitimately stumped by this. I didn't even touch that crate. The build-and-test (ubuntu-22.04) tests run fine, and the problem isn't showing on main either!!

@karencfv
Contributor Author

@davepacheco @jgallagher tiny ping for a review please?

Collaborator

@davepacheco davepacheco left a comment


I feel terrible about this, but seeing the concrete change does give me some serious doubts about the direction that I suggested before. I have a bunch of specific comments inline but also some bigger questions that might preempt those.

First, I'm very surprised how many packages required the pq-sys treatment and it includes a bunch of stuff that really ought never to talk to the database (e.g., nexus-reconfigurator-simulation). As much as I hate using traits for dependency injection, in this case, I'd be inclined to at least do the same thing we did with the SledAgentEnumerator -- put the database bit behind a trait so that most consumers of nexus/inventory don't have to do all this. (That does make me wonder though why there are so many consumers of nexus-inventory. If I remember right, the types are in nexus-types, so the only consumers should be people building or collecting their own inventories. It seems like that should be very few?)

Second, the inventory is encoding some pretty arbitrary judgment about what makes a saga stuck (i.e., that it's been running for 15 minutes). That's a departure from most of inventory today. It feels like that judgment belongs in the consumer that's deciding what's okay. The obvious way to do that would be to include all sagas here, but that does seem like a lot of extra work when we hope that in almost all cases they're going to be ignored.

Third, seeing that we're literally reading from the database and inserting the same data into a second table in the database doesn't feel awesome. I'm still not sure it's wrong. One alternative we discussed is to query Nexus instances. But that's tricky because new Nexus instances can be created (how do you know you didn't miss those?) and old ones can be expunged. When they get expunged, their sagas could move from one instance to another. With all that, I'm not sure how to query them all in a way where you'd know for sure that if there was a long-running saga, then you definitely saw it.


This gets me to wondering whether we shouldn't do this check as a separate source of information at the point where we're doing a health check? In other words, when doing a pre-upgrade health check, we already plan to fetch an inventory collection, compare it to the blueprint, etc. Maybe we should just fetch running sagas from the database at that point and include any long-running ones in the check, separate from inventory?

Edit: Sorry in advance if my own concerns have us change direction from the one I initially suggested!

Comment thread end-to-end-tests/build.rs
// See omicron-rpaths for documentation.
// NOTE: This file MUST be kept in sync with the other build.rs files in this
// repository.
fn main() {
Collaborator


I gather that a whole bunch of packages needed the pq-sys/rpaths/build.rs treatment:

omicron-sled-agent
omicron-test-utils
tqdb
end-to-end-tests
nexus-reconfigurator-blippy
nexus-fm
nexus-reconfigurator-planning
nexus-reconfigurator-simulation

But I'm not clear on why. My understanding is that you'd need this if you added dependencies on nexus-db-queries, which would happen for things that depended on nexus-inventory before. But omicron-sled-agent doesn't have a dependency on nexus-inventory.

Similarly, omicron-test-utils doesn't seem to have any other dependency on pq-sys:

 $ cargo tree -i pq-sys -p omicron-test-utils
pq-sys v0.4.6 (https://github.com/oxidecomputer/pq-sys?branch=oxide%2Fomicron#b1194c19)
└── omicron-test-utils v0.1.0 (/home/dap/omicron-review/test-utils)

and end-to-end-tests doesn't seem to have any that aren't one of those two:

$ cargo tree -i pq-sys -p end-to-end-tests
pq-sys v0.4.6 (https://github.com/oxidecomputer/pq-sys?branch=oxide%2Fomicron#b1194c19)
├── end-to-end-tests v0.1.0 (/home/dap/omicron-review/end-to-end-tests)
├── omicron-sled-agent v0.1.0 (/home/dap/omicron-review/sled-agent)
│   └── end-to-end-tests v0.1.0 (/home/dap/omicron-review/end-to-end-tests)
└── omicron-test-utils v0.1.0 (/home/dap/omicron-review/test-utils)
    └── end-to-end-tests v0.1.0 (/home/dap/omicron-review/end-to-end-tests)

tqdb is admittedly more complicated... it seems to be pulling in nexus-db-queries through this path:

$ cargo tree -i pq-sys -p tqdb 
pq-sys v0.4.6 (https://github.com/oxidecomputer/pq-sys?branch=oxide%2Fomicron#b1194c19)
├── diesel v2.3.7
│   ├── async-bb8-diesel v0.2.1
│   │   ├── nexus-db-lookup v0.1.0 (/home/dap/omicron-review/nexus/db-lookup)
│   │   │   └── nexus-db-queries v0.1.0 (/home/dap/omicron-review/nexus/db-queries)
│   │   │       └── nexus-inventory v0.1.0 (/home/dap/omicron-review/nexus/inventory)
│   │   │           ├── nexus-reconfigurator-planning v0.1.0 (/home/dap/omicron-review/nexus/reconfigurator/planning)
│   │   │           │   ├── nexus-reconfigurator-simulation v0.1.0 (/home/dap/omicron-review/nexus/reconfigurator/simulation)
│   │   │           │   │   └── reconfigurator-cli v0.1.0 (/home/dap/omicron-review/dev-tools/reconfigurator-cli)
│   │   │           │   │       └── tqdb v0.1.0 (/home/dap/omicron-review/trust-quorum/tqdb)
...

which is pretty unfortunate but I'm not sure how hard it is to avoid.

In general, it's surprising from first principles that several of these would require database access (like the nexus-reconfigurator packages).

Contributor Author


Ok, so this might just be me misunderstanding the instructions in rpaths/src/lib.rs. It says:

  1. Any crate that depends on pq-sys, directly or not, needs to follow these
    instructions. Generally, we depend on pq-sys indirectly, by virtue of
    depending on Diesel.

I took that to mean any crate on the dependency tree. So everything I get when running cargo tree -i nexus-inventory. Judging by your reaction perhaps I was wrong?

Comment thread uuid-kinds/src/lib.rs
Comment on lines +83 to +84
SagaCreator = {},
SagaSec = {},
Collaborator


I'm not sure these are really different, are they?

Comment thread schema/crdb/dbinit.sql
Comment on lines +5104 to +5107
creator UUID NOT NULL,
current_sec UUID,
name TEXT NOT NULL,
state omicron.public.stale_saga_state NOT NULL,
Collaborator


A bunch of this duplicates information from the saga table, and some of it (like current_sec and state) could become out of date. Maybe we should just have consumers JOIN against the saga table (or just fetch from it directly) when they want more information about a stale saga.
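A minimal sketch of that consumer-side JOIN, assuming a hypothetical inventory table name (`inv_stale_saga`) and key columns for illustration; the PR's actual schema may differ:

```sql
-- Hypothetical: resolve the current state of each saga recorded in a
-- collection by joining back to the authoritative saga table, instead of
-- trusting the duplicated creator/current_sec/state columns.
SELECT s.id, s.name, s.saga_state, s.current_sec
  FROM omicron.public.saga AS s
  JOIN omicron.public.inv_stale_saga AS iss ON iss.id = s.id
 WHERE iss.inv_collection_id = $1;
```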

Contributor Author


I agree that this information could become out of date. But isn't the point of the inventory to capture a snapshot of the state of the system at a given time? That was my reasoning behind creating this table.

Comment thread schema/crdb/dbinit.sql
Comment on lines +2870 to +2877
/*
* For listing sagas by state (e.g., finding all running and unwinding sagas
* across all SECs). We need to paginate this list by the id.
*/
CREATE INDEX IF NOT EXISTS lookup_saga_by_state ON omicron.public.saga (
saga_state, id
);

Collaborator


First, this almost certainly wants to be limited to WHERE saga_state = 'running' OR saga_state = 'unwinding', right? Otherwise, this will grow unbounded and include all the 'done' sagas that we don't care about here because they're not stuck.

Now, that's almost the same condition as the previous index. In principle, I think it should be the same condition. The difference right now is that that index includes abandoned sagas. But I don't think that it should. I'm guessing that was an oversight when we added the abandon state.

So maybe the change to make here is to update the lookup_saga_by_sec index to include AND saga_state != 'abandoned'? We'd probably want to double-check that any queries using saga_state will keep using this index.
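For concreteness, the partial index described above might look like this (CockroachDB supports partial indexes; this mirrors the suggested condition and is a sketch, not the PR's final change):

```sql
-- Only running/unwinding sagas are indexed, so the index stays small
-- instead of growing with every 'done' saga.
CREATE INDEX IF NOT EXISTS lookup_saga_by_state ON omicron.public.saga (
    saga_state, id
) WHERE saga_state = 'running' OR saga_state = 'unwinding';
```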

Zone 587da9e8-8fc0-4854-b585-070741a7b00d: Internal DNS generation @ 1
Zone ffbf02f0-261d-4723-b613-eb861245acbd: Internal DNS generation @ 1

STALE SAGAS
Collaborator


General note: "stale" doesn't seem like the right word for this. They seem "stuck" at worst. "Long-running" is more precise, if longer.

Contributor Author


Argh, yeah. Initially I had called them long-running sagas, but since I included the ones in "unwinding" state in addition to the ones in "running" state, I thought the name could be misleading.

pub time_collected: DateTime<Utc>,
}

impl IdOrdItem for InventoryStaleSaga {
Collaborator


This doesn't seem correct. An iddqd::IdOrdMap<T> is analogous to a std::collections::BTreeMap<<T as IdOrdItem>::Key, T>. In this case, this is analogous to a BTreeMap<SagaCreatorUuid, InventoryStaleSaga>. That means we're only tracking at most one stuck saga for each Nexus instance. If that's what you meant, I would use a BTreeMap. But I'm guessing you meant to track all of them, in which case the key should be the saga_id: SagaUuid, not the creator: SagaCreatorUuid.
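The keying difference can be demonstrated with plain BTreeMaps (simplified stand-in types for illustration, not the PR's actual InventoryStaleSaga):

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-in for InventoryStaleSaga: only the two
// fields relevant to the keying question.
#[derive(Debug, Clone)]
struct StaleSaga {
    saga_id: u32,
    creator: u32,
}

// Keying by creator: at most one stuck saga survives per Nexus instance.
fn by_creator(sagas: &[StaleSaga]) -> BTreeMap<u32, StaleSaga> {
    sagas.iter().cloned().map(|s| (s.creator, s)).collect()
}

// Keying by saga id: every stuck saga is retained.
fn by_saga_id(sagas: &[StaleSaga]) -> BTreeMap<u32, StaleSaga> {
    sagas.iter().cloned().map(|s| (s.saga_id, s)).collect()
}

fn main() {
    let sagas = vec![
        StaleSaga { saga_id: 1, creator: 7 },
        StaleSaga { saga_id: 2, creator: 7 }, // same Nexus, second stuck saga
    ];
    // The creator-keyed map silently drops one of the two sagas.
    assert_eq!(by_creator(&sagas).len(), 1);
    // The saga-id-keyed map keeps both.
    assert_eq!(by_saga_id(&sagas).len(), 2);
    println!("ok");
}
```

Keying by creator keeps at most one stuck saga per Nexus; keying by saga id preserves them all, which matches the intent described above.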

if stale_sagas.is_empty() {
writeln!(
f,
"No sagas have been running or unwinding for longer than 15 minutes"
Collaborator


I'm not sure it's worth fixing this but hardcoding the "15 minutes" here seems wrong. For one, if we changed the value, that'd be reflected in newer inventory collections but not older ones. Whether we updated this text or not, it would be wrong sometimes.

Better would be to include the limit in the inventory collection, but admittedly that seems like more trouble than it's worth.

Another option would be to not filter at all. Include all running sagas here. The tool could filter at runtime if it wanted.

Thoughts?

Contributor Author


Will discuss IRL tomorrow, but yeah, the more I think about it the more I want to just include all sagas that are either in a running or unwinding state.

{
Ok(sagas) => sagas,
Err(e) => {
self.in_progress.found_error(InventoryError::from(anyhow!(e)));
Collaborator


Probably want to add .context so that the error reflects that this happened while listing sagas.


let time_collected = Utc::now();
for saga in sagas.into_iter().filter(|saga| {
(time_collected - saga.time_created) > STALE_SAGA_THRESHOLD
Collaborator


Any reason not to put this into the query instead?

Contributor Author


My intention was to create an API to query the DB that could be reusable for other purposes, but yeah, it's more wasteful this way.

let state = match saga.saga_state.try_into() {
Ok(state) => state,
Err(msg) => {
error!(
Collaborator


I'd add a comment that this should be impossible (presumably).

Contributor Author

@karencfv karencfv left a comment


Thanks for the thorough review @davepacheco! We'll be chatting about this at the sync tomorrow, but I want to leave some thoughts so I don't forget.

> First, I'm very surprised how many packages required the pq-sys treatment and it includes a bunch of stuff that really ought never to talk to the database (e.g., nexus-reconfigurator-simulation). As much as I hate using traits for dependency injection, in this case, I'd be inclined to at least do the same thing we did with the SledAgentEnumerator -- put the database bit behind a trait so that most consumers of nexus/inventory don't have to do all this. (That does make me wonder though why there are so many consumers of nexus-inventory. If I remember right, the types are in nexus-types, so the only consumers should be people building or collecting their own inventories. It seems like that should be very few?)

This may be on me; I may have taken the instructions from rpaths/src/lib.rs too literally 😅 I wrote more on this in the comment itself.

> Second, the inventory is encoding some pretty arbitrary judgment about what makes a saga stuck (i.e., that it's been running for 15 minutes). That's a departure from most of inventory today. It feels like that judgment belongs in the consumer that's deciding what's okay. The obvious way to do that would be to include all sagas here, but that does seem like a lot of extra work when we hope that in almost all cases they're going to be ignored.

Fair, it makes sense to get rid of that threshold. My initial thought here is that perhaps we don't have to include all sagas. In a similar way to SMF services "enabled not online", we could include sagas only in certain states; namely, "running" and "unwinding". Aside from helping us determine if a saga may be stuck or not as part of these health checks, that list could serve as a snapshot in time for later debugging. For example, it can help us spot patterns in usage or any anomalies. I don't think saving this information would be detrimental.

> Third, seeing that we're literally reading from the database and inserting the same data into a second table in the database doesn't feel awesome. I'm still not sure it's wrong. One alternative we discussed is to query Nexus instances. But that's tricky because new Nexus instances can be created (how do you know you didn't miss those?) and old ones can be expunged. When they get expunged, their sagas could move from one instance to another. With all that, I'm not sure how to query them all in a way where you'd know for sure that if there was a long-running saga, then you definitely saw it.

I agree that querying Nexus instances directly could create more problems and be less reliable for the reasons you mention. On the commentary about this approach, the data we're inserting is the same in one sense but different in another. A saga row in the saga table can be updated, so we lose information about its state and current SEC at a given moment in time. Without a dedicated table for inventory as a snapshot in time, we can no longer answer "which saga was running for more than n minutes?". Having a snapshot of running/unwinding sagas per collection also gives us a time series: we can see a saga show up running at T1, T2, T3 and know it was stuck for hours rather than minutes. That's valuable for both debugging and spotting performance issues.

> This gets me to wondering whether we shouldn't do this check as a separate source of information at the point where we're doing a health check? In other words, when doing a pre-upgrade health check, we already plan to fetch an inventory collection, compare it to the blueprint, etc. Maybe we should just fetch running sagas from the database at that point and include any long-running ones in the check, separate from inventory?

For the reason above I would prefer to record saga information. That said, I agree that the threshold is completely unnecessary and it makes sense to me to record all sagas in either running/unwinding state.



.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))
}

pub async fn saga_list_by_states_batched(
Contributor Author


Fair 🤔





@karencfv
Contributor Author

Forgot to add that when we restructured the health checks in #9876 we agreed that all of the information would be recorded in inventory as the source of truth. Any checks that one client does could be replicated by another client simply by looking at inventory. This is relevant because support will eventually only have the information included in the support bundles. We want to be able to run the same health checks from a support bundle.

@karencfv
Contributor Author

Update: we chatted about this at the sync today and decided a few things:

  • We won't record any saga information for inventory
  • The checks will be done when the API is called, directly against what is in the saga table at that time.
  • We'll log all information about health checks.
  • This information will be added to support bundles independently from the inventory collection.


Development

Successfully merging this pull request may close these issues.

Add stale saga information to DB

2 participants