[inventory] Add stale saga information #10027
karencfv wants to merge 20 commits into oxidecomputer:main
Conversation
From the build-and-test (helios) job https://buildomat.eng.oxide.computer/wg/0/details/01KP5DY5F65D03T0G7EHK3D0VX/3HTS7wQUErwzoUUVG6juJt1SdvKCVqdRoL8vOvWb0bSdknVd/01KP5DYFQWBTSMENHD0BFYC31X#S6677 I am legitimately stumped by this. I didn't even touch that crate. The build-and-test (ubuntu-22.04) tests run fine, and the problem isn't showing on main either!!

@davepacheco @jgallagher tiny ping for a review please?
I feel terrible about this, but seeing the concrete change does give me some serious doubts about the direction that I suggested before. I have a bunch of specific comments inline but also some bigger questions that might preempt those.
First, I'm very surprised how many packages required the pq-sys treatment and it includes a bunch of stuff that really ought never to talk to the database (e.g., nexus-reconfigurator-simulation). As much as I hate using traits for dependency injection, in this case, I'd be inclined to at least do the same thing we did with the SledAgentEnumerator -- put the database bit behind a trait so that most consumers of nexus/inventory don't have to do all this. (That does make me wonder though why there are so many consumers of nexus-inventory. If I remember right, the types are in nexus-types, so the only consumers should be people building or collecting their own inventories. It seems like that should be very few?)
Second, the inventory is encoding some pretty arbitrary judgment about what makes a saga stuck (i.e., that it's been running for 15 minutes). That's a departure from most of inventory today. It feels like that judgment belongs in the consumer that's deciding what's okay. The obvious way to do that would be to include all sagas here, but that does seem like a lot of extra work when we hope that in almost all cases they're going to be ignored.
Third, seeing that we're literally reading from the database and inserting the same data into a second table in the database doesn't feel awesome. I'm still not sure it's wrong. One alternative we discussed is to query Nexus instances. But that's tricky because new Nexus instances can be created (how do you know you didn't miss those?) and old ones can be expunged. When they get expunged, their sagas could move from one instance to another. With all that, I'm not sure how to query them all in a way where you'd know for sure that if there was a long-running saga, then you definitely saw it.
This gets me to wondering whether we shouldn't do this check as a separate source of information at the point where we're doing a health check? In other words, when doing a pre-upgrade health check, we already plan to fetch an inventory collection, compare it to the blueprint, etc. Maybe we should just fetch running sagas from the database at that point and include any long-running ones in the check, separate from inventory?
Edit: Sorry in advance if my own concerns have us change direction from the one I initially suggested!
// See omicron-rpaths for documentation.
// NOTE: This file MUST be kept in sync with the other build.rs files in this
// repository.
fn main() {
I gather that a whole bunch of packages needed the pq-sys/rpaths/build.rs treatment:
omicron-sled-agent
omicron-test-utils
tqdb
end-to-end-tests
nexus-reconfigurator-blippy
nexus-fm
nexus-reconfigurator-planning
nexus-reconfigurator-simulation
But I'm not clear on why. My understanding is that you'd need this if you added dependencies on nexus-db-queries, which would happen for things that depended on nexus-inventory before. But omicron-sled-agent doesn't have a dependency on nexus-inventory.
Similarly, omicron-test-utils doesn't seem to have any other dependency on pq-sys:
$ cargo tree -i pq-sys -p omicron-test-utils
pq-sys v0.4.6 (https://github.com/oxidecomputer/pq-sys?branch=oxide%2Fomicron#b1194c19)
└── omicron-test-utils v0.1.0 (/home/dap/omicron-review/test-utils)
and end-to-end-tests doesn't seem to have any that aren't one of those two:
$ cargo tree -i pq-sys -p end-to-end-tests
pq-sys v0.4.6 (https://github.com/oxidecomputer/pq-sys?branch=oxide%2Fomicron#b1194c19)
├── end-to-end-tests v0.1.0 (/home/dap/omicron-review/end-to-end-tests)
├── omicron-sled-agent v0.1.0 (/home/dap/omicron-review/sled-agent)
│ └── end-to-end-tests v0.1.0 (/home/dap/omicron-review/end-to-end-tests)
└── omicron-test-utils v0.1.0 (/home/dap/omicron-review/test-utils)
└── end-to-end-tests v0.1.0 (/home/dap/omicron-review/end-to-end-tests)
tqdb is admittedly more complicated... it seems to be pulling in nexus-db-queries through this path:
$ cargo tree -i pq-sys -p tqdb
pq-sys v0.4.6 (https://github.com/oxidecomputer/pq-sys?branch=oxide%2Fomicron#b1194c19)
├── diesel v2.3.7
│ ├── async-bb8-diesel v0.2.1
│ │ ├── nexus-db-lookup v0.1.0 (/home/dap/omicron-review/nexus/db-lookup)
│ │ │ └── nexus-db-queries v0.1.0 (/home/dap/omicron-review/nexus/db-queries)
│ │ │ └── nexus-inventory v0.1.0 (/home/dap/omicron-review/nexus/inventory)
│ │ │ ├── nexus-reconfigurator-planning v0.1.0 (/home/dap/omicron-review/nexus/reconfigurator/planning)
│ │ │ │ ├── nexus-reconfigurator-simulation v0.1.0 (/home/dap/omicron-review/nexus/reconfigurator/simulation)
│ │ │ │ │ └── reconfigurator-cli v0.1.0 (/home/dap/omicron-review/dev-tools/reconfigurator-cli)
│ │ │ │ │ └── tqdb v0.1.0 (/home/dap/omicron-review/trust-quorum/tqdb)
...
which is pretty unfortunate but I'm not sure how hard it is to avoid.
In general, it's surprising from first principles that several of these would require database access (like the nexus-reconfigurator packages).
Ok, so this might just be me misunderstanding the instructions in rpaths/src/lib.rs. It says:
- Any crate that depends on pq-sys, directly or not, needs to follow these
instructions. Generally, we depend on pq-sys indirectly, by virtue of
depending on Diesel.
I took that to mean any crate on the dependency tree. So everything I get when running cargo tree -i nexus-inventory. Judging by your reaction perhaps I was wrong?
SagaCreator = {},
SagaSec = {},
I'm not sure these are really different, are they?
creator UUID NOT NULL,
current_sec UUID,
name TEXT NOT NULL,
state omicron.public.stale_saga_state NOT NULL,
A bunch of this duplicates information from the saga table, and some of it (like current_sec and state) could become out of date. Maybe we should just have consumers JOIN against the saga table (or just fetch from it directly) when they want more information about a stale saga.
I agree that this information could become out of date. But isn't the point of the inventory to capture a snapshot of the state of the system at a given time? That was my reasoning behind creating this table.
/*
 * For listing sagas by state (e.g., finding all running and unwinding sagas
 * across all SECs). We need to paginate this list by the id.
 */
CREATE INDEX IF NOT EXISTS lookup_saga_by_state ON omicron.public.saga (
    saga_state, id
);
First, this almost certainly wants to be limited to WHERE saga_state = 'running' OR saga_state = 'unwinding', right? Otherwise, this will grow unbounded and include all the 'done' sagas that we don't care about here because they're not stuck.
Now, that's almost the same condition as the previous index. In principle, I think it should be the same condition. The difference right now is that the previous index includes abandoned sagas, but I don't think it should. I'm guessing that was an oversight when we added the abandoned state.
So maybe the change to make here is to update the lookup_saga_by_sec index to include AND saga_state != 'abandoned'? We'd probably want to double-check that any queries using saga_state will keep using this index.
Zone 587da9e8-8fc0-4854-b585-070741a7b00d: Internal DNS generation @ 1
Zone ffbf02f0-261d-4723-b613-eb861245acbd: Internal DNS generation @ 1

STALE SAGAS
General note: "stale" doesn't seem like the right word for this. They seem "stuck" at worst. "Long-running" is more precise, if longer.
Argh, yeah. Initially I had called them long-running sagas, but since I included the ones in the "unwinding" state in addition to the ones in the "running" state, I thought the name could be misleading.
pub time_collected: DateTime<Utc>,
}

impl IdOrdItem for InventoryStaleSaga {
This doesn't seem correct. An iddqd::IdOrdMap<T> is analogous to a std::collections::BTreeMap<<T as IdOrdItem>::Key, T>. In this case, this is analogous to a BTreeMap<SagaCreatorUuid, InventoryStaleSaga>. That means we're only tracking at most one stuck saga for each Nexus instance. If that's what you meant, I would use a BTreeMap. But I'm guessing you meant to track all of them, in which case the key should be the saga_id: SagaUuid, not the creator: SagaCreatorUuid.
if stale_sagas.is_empty() {
    writeln!(
        f,
        "No sagas have been running or unwinding for longer than 15 minutes"
I'm not sure it's worth fixing this but hardcoding the "15 minutes" here seems wrong. For one, if we changed the value, that'd be reflected in newer inventory collections but not older ones. Whether we updated this text or not, it would be wrong sometimes.
Better would be to include the limit in the inventory collection, but admittedly that seems like more trouble than it's worth.
Another option would be to not filter at all. Include all running sagas here. The tool could filter at runtime if it wanted.
Thoughts?
Will discuss IRL tomorrow, but yeah, the more I think about it the more I want to just include all sagas that are either in a running or unwinding state.
{
    Ok(sagas) => sagas,
    Err(e) => {
        self.in_progress.found_error(InventoryError::from(anyhow!(e)));
Probably want to add .context so that the error reflects that this happened while listing sagas.
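As a minimal std-only illustration of the difference such context makes (using String errors instead of anyhow, with hypothetical function names), the wrapped error records what was being attempted, not just what broke:

```rust
// Stand-in for the datastore call; the names and the error are hypothetical.
fn list_sagas() -> Result<Vec<String>, String> {
    Err("connection reset".to_string())
}

// The analogue of `.context("listing sagas")`: prepend what we were doing.
fn collect_stale_sagas() -> Result<Vec<String>, String> {
    list_sagas().map_err(|e| format!("listing sagas: {e}"))
}

fn main() {
    let err = collect_stale_sagas().unwrap_err();
    // The error now reads "listing sagas: connection reset".
    assert_eq!(err, "listing sagas: connection reset");
}
```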
let time_collected = Utc::now();
for saga in sagas.into_iter().filter(|saga| {
    (time_collected - saga.time_created) > STALE_SAGA_THRESHOLD
Any reason not to put this into the query instead?
My intention was to create an API to query the DB that could be reusable for other purposes, but yeah, it's more wasteful this way.
let state = match saga.saga_state.try_into() {
    Ok(state) => state,
    Err(msg) => {
        error!(
I'd add a comment that this should be impossible (presumably).
Thanks for the thorough review @davepacheco! We'll be chatting about this at the sync tomorrow, but I want to leave some thoughts so I don't forget.
First, I'm very surprised how many packages required the pq-sys treatment and it includes a bunch of stuff that really ought never to talk to the database (e.g., nexus-reconfigurator-simulation). As much as I hate using traits for dependency injection, in this case, I'd be inclined to at least do the same thing we did with the SledAgentEnumerator -- put the database bit behind a trait so that most consumers of nexus/inventory don't have to do all this. (That does make me wonder though why there are so many consumers of nexus-inventory. If I remember right, the types are in nexus-types, so the only consumers should be people building or collecting their own inventories. It seems like that should be very few?)
This may be on me; I may have taken the instructions from rpaths/src/lib.rs too literally 😅 I wrote more on this in the comment itself.
Second, the inventory is encoding some pretty arbitrary judgment about what makes a saga stuck (i.e., that it's been running for 15 minutes). That's a departure from most of inventory today. It feels like that judgment belongs in the consumer that's deciding what's okay. The obvious way to do that would be to include all sagas here, but that does seem like a lot of extra work when we hope that in almost all cases they're going to be ignored.
Fair, it makes sense to get rid of that threshold. My initial thought here is that perhaps we don't have to include all sagas. In a similar way to SMF services "enabled not online", we could include sagas only in certain states; namely, "running" and "unwinding". Aside from helping us determine if a saga may be stuck or not as part of these health checks, that list could serve as a snapshot in time for later debugging. For example, it can help us spot patterns in usage or any anomalies. I don't think saving this information would be detrimental.
Third, seeing that we're literally reading from the database and inserting the same data into a second table in the database doesn't feel awesome. I'm still not sure it's wrong. One alternative we discussed is to query Nexus instances. But that's tricky because new Nexus instances can be created (how do you know you didn't miss those?) and old ones can be expunged. When they get expunged, their sagas could move from one instance to another. With all that, I'm not sure how to query them all in a way where you'd know for sure that if there was a long-running saga, then you definitely saw it.
I agree that querying Nexus instances directly could create more problems and be less reliable for the reasons you mention. On the commentary about this approach: the data we're inserting is the same in a way, but different in another. A saga in the saga table can be updated, and we lose information about its state and current SEC at a given moment in time. Without a dedicated table for inventory as a snapshot in time, we can no longer answer "which saga was running for more than n minutes?". Having a snapshot of running/unwinding sagas per collection also gives us a time series. We can see a saga show up running at T1, T2, T3 and know it was stuck for hours rather than minutes. That's valuable for both debugging and spotting performance issues.
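A rough std-only sketch of that time-series use (saga ids reduced to integers, all names hypothetical): intersecting the sets of running/unwinding saga ids from consecutive collections identifies sagas that were stuck across the whole window.

```rust
use std::collections::BTreeSet;

// Each element is one collection's snapshot of running/unwinding saga ids.
// A saga id present in every snapshot was stuck for at least the span
// those collections cover.
fn stuck_across(snapshots: &[BTreeSet<u32>]) -> BTreeSet<u32> {
    snapshots
        .iter()
        .cloned()
        .reduce(|acc, s| acc.intersection(&s).cloned().collect())
        .unwrap_or_default()
}

fn main() {
    let t1 = BTreeSet::from([1, 2, 3]);
    let t2 = BTreeSet::from([2, 3]);
    let t3 = BTreeSet::from([3, 4]);
    // Only saga 3 shows up running at T1, T2, and T3.
    assert_eq!(stuck_across(&[t1, t2, t3]), BTreeSet::from([3]));
}
```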
This gets me to wondering whether we shouldn't do this check as a separate source of information at the point where we're doing a health check? In other words, when doing a pre-upgrade health check, we already plan to fetch an inventory collection, compare it to the blueprint, etc. Maybe we should just fetch running sagas from the database at that point and include any long-running ones in the check, separate from inventory?
For the reason above, I would prefer to record saga information. That said, I agree that the threshold is completely unnecessary, and it makes sense to me to record all sagas in either the running or unwinding state.
Forgot to add that when we restructured the health checks in #9876, we agreed that all of the information would be recorded in inventory as the source of truth. Any checks that one client does could be replicated by another client simply by looking at inventory. This is relevant because support will eventually only have the information included in the support bundles. We want to be able to run the same health checks from a support bundle.
Update: we chatted about this at the sync today and decided a few things:
As per #9876, this PR adds information about stale sagas to inventory: any saga older than 15 minutes that is in a running or unwinding state.
The 15-minute cutoff is arbitrary; the update status endpoint will only consider a saga older than 1 hour a health concern, but it seems useful to store data on sagas that have been in one of these two states for a shorter time for debugging purposes. I can change this threshold if people disagree, though.
A stale saga is present:
Everything is happy:
Closes: #10287