[inventory] Add stale saga information #10027
karencfv wants to merge 20 commits into oxidecomputer:main
Conversation
From the build-and-test (helios) job https://buildomat.eng.oxide.computer/wg/0/details/01KP5DY5F65D03T0G7EHK3D0VX/3HTS7wQUErwzoUUVG6juJt1SdvKCVqdRoL8vOvWb0bSdknVd/01KP5DYFQWBTSMENHD0BFYC31X#S6677 I am legitimately stumped by this. I didn't even touch that crate. The build-and-test (ubuntu-22.04) tests run fine, and the problem isn't showing on main either!!

@davepacheco @jgallagher tiny ping for a review please?
I feel terrible about this, but seeing the concrete change does give me some serious doubts about the direction that I suggested before. I have a bunch of specific comments inline but also some bigger questions that might preempt those.
First, I'm very surprised how many packages required the pq-sys treatment and it includes a bunch of stuff that really ought never to talk to the database (e.g., nexus-reconfigurator-simulation). As much as I hate using traits for dependency injection, in this case, I'd be inclined to at least do the same thing we did with the SledAgentEnumerator -- put the database bit behind a trait so that most consumers of nexus/inventory don't have to do all this. (That does make me wonder though why there are so many consumers of nexus-inventory. If I remember right, the types are in nexus-types, so the only consumers should be people building or collecting their own inventories. It seems like that should be very few?)
Second, the inventory is encoding some pretty arbitrary judgment about what makes a saga stuck (i.e., that it's been running for 15 minutes). That's a departure from most of inventory today. It feels like that judgment belongs in the consumer that's deciding what's okay. The obvious way to do that would be to include all sagas here, but that does seem like a lot of extra work when we hope that in almost all cases they're going to be ignored.
Third, seeing that we're literally reading from the database and inserting the same data into a second table in the database doesn't feel awesome. I'm still not sure it's wrong. One alternative we discussed is to query Nexus instances. But that's tricky because new Nexus instances can be created (how do you know you didn't miss those?) and old ones can be expunged. When they get expunged, their sagas could move from one instance to another. With all that, I'm not sure how to query them all in a way where you'd know for sure that if there was a long-running saga, then you definitely saw it.
This gets me to wondering whether we shouldn't do this check as a separate source of information at the point where we're doing a health check? In other words, when doing a pre-upgrade health check, we already plan to fetch an inventory collection, compare it to the blueprint, etc. Maybe we should just fetch running sagas from the database at that point and include any long-running ones in the check, separate from inventory?
Edit: Sorry in advance if my own concerns have us change direction from the one I initially suggested!
// See omicron-rpaths for documentation.
// NOTE: This file MUST be kept in sync with the other build.rs files in this
// repository.
fn main() {
I gather that a whole bunch of packages needed the pq-sys/rpaths/build.rs treatment:
omicron-sled-agent
omicron-test-utils
tqdb
end-to-end-tests
nexus-reconfigurator-blippy
nexus-fm
nexus-reconfigurator-planning
nexus-reconfigurator-simulation
But I'm not clear on why. My understanding is that you'd need this if you added dependencies on nexus-db-queries, which would happen for things that depended on nexus-inventory before. But omicron-sled-agent doesn't have a dependency on nexus-inventory.
Similarly, omicron-test-utils doesn't seem to have any other dependency on pq-sys:
$ cargo tree -i pq-sys -p omicron-test-utils
pq-sys v0.4.6 (https://github.com/oxidecomputer/pq-sys?branch=oxide%2Fomicron#b1194c19)
└── omicron-test-utils v0.1.0 (/home/dap/omicron-review/test-utils)
and end-to-end-tests doesn't seem to have any that aren't one of those two:
$ cargo tree -i pq-sys -p end-to-end-tests
pq-sys v0.4.6 (https://github.com/oxidecomputer/pq-sys?branch=oxide%2Fomicron#b1194c19)
├── end-to-end-tests v0.1.0 (/home/dap/omicron-review/end-to-end-tests)
├── omicron-sled-agent v0.1.0 (/home/dap/omicron-review/sled-agent)
│ └── end-to-end-tests v0.1.0 (/home/dap/omicron-review/end-to-end-tests)
└── omicron-test-utils v0.1.0 (/home/dap/omicron-review/test-utils)
└── end-to-end-tests v0.1.0 (/home/dap/omicron-review/end-to-end-tests)
tqdb is admittedly more complicated... it seems to be pulling in nexus-db-queries through this path:
$ cargo tree -i pq-sys -p tqdb
pq-sys v0.4.6 (https://github.com/oxidecomputer/pq-sys?branch=oxide%2Fomicron#b1194c19)
├── diesel v2.3.7
│ ├── async-bb8-diesel v0.2.1
│ │ ├── nexus-db-lookup v0.1.0 (/home/dap/omicron-review/nexus/db-lookup)
│ │ │ └── nexus-db-queries v0.1.0 (/home/dap/omicron-review/nexus/db-queries)
│ │ │ └── nexus-inventory v0.1.0 (/home/dap/omicron-review/nexus/inventory)
│ │ │ ├── nexus-reconfigurator-planning v0.1.0 (/home/dap/omicron-review/nexus/reconfigurator/planning)
│ │ │ │ ├── nexus-reconfigurator-simulation v0.1.0 (/home/dap/omicron-review/nexus/reconfigurator/simulation)
│ │ │ │ │ └── reconfigurator-cli v0.1.0 (/home/dap/omicron-review/dev-tools/reconfigurator-cli)
│ │ │ │ │ └── tqdb v0.1.0 (/home/dap/omicron-review/trust-quorum/tqdb)
...
which is pretty unfortunate but I'm not sure how hard it is to avoid.
In general, it's surprising from first principles that several of these would require database access (like the nexus-reconfigurator packages).
Ok, so this might just be me misunderstanding the instructions in rpaths/src/lib.rs. It says:
- Any crate that depends on pq-sys, directly or not, needs to follow these
instructions. Generally, we depend on pq-sys indirectly, by virtue of
depending on Diesel.
I took that to mean any crate on the dependency tree. So everything I get when running cargo tree -i nexus-inventory. Judging by your reaction perhaps I was wrong?
SagaCreator = {},
SagaSec = {},
I'm not sure these are really different, are they?
creator UUID NOT NULL,
current_sec UUID,
name TEXT NOT NULL,
state omicron.public.stale_saga_state NOT NULL,
A bunch of this duplicates information from the saga table, and some of it (like current_sec and state) could become out of date. Maybe we should just have consumers JOIN against the saga table (or just fetch from it directly) when they want more information about a stale saga.
I agree that this information could become out of date. But isn't the point of the inventory to capture a snapshot of the state of the system at a given time? That was my reasoning behind creating this table.
/*
 * For listing sagas by state (e.g., finding all running and unwinding sagas
 * across all SECs). We need to paginate this list by the id.
 */
CREATE INDEX IF NOT EXISTS lookup_saga_by_state ON omicron.public.saga (
    saga_state, id
);
First, this almost certainly wants to be limited to WHERE saga_state = 'running' OR saga_state = 'unwinding', right? Otherwise, this will grow unbounded and include all the 'done' sagas that we don't care about here because they're not stuck.
Now, that's almost the same condition as the previous index. In principle, I think it should be the same condition. The difference right now is that the previous index includes abandoned sagas, but I don't think it should. I'm guessing that was an oversight when we added the abandoned state.
So maybe the change to make here is to update the lookup_saga_by_sec index to include AND saga_state != 'abandoned'? We'd probably want to double-check that any queries using saga_state will keep using this index.
Zone 587da9e8-8fc0-4854-b585-070741a7b00d: Internal DNS generation @ 1
Zone ffbf02f0-261d-4723-b613-eb861245acbd: Internal DNS generation @ 1

STALE SAGAS
General note: "stale" doesn't seem like the right word for this. They seem "stuck" at worst. "Long-running" is more precise, if longer.
Argh, yeah. Initially I had called them long-running sagas, but since I included the ones in the "unwinding" state in addition to the ones in the "running" state, I thought the name could be misleading.
pub time_collected: DateTime<Utc>,
}

impl IdOrdItem for InventoryStaleSaga {
This doesn't seem correct. An iddqd::IdOrdMap<T> is analogous to a std::collections::BTreeMap<<T as IdOrdItem>::Key, T>. In this case, this is analogous to a BTreeMap<SagaCreatorUuid, InventoryStaleSaga>. That means we're only tracking at most one stuck saga for each Nexus instance. If that's what you meant, I would use a BTreeMap. But I'm guessing you meant to track all of them, in which case the key should be the saga_id: SagaUuid, not the creator: SagaCreatorUuid.
if stale_sagas.is_empty() {
    writeln!(
        f,
        "No sagas have been running or unwinding for longer than 15 minutes"
I'm not sure it's worth fixing this but hardcoding the "15 minutes" here seems wrong. For one, if we changed the value, that'd be reflected in newer inventory collections but not older ones. Whether we updated this text or not, it would be wrong sometimes.
Better would be to include the limit in the inventory collection, but admittedly that seems like more trouble than it's worth.
Another option would be to not filter at all. Include all running sagas here. The tool could filter at runtime if it wanted.
Thoughts?
Will discuss IRL tomorrow, but yeah, the more I think about it the more I want to just include all sagas that are either in a running or unwinding state.
{
    Ok(sagas) => sagas,
    Err(e) => {
        self.in_progress.found_error(InventoryError::from(anyhow!(e)));
Probably want to add .context so that the error reflects that this happened while listing sagas.
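As a minimal std-only illustration of the difference such context makes (using String errors instead of anyhow, with hypothetical function names), the wrapped error records what was being attempted, not just what broke:

```rust
// Stand-in for the datastore call; the names and the error are hypothetical.
fn list_sagas() -> Result<Vec<String>, String> {
    Err("connection reset".to_string())
}

// The analogue of `.context("listing sagas")`: prepend what we were doing.
fn collect_stale_sagas() -> Result<Vec<String>, String> {
    list_sagas().map_err(|e| format!("listing sagas: {e}"))
}

fn main() {
    let err = collect_stale_sagas().unwrap_err();
    // The error now reads "listing sagas: connection reset".
    assert_eq!(err, "listing sagas: connection reset");
}
```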
let time_collected = Utc::now();
for saga in sagas.into_iter().filter(|saga| {
    (time_collected - saga.time_created) > STALE_SAGA_THRESHOLD
Any reason not to put this into the query instead?
My intention was to create an API to query the DB that could be reusable for other purposes, but yeah, it's more wasteful this way.
let state = match saga.saga_state.try_into() {
    Ok(state) => state,
    Err(msg) => {
        error!(
I'd add a comment that this should be impossible (presumably).
Thanks for the thorough review @davepacheco! We'll be chatting about this at the sync tomorrow, but I want to leave some thoughts so I don't forget.
First, I'm very surprised how many packages required the pq-sys treatment and it includes a bunch of stuff that really ought never to talk to the database (e.g., nexus-reconfigurator-simulation). As much as I hate using traits for dependency injection, in this case, I'd be inclined to at least do the same thing we did with the SledAgentEnumerator -- put the database bit behind a trait so that most consumers of nexus/inventory don't have to do all this. (That does make me wonder though why there are so many consumers of nexus-inventory. If I remember right, the types are in nexus-types, so the only consumers should be people building or collecting their own inventories. It seems like that should be very few?)
This may be on me; I may have taken the instructions from rpaths/src/lib.rs too literally 😅 I wrote more on this in the comment itself.
Second, the inventory is encoding some pretty arbitrary judgment about what makes a saga stuck (i.e., that it's been running for 15 minutes). That's a departure from most of inventory today. It feels like that judgment belongs in the consumer that's deciding what's okay. The obvious way to do that would be to include all sagas here, but that does seem like a lot of extra work when we hope that in almost all cases they're going to be ignored.
Fair, it makes sense to get rid of that threshold. My initial thought here is that perhaps we don't have to include all sagas. In a similar way to SMF services "enabled not online", we could include sagas only in certain states; namely, "running" and "unwinding". Aside from helping us determine if a saga may be stuck or not as part of these health checks, that list could serve as a snapshot in time for later debugging. For example, it can help us spot patterns in usage or any anomalies. I don't think saving this information would be detrimental.
Third, seeing that we're literally reading from the database and inserting the same data into a second table in the database doesn't feel awesome. I'm still not sure it's wrong. One alternative we discussed is to query Nexus instances. But that's tricky because new Nexus instances can be created (how do you know you didn't miss those?) and old ones can be expunged. When they get expunged, their sagas could move from one instance to another. With all that, I'm not sure how to query them all in a way where you'd know for sure that if there was a long-running saga, then you definitely saw it.
I agree that querying Nexus instances directly could create more problems and be less reliable for the reasons you mention. On the commentary about this approach: the data we're inserting is the same in a way, but different in another. A saga in the saga table can be updated, and we lose information about its state and current SEC at a given moment in time. Without a dedicated table for inventory as a snapshot in time, we can no longer answer "which saga was running for more than n minutes?". Having a snapshot of running/unwinding sagas per collection also gives us a time series. We can see a saga show up running at T1, T2, T3 and know it was stuck for hours rather than minutes. That's valuable for both debugging and spotting performance issues.
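A rough std-only sketch of that time-series use (saga ids reduced to integers, all names hypothetical): intersecting the sets of running/unwinding saga ids from consecutive collections identifies sagas that were stuck across the whole window.

```rust
use std::collections::BTreeSet;

// Each element is one collection's snapshot of running/unwinding saga ids.
// A saga id present in every snapshot was stuck for at least the span
// those collections cover.
fn stuck_across(snapshots: &[BTreeSet<u32>]) -> BTreeSet<u32> {
    snapshots
        .iter()
        .cloned()
        .reduce(|acc, s| acc.intersection(&s).cloned().collect())
        .unwrap_or_default()
}

fn main() {
    let t1 = BTreeSet::from([1, 2, 3]);
    let t2 = BTreeSet::from([2, 3]);
    let t3 = BTreeSet::from([3, 4]);
    // Only saga 3 shows up running at T1, T2, and T3.
    assert_eq!(stuck_across(&[t1, t2, t3]), BTreeSet::from([3]));
}
```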
This gets me to wondering whether we shouldn't do this check as a separate source of information at the point where we're doing a health check? In other words, when doing a pre-upgrade health check, we already plan to fetch an inventory collection, compare it to the blueprint, etc. Maybe we should just fetch running sagas from the database at that point and include any long-running ones in the check, separate from inventory?
For the reason above, I would prefer to record saga information. That said, I agree that the threshold is completely unnecessary, and it makes sense to me to record all sagas in either the running or unwinding state.
Forgot to add that when we restructured the health checks in #9876, we agreed that all of the information would be recorded in inventory as the source of truth. Any checks that one client does could be replicated by another client simply by looking at inventory. This is relevant because support will eventually only have the information included in the support bundles. We want to be able to run the same health checks from a support bundle.
Update: we chatted about this at the sync today and decided a few things:
As per #9876, this PR adds information about stale sagas to inventory: any saga older than 15 minutes that is in a running or unwinding state.
The 15-minute cutoff is arbitrary; the update status endpoint will only consider a saga older than 1 hour a health concern, but it seems useful to store data on sagas that have been in one of these two states for a shorter time for debugging purposes. I can change this threshold if people disagree, though.
A stale saga is present:
Everything is happy:
Closes: #10287