[fm] ereporter restart ordering dingus by hawkw · Pull Request #10132 · oxidecomputer/omicron

hawkw · 2026-03-23T22:13:32Z

Ereports are uniquely identified by the tuple of their reporter's restart ID and the ereport's ENA, as described in RFD 520. ENAs are ordered within a single restart of the reporter, but restart IDs are randomly generated UUIDs and have no temporal ordering. As I described in this comment on issue #10073, the approach I'd like to take for loading new ereports as inputs to fault management analysis relies on ordering not only the ENA component of the ereport ID but also the reporter restarts themselves (i.e. "is this restart newer or older than the one I currently believe to be the latest?"). To do this, we need to track the sequence of restart IDs per reporter location. This is done on a location-by-location basis because, well, the laws of physics prevent two sleds/switches/PSCs from occupying the same physical location in the rack at the same time, which makes their history of reporter restarts inherently orderable in a way that other things, like serial numbers, may not be.
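To make the intended ordering concrete, here is a minimal sketch of comparing two ereports from the same location by (restart generation, ENA). Everything in it is hypothetical and for illustration only: the `Location`, `SlotType`, `EreportId`, and `RestartOrder` types are not the types in omicron, and restart IDs are plain integers standing in for UUIDs.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Hypothetical stand-in for a restart ID (a random UUID in the real system).
pub type RestartId = u128;

/// Hypothetical slot types; illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum SlotType {
    Sled,
    Switch,
    Power,
}

/// Hypothetical physical location of a reporter in the rack.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Location {
    pub slot_type: SlotType,
    pub slot: u16,
}

/// An ereport is identified by its reporter's restart ID plus its ENA.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct EreportId {
    pub restart_id: RestartId,
    pub ena: u64,
}

/// Per-location restart history: maps (location, restart ID) to a generation.
#[derive(Default)]
pub struct RestartOrder {
    generations: HashMap<(Location, RestartId), u32>,
}

impl RestartOrder {
    /// Record that `restart_id` is the `generation`-th restart seen at `loc`.
    pub fn record(&mut self, loc: Location, restart_id: RestartId, generation: u32) {
        self.generations.insert((loc, restart_id), generation);
    }

    /// Compare two ereports from the same location: first by the generation
    /// of their restart, then by ENA. Returns `None` if either restart ID
    /// has not been recorded for this location.
    pub fn compare(&self, loc: Location, a: EreportId, b: EreportId) -> Option<Ordering> {
        let gen_a = self.generations.get(&(loc, a.restart_id))?;
        let gen_b = self.generations.get(&(loc, b.restart_id))?;
        Some(gen_a.cmp(gen_b).then(a.ena.cmp(&b.ena)))
    }
}
```

The point of the sketch is that the restart IDs themselves never get compared: an ereport with a large ENA from an older restart still sorts before an ereport with a small ENA from a newer one.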

This PR adds a table for constructing that ordering and modifies the ereport insert query to actually use it. There's some additional tidiness and refactoring I'd like to do here later (and an OMDB command for printing the ordering table would be nice), but for now, I'd like to be able to actually use this.

smklein · 2026-03-24T22:45:34Z

+    ROW_NUMBER() OVER (
+        PARTITION BY reporter, slot_type, slot
+        ORDER BY time_first_seen
+    ) - 1 AS generation,


I presume ROW_NUMBER is CRDB "enumerate"?

(Is this particularly different than "COUNT DISTINCT"?)

It is pretty much exactly "CRDB for Iterator::enumerate()": https://www.cockroachlabs.com/docs/stable/window-functions#add-row-numbers-to-query-output

I guess the way you would use COUNT DISTINCT to produce the generation numbers in order for each row is to COUNT all the DISTINCT records for a slot prior to each row you're about to insert? I wasn't totally sure how to do that, and this thing was basically just an example I could copy out of the CRDB docs :)

Also, I would kinda guess that using COUNT DISTINCT in that way would actually make the database go back and actually COUNT all the rows each time, which feels very O(n²), while this is (I think) just adding numbers to the result set of a query it's already gone and done (AFAICT?).
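For anyone following along, the `Iterator::enumerate()` analogy can be sketched in plain Rust. This is only an illustration of what `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY time_first_seen) - 1` computes, not how CockroachDB executes it; the `Restart` struct, the `Location` alias, and the integer IDs and timestamps are all made up for the example.

```rust
use std::collections::HashMap;

/// Hypothetical location key: (slot type, slot number).
pub type Location = (&'static str, u16);

/// Hypothetical row shape; integer ID and timestamp are stand-ins.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Restart {
    pub location: Location,
    pub restart_id: u32,
    pub time_first_seen: u64,
}

/// Number the restarts within each location, starting from zero, in order
/// of `time_first_seen`.
pub fn assign_generations(mut restarts: Vec<Restart>) -> Vec<(Restart, usize)> {
    // Analog of PARTITION BY location ORDER BY time_first_seen: one sort.
    restarts.sort_by(|a, b| {
        (&a.location, a.time_first_seen).cmp(&(&b.location, b.time_first_seen))
    });

    // Analog of ROW_NUMBER() - 1: a running counter per partition, bumped
    // once per row, so numbering is a single pass over the sorted rows.
    let mut next: HashMap<Location, usize> = HashMap::new();
    let mut numbered = Vec::with_capacity(restarts.len());
    for restart in restarts {
        let counter = next.entry(restart.location).or_insert(0);
        let generation = *counter;
        *counter += 1;
        numbered.push((restart, generation));
    }
    numbered
}
```

The numbering is one pass over rows that are already sorted, whereas counting all the prior rows separately for each row would rescan the partition every time, which is the quadratic behavior guessed at above.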

smklein · 2026-03-24T22:47:11Z

+    /// List restarts of an ereporter at a given physical location, paginated by
+    /// restart generation.
+    pub async fn ereporter_restart_list(


Kinda curious; is this going to be for debugging?

(seems like in the "common case" we'll only care about "what's the latest one", right?)

Yeah, I was thinking I would add a "select latest one" query... but I kinda think that in practice, this table is going to end up being used mostly for filtering other queries, so I'm not actually sure whether we will end up using a separate Rust function that just fetches the latest ID from this reporter, or doing a select from this table to get the latest inside the database.

At present this method is just being used for the tests, but I was planning to plumb this through into OMDB, which is why it's pub currently.

mergeconflict · 2026-03-25T17:44:30Z

+        let next_generation = coalesce(
+            restart_dsl::ereporter_restart
+                .filter(restart_dsl::reporter_type.eq(reporter_type))
+                .filter(restart_dsl::slot_type.eq(slot_type))
+                .filter(restart_dsl::slot.eq(slot))
+                .select(max(restart_dsl::generation))
+                .limit(1)
+                .single_value()
+                + 1,
+            0,
+        );


Is it possible that two concurrent transactions for different restart IDs sharing the same (reporter_type, slot_type, slot) could both read the same max(restart_dsl::generation), compute the same next_generation, and collide?

So, in the event of two concurrent inserts with different restart IDs, the one that loses the race will be rejected by the unique index here:

omicron/schema/crdb/dbinit.sql, lines 7312 to 7319 in ccb9a3a:

    CREATE UNIQUE INDEX IF NOT EXISTS
        lookup_ereporter_restart_generations_by_location
    ON omicron.public.ereporter_restart (
        reporter_type,
        slot_type,
        slot,
        generation
    );

This is good, but I think the Rust code should probably retry if that constraint is violated, so that the loser of the race is inserted at the next generation.
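A rough sketch of the retry behavior described here, using an in-memory map in place of the database. Nothing below is the actual datastore code: `RestartTable`, `InsertError`, `try_insert`, and `insert_with_retry` are hypothetical names, and the `BTreeMap` key merely plays the role of the unique index on (location, generation).

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

/// Hypothetical location key: (slot type, slot number).
pub type Location = (&'static str, u16);
/// Hypothetical stand-in for a restart ID (a UUID in the real system).
pub type RestartId = u32;

#[derive(Debug, PartialEq, Eq)]
pub enum InsertError {
    /// Another restart already holds this (location, generation).
    GenerationTaken,
}

/// In-memory stand-in for the restart table and its unique index.
#[derive(Default)]
pub struct RestartTable {
    by_generation: BTreeMap<(Location, u32), RestartId>,
}

impl RestartTable {
    /// Analog of COALESCE(MAX(generation) + 1, 0) for one location.
    pub fn next_generation(&self, loc: Location) -> u32 {
        self.by_generation
            .range((loc, 0)..=(loc, u32::MAX))
            .next_back()
            .map_or(0, |((_, generation), _)| generation + 1)
    }

    /// Insert at an explicit generation, failing on a duplicate key the way
    /// the unique index would reject the second writer.
    pub fn try_insert(
        &mut self,
        loc: Location,
        generation: u32,
        id: RestartId,
    ) -> Result<(), InsertError> {
        match self.by_generation.entry((loc, generation)) {
            Entry::Occupied(_) => Err(InsertError::GenerationTaken),
            Entry::Vacant(slot) => {
                slot.insert(id);
                Ok(())
            }
        }
    }
}

/// Try to insert at `observed_generation` (a value read earlier, possibly
/// stale). On a collision, re-read the next generation and try again, up to
/// `max_attempts` attempts in total. Returns the generation actually used.
pub fn insert_with_retry(
    table: &mut RestartTable,
    loc: Location,
    id: RestartId,
    observed_generation: u32,
    max_attempts: usize,
) -> Result<u32, InsertError> {
    let mut candidate = observed_generation;
    for _ in 0..max_attempts {
        match table.try_insert(loc, candidate, id) {
            Ok(()) => return Ok(candidate),
            Err(InsertError::GenerationTaken) => {
                candidate = table.next_generation(loc);
            }
        }
    }
    Err(InsertError::GenerationTaken)
}
```

In this model, two writers that both read generation 0 do not both win: the first insert takes generation 0, the second is rejected, and the retry lands it at generation 1. The attempt bound is an arbitrary choice for the sketch, so a persistently contended location surfaces an error instead of looping forever.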

As discussed in [this comment][1], it turns out that the approach to sitrep input loading which required the ereporter restart table that I added in #10132 will actually just straight up not work. Unfortunately, we didn't figure that out until after the PR adding that table had merged. We've now found a different way to do this that doesn't require the ordering table. Additionally, the ordering table orders restarts by the order in which ereports were _added to the database_, not the order in which the restarts were actually first observed, which is of debatable usefulness at best. Therefore, we should just get rid of it.

Unfortunately, because there was a database migration, we cannot simply push a commit that reverts eb0bb82 and removes all evidence of my shame. Instead, we must have a new migration that just kinda undoes the migration I wrote yesterday, in case anyone has installed a build of the control plane containing eb0bb82 some time in approximately the last two hours. So, undoing the change requires somewhat more ceremony than it would if the database schema had not been touched. This PR does that.

[1]: #10073 (comment)

hawkw added 14 commits March 23, 2026 15:11

a7e185e add table
e37e260 reticulating
f0bfbd3 wip
a0d59cd reticulating CTE
b339aad reticulating cte some more
763956e reticulate
145bfc9 lsit query
7d5b8b9 okay query might actually be good now
fc3ed42 commentary
9abe495 reticulating
031a2f1 CTE can actually just use MAX()
acc6961 oh my god you can do it all in Diesel!!!
b901b65 plumbing
8578faa migration

smklein reviewed Mar 24, 2026

Comment thread nexus/db-queries/src/db/datastore/ereport.rs Outdated

Comment thread nexus/db-queries/src/db/datastore/ereport.rs

Comment thread nexus/db-queries/src/db/datastore/ereport.rs

Comment thread nexus/db-queries/src/db/datastore/ereport.rs

smklein approved these changes Mar 24, 2026

hawkw added 2 commits March 24, 2026 16:06

c5d1320 LOL I FORGOR TO DO THIS PART
bb042c1 expect that it fails with the right thing

hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Mar 24, 2026

hawkw marked this pull request as ready for review March 24, 2026 23:36

hawkw changed the title ~~[WIP][fm] ereporter restart ordering dingus~~ [fm] ereporter restart ordering dingus Mar 24, 2026

hawkw requested a review from mergeconflict March 24, 2026 23:36

hawkw enabled auto-merge (squash) March 24, 2026 23:37

ccb9a3a fix ereports in support bundle tests getting clobbered

hawkw merged commit eb0bb82 into main Mar 25, 2026
16 checks passed

hawkw deleted the eliza/ereporter-restart-order branch March 25, 2026 17:39

mergeconflict reviewed Mar 25, 2026


This was referenced Mar 25, 2026

FM: design for diagnosis inputs/analysis preparation phase #10073

Open

[fm] remove ereporter restart ordering dingus #10152

Merged

[fm] ereporter restart ordering dingus #10132: hawkw merged 17 commits into main from eliza/ereporter-restart-order
