Skip to content

fix(ingest): add unique constraint to hmrc_ingestion_meta.checksum for DB-enforced idempotency #86

@nikilok

Description

@nikilok

Summary

hmrc_ingestion_meta.checksum (varchar(64)) is the natural idempotency key for the HMRC CSV ingestion pipeline, but the table has no unique constraint on it. Application-side dedup at apps/web/scripts/ingest-hmrc-csv.ts:40-41 is insufficient.

Surfaced by CodeRabbit on PR #85 — the migration meta snapshot diff included the full schema, so the comment landed on the unrelated hmrc_ingestion_meta table. Deferred from PR #85 as out of scope.

Why the application-side check isn't enough

Current guard:

const lastIngestion = await sql\`SELECT "checksum" FROM "hmrc_ingestion_meta" ORDER BY "ingested_at" DESC LIMIT 1\`;
if (!force && lastIngestion?.checksum === checksum) {
  // skip
}

Two real weaknesses:

  1. Race condition — two concurrent ingestions can both pass the check, both INSERT, both succeed.
  2. Wrong key — only checks the most recent row. If you ingest A → B → re-ingest A, the third ingestion isn't caught (latest checksum is B's, not A's). Re-ingesting an older CSV (e.g. for a backfill replay) silently inserts a duplicate.

Proposed fix

Add a unique constraint at the DB layer:

ALTER TABLE hmrc_ingestion_meta
  ADD CONSTRAINT hmrc_ingestion_meta_checksum_unique UNIQUE (checksum);

Or via Drizzle:

export const hmrcIngestionMeta = pgTable('hmrc_ingestion_meta', {
  ...,
  checksum: varchar('checksum', { length: 64 }).notNull().unique(),
  ...
});

Then update ingest-hmrc-csv.ts:226 to use `ON CONFLICT (checksum) DO NOTHING` so a duplicate ingestion is a clean no-op rather than an error:

INSERT INTO hmrc_ingestion_meta (csv_url, checksum, record_count)
VALUES (\${url}, \${checksum}, \${dedupedRows.length})
ON CONFLICT (checksum) DO NOTHING

Pre-flight check

Before applying the constraint, verify no existing dupes block the migration:

SELECT checksum, COUNT(*)
FROM hmrc_ingestion_meta
GROUP BY checksum
HAVING COUNT(*) > 1;

Empty result → safe to migrate. Non-empty → resolve dupes first (probably keep the earliest `ingested_at` row per checksum and delete the rest).

Open question

Use (checksum) or (csv_url, checksum)? (checksum) is stricter (the same content can't be ingested under two different URLs). (csv_url, checksum) allows re-ingesting the same content from a different source. The HMRC use case has a single canonical source (assets.publishing.service.gov.uk), so (checksum) alone is probably right. Decide before migrating.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions