Summary
hmrc_ingestion_meta.checksum (varchar(64)) is the natural idempotency key for the HMRC CSV ingestion pipeline, but the table has no unique constraint on it. Application-side dedup at apps/web/scripts/ingest-hmrc-csv.ts:40-41 is insufficient.
Surfaced by CodeRabbit on PR #85 — the migration meta snapshot diff included the full schema, so the comment landed on the unrelated hmrc_ingestion_meta table. Deferred from PR #85 as out of scope.
Why the application-side check isn't enough
Current guard:
const lastIngestion = await sql\`SELECT "checksum" FROM "hmrc_ingestion_meta" ORDER BY "ingested_at" DESC LIMIT 1\`;
if (!force && lastIngestion?.checksum === checksum) {
// skip
}
Two real weaknesses:
- Race condition — two concurrent ingestions can both pass the check, both INSERT, both succeed.
- Wrong key — only checks the most recent row. If you ingest
A → B → re-ingest A, the third ingestion isn't caught (latest checksum is B's, not A's). Re-ingesting an older CSV (e.g. for a backfill replay) silently inserts a duplicate.
Proposed fix
Add a unique constraint at the DB layer:
ALTER TABLE hmrc_ingestion_meta
ADD CONSTRAINT hmrc_ingestion_meta_checksum_unique UNIQUE (checksum);
Or via Drizzle:
export const hmrcIngestionMeta = pgTable('hmrc_ingestion_meta', {
...,
checksum: varchar('checksum', { length: 64 }).notNull().unique(),
...
});
Then update ingest-hmrc-csv.ts:226 to use `ON CONFLICT (checksum) DO NOTHING` so a duplicate ingestion is a clean no-op rather than an error:
INSERT INTO hmrc_ingestion_meta (csv_url, checksum, record_count)
VALUES (\${url}, \${checksum}, \${dedupedRows.length})
ON CONFLICT (checksum) DO NOTHING
Pre-flight check
Before applying the constraint, verify no existing dupes block the migration:
SELECT checksum, COUNT(*)
FROM hmrc_ingestion_meta
GROUP BY checksum
HAVING COUNT(*) > 1;
Empty result → safe to migrate. Non-empty → resolve dupes first (probably keep the earliest `ingested_at` row per checksum and delete the rest).
Open question
Use (checksum) or (csv_url, checksum)? (checksum) is stricter (the same content can't be ingested under two different URLs). (csv_url, checksum) allows re-ingesting the same content from a different source. The HMRC use case has a single canonical source (assets.publishing.service.gov.uk), so (checksum) alone is probably right. Decide before migrating.
Summary
hmrc_ingestion_meta.checksum(varchar(64)) is the natural idempotency key for the HMRC CSV ingestion pipeline, but the table has no unique constraint on it. Application-side dedup atapps/web/scripts/ingest-hmrc-csv.ts:40-41is insufficient.Why the application-side check isn't enough
Current guard:
Two real weaknesses:
A → B → re-ingest A, the third ingestion isn't caught (latest checksum is B's, not A's). Re-ingesting an older CSV (e.g. for a backfill replay) silently inserts a duplicate.Proposed fix
Add a unique constraint at the DB layer:
Or via Drizzle:
Then update
ingest-hmrc-csv.ts:226to use `ON CONFLICT (checksum) DO NOTHING` so a duplicate ingestion is a clean no-op rather than an error:Pre-flight check
Before applying the constraint, verify no existing dupes block the migration:
Empty result → safe to migrate. Non-empty → resolve dupes first (probably keep the earliest `ingested_at` row per checksum and delete the rest).
Open question
Use
(checksum)or(csv_url, checksum)?(checksum)is stricter (the same content can't be ingested under two different URLs).(csv_url, checksum)allows re-ingesting the same content from a different source. The HMRC use case has a single canonical source (assets.publishing.service.gov.uk), so(checksum)alone is probably right. Decide before migrating.