fix(ingest): add unique constraint to hmrc_ingestion_meta.checksum for DB-enforced idempotency

## Summary

`hmrc_ingestion_meta.checksum` (varchar(64)) is the natural idempotency key for the HMRC CSV ingestion pipeline, but the table has no unique constraint on it. Application-side dedup at [`apps/web/scripts/ingest-hmrc-csv.ts:40-41`](apps/web/scripts/ingest-hmrc-csv.ts#L40-L41) is insufficient.

> Surfaced by CodeRabbit on PR #85 — the migration meta snapshot diff included the full schema, so the comment landed on the unrelated `hmrc_ingestion_meta` table. Deferred from PR #85 as out of scope.

## Why the application-side check isn't enough

Current guard:

```ts
const lastIngestion = await sql\`SELECT "checksum" FROM "hmrc_ingestion_meta" ORDER BY "ingested_at" DESC LIMIT 1\`;
if (!force && lastIngestion?.checksum === checksum) {
  // skip
}
```

Two real weaknesses:

1. **Race condition** — two concurrent ingestions can both pass the check, both INSERT, both succeed.
2. **Wrong key** — only checks the *most recent* row. If you ingest `A → B → re-ingest A`, the third ingestion isn't caught (latest checksum is B's, not A's). Re-ingesting an older CSV (e.g. for a backfill replay) silently inserts a duplicate.

## Proposed fix

Add a unique constraint at the DB layer:

```sql
ALTER TABLE hmrc_ingestion_meta
  ADD CONSTRAINT hmrc_ingestion_meta_checksum_unique UNIQUE (checksum);
```

Or via Drizzle:

```ts
export const hmrcIngestionMeta = pgTable('hmrc_ingestion_meta', {
  ...,
  checksum: varchar('checksum', { length: 64 }).notNull().unique(),
  ...
});
```

Then update `ingest-hmrc-csv.ts:226` to use \`ON CONFLICT (checksum) DO NOTHING\` so a duplicate ingestion is a clean no-op rather than an error:

```ts
INSERT INTO hmrc_ingestion_meta (csv_url, checksum, record_count)
VALUES (\${url}, \${checksum}, \${dedupedRows.length})
ON CONFLICT (checksum) DO NOTHING
```

## Pre-flight check

Before applying the constraint, verify no existing dupes block the migration:

```sql
SELECT checksum, COUNT(*)
FROM hmrc_ingestion_meta
GROUP BY checksum
HAVING COUNT(*) > 1;
```

Empty result → safe to migrate. Non-empty → resolve dupes first (probably keep the earliest \`ingested_at\` row per checksum and delete the rest).

## Open question

Use `(checksum)` or `(csv_url, checksum)`? `(checksum)` is stricter (the same content can't be ingested under two different URLs). `(csv_url, checksum)` allows re-ingesting the same content from a different source. The HMRC use case has a single canonical source (`assets.publishing.service.gov.uk`), so `(checksum)` alone is probably right. Decide before migrating.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ingest): add unique constraint to hmrc_ingestion_meta.checksum for DB-enforced idempotency #86

Summary

Why the application-side check isn't enough

Proposed fix

Pre-flight check

Open question

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

fix(ingest): add unique constraint to hmrc_ingestion_meta.checksum for DB-enforced idempotency #86

Description

Summary

Why the application-side check isn't enough

Proposed fix

Pre-flight check

Open question

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions