From f01cb88ab85ebe0b466a6b6411eb0d4a22cfb0c0 Mon Sep 17 00:00:00 2001
From: Nikil Kuruvilla <nikilok@gmail.com>
Date: Fri, 29 May 2026 00:50:20 +0100
Subject: [PATCH] docs: plan for searching across previous company names
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Option-B feature plan: a materialized search_names table (current + former names) with a gin_trgm_ops index, so sponsors can be found by a former Companies House name (e.g. "Motodynamics Ltd" -> PhysicsX). Planning doc only — no code changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/search-previous-names-plan.md | 213 +++++++++++++++++++++++++++++
 1 file changed, 213 insertions(+)
 create mode 100644 docs/search-previous-names-plan.md

diff --git a/docs/search-previous-names-plan.md b/docs/search-previous-names-plan.md
new file mode 100644
index 0000000..5d8db96
--- /dev/null
+++ b/docs/search-previous-names-plan.md
@@ -0,0 +1,213 @@
+# Feature plan: search across previous company names (Option B)
+
+**Status:** Planned · **Created:** 2026-05-29 · **Branch:** `feat/search-previous-names`
+
+## Goal
+
+Let users find a sponsor by a name it *used to* have. Companies House records a
+company's former names; HMRC lists the sponsor under one name only. Example:
+**PhysicsX** was formerly **Motodynamics Ltd** — typing `Motodynamics Ltd` (or
+`motodynamics`) on the home page should surface the PhysicsX listing.
+
+## Decision: Option B (materialized search table) over Option A (inline join)
+
+Two approaches were considered:
+
+- **Option A — inline exact match.** Add a `previous_company_names @> ARRAY[...]`
+  branch to `searchHmrc`, reusing the existing array GIN index. Fast to build, but
+  only matches a former name typed (near-)verbatim — no fuzzy/partial — because the
+  array GIN is not a trigram index.
+- **Option B — unified materialized search table (chosen).** One row per
+  searchable name (current **and** former), with a `gin_trgm_ops` index. Search
+  becomes a single trigram scan over a unified index, so former names get the same
+  fuzzy/prefix matching as current names. Costs a new table + a rebuild step, but
+  *simplifies* the query (no 3-table HMRC→mapping→CH join at search time) and
+  keeps search cost comparable to today (see Perf under "Edge cases & risks").
+
+## Current state (what we're changing)
+
+- `searchHmrc` ([apps/web/src/api/hmrc.ts](../apps/web/src/api/hmrc.ts)) scans
+  `hmrc_skilled_workers` directly, scoring `organisation_name` with regex
+  word-boundary (`~*` + `\m`) and `pg_trgm` `word_similarity`/`similarity`
+  (prefix `2.0` > word-boundary `1.0` > trigram). Backed by a `gin_trgm_ops`
+  index on `organisation_name`. Paginated by `offset`, over-fetch `+1` for
+  `hasMore`, empty for queries `< 3` chars.
+- Same fn powers the home-page infinite query **and** the MCP search tool — both
+  get this feature for free.
+
+Relevant data (relationships):
+
+```
+hmrc_skilled_workers.organisation_name           -- searchable list (1 row / route)
+  → hmrc_company_mapping.organisation_name        -- PK; → company_number (nullable)
+  → companies_house_profiles.company_number       -- → previous_company_names text[]
+```
+
+Volumes today: 141,030 HMRC rows · 126,210 distinct org names · 43,546 former-name
+entries → search table ≈ **170k rows** (126,210 + 43,546 = 169,756).
+
+## 1. New table
+
+Add to [packages/db/src/schema.ts](../packages/db/src/schema.ts):
+
+```ts
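+// Needs `index`, `pgTable`, `primaryKey`, `text`, `varchar` from 'drizzle-orm/pg-core'
+// and `sql` from 'drizzle-orm' (schema.ts may already import them).
+export const searchNames = pgTable(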
+  'search_names',
+  {
+    organisationName: text('organisation_name').notNull(), // HMRC row(s) to surface
+    name: text('name').notNull(),                          // searchable term
+    kind: varchar('kind', { length: 16 }).notNull(),       // 'current' | 'previous'
+  },
+  (table) => [
+    primaryKey({ columns: [table.organisationName, table.name] }),
+    index('idx_search_names_trgm').using('gin', sql`${table.name} gin_trgm_ops`),
+  ],
+);
+```
+
+Notes:
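+- PK `(organisation_name, name)` + insert-current-first dedupes the "former name
+  == current name" case for free (the same data-quality issue Option A's runtime
+  filter handled). The PK comparison is exact, so a former name that differs only
+  in case or spacing is kept as a `previous` row; it still never double-lists,
+  because search returns one best-scoring name per org.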
+- `pg_trgm` is already enabled (used by the org-name index). The GIN trigram index
+  on `name` is what makes fuzzy former-name search possible.
+- Generate + apply the migration: `bun drizzle-kit generate` then the project's
+  migrate step. Confirm the migration emits the `gin_trgm_ops` index (drizzle-kit
+  sometimes needs the raw `sql` form, as above).
+
+## 2. Build / sync script
+
+New `apps/web/scripts/build-search-names.ts` (mirrors `generate-sitemap.ts` — same
+HMRC→mapping→CH join). Full rebuild (truncate + repopulate; ~170k rows is fast):
+
+```sql
+TRUNCATE search_names;
+
+-- current names (one per distinct org name)
+INSERT INTO search_names (organisation_name, name, kind)
+SELECT DISTINCT organisation_name, organisation_name, 'current'
+FROM hmrc_skilled_workers
+ON CONFLICT DO NOTHING;
+
+-- former names (only for org names that have HMRC rows; drop blanks)
+INSERT INTO search_names (organisation_name, name, kind)
+SELECT DISTINCT m.organisation_name, btrim(pn), 'previous'
+FROM hmrc_company_mapping m
+JOIN companies_house_profiles p ON p.company_number = m.company_number
+CROSS JOIN LATERAL unnest(p.previous_company_names) AS pn
+WHERE btrim(pn) <> ''
+  AND EXISTS (
+    SELECT 1 FROM hmrc_skilled_workers h
+    WHERE h.organisation_name = m.organisation_name
+  )
+ON CONFLICT DO NOTHING;  -- drops former names equal to the current name
+```
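+
+The two inserts can also be modelled in memory, which may help when unit-testing
+the dedupe rules without a database. A minimal sketch, assuming hypothetical
+names (`buildSearchNames`, `SearchNameRow`) that are not in the repo; the real
+build stays in SQL:
+
+```ts
+// Hypothetical in-memory model of the rebuild; illustration only.
+export interface SearchNameRow {
+  organisationName: string;
+  name: string;
+  kind: 'current' | 'previous';
+}
+
+export function buildSearchNames(
+  hmrcOrgNames: string[],
+  formerNamesByOrg: Map<string, string[]>,
+): SearchNameRow[] {
+  // Keyed by (organisationName, name), standing in for the composite PK.
+  const rows = new Map<string, SearchNameRow>();
+  const key = (org: string, name: string) => JSON.stringify([org, name]);
+  const hmrc = new Set(hmrcOrgNames);
+
+  // Current names go in first, so they win any later conflict.
+  hmrc.forEach((org) => {
+    rows.set(key(org, org), { organisationName: org, name: org, kind: 'current' });
+  });
+
+  formerNamesByOrg.forEach((formerNames, org) => {
+    if (!hmrc.has(org)) return; // mirrors the EXISTS filter
+    for (const raw of formerNames) {
+      const name = raw.trim(); // btrim() strips spaces only; trim() strips all whitespace
+      if (name === '') continue; // drop blanks
+      const k = key(org, name);
+      if (!rows.has(k)) {
+        // ON CONFLICT DO NOTHING: an existing row is left untouched
+        rows.set(k, { organisationName: org, name, kind: 'previous' });
+      }
+    }
+  });
+
+  return Array.from(rows.values());
+}
+```
+
+Because the key is an exact string pair, this model shows the same limit as the
+PK: a former name equal to the current name is dropped only when the two strings
+are identical.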
+
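+**When to run:** after each HMRC ingestion, alongside sitemap regeneration (the
+post-ingestion step from the "regenerate sitemaps after ingestion" chore). Names
+change rarely, so a periodic full rebuild is fine for v1. Run the three statements
+in one transaction so concurrent searches wait for the rebuild instead of seeing
+an empty or half-built table.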
+
+## 3. Rewrite `searchHmrc`
+
+Scan `search_names`, take the best-scoring name per org (current or former),
+then join back to `hmrc_skilled_workers` for the per-route display rows. Easiest
+as a raw `sql` CTE via `db.execute` (the multi-CTE shape is awkward in the query
+builder):
+
+```sql
+WITH scored AS (
+  SELECT sn.organisation_name, sn.name, sn.kind,
+         (CASE
+            WHEN sn.name ~* $prefix       THEN 2.0 + word_similarity($q, sn.name)
+            WHEN sn.name ~* $wordBoundary THEN 1.0 + word_similarity($q, sn.name)
+            ELSE word_similarity($q, sn.name)
+          END) - CASE WHEN sn.kind = 'previous' THEN 0.05 ELSE 0 END AS score
+  FROM search_names sn
+  WHERE sn.name ~* $wordBoundary
+     OR word_similarity($q, sn.name) > 0.6
+     OR similarity($q, sn.name) > 0.5
+),
+best AS (                                   -- one winning name per org (dedupe)
+  SELECT DISTINCT ON (organisation_name)
+         organisation_name, score, kind, name AS matched_name
+  FROM scored
+  ORDER BY organisation_name, score DESC
+)
+SELECT hsw.hash AS slug_id, hsw.organisation_name, hsw.name_slug, hsw.town_city,
+       hsw.county, hsw.type_rating, hsw.route, b.score,
+       CASE WHEN b.kind = 'previous' THEN b.matched_name END AS matched_former_name
+FROM best b
+JOIN hmrc_skilled_workers hsw ON hsw.organisation_name = b.organisation_name
+ORDER BY b.score DESC, hsw.organisation_name ASC, hsw.hash ASC  -- stable tie-break for OFFSET paging
+LIMIT $pageSizePlus1 OFFSET $offset;
+```
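+
+The `scored` and `best` CTEs can be mirrored in TypeScript to check ranking
+behaviour without Postgres. This is only a sketch: the regex results and
+`word_similarity` value are supplied by the caller rather than computed, the
+`WHERE` filter is not modelled, and all names (`Candidate`, `score`,
+`bestPerOrg`) are hypothetical.
+
+```ts
+// Hypothetical model of the `scored` and `best` CTEs; illustration only.
+export type Kind = 'current' | 'previous';
+
+export interface Candidate {
+  organisationName: string;
+  name: string;
+  kind: Kind;
+  matchesPrefix: boolean;       // result of name ~* $prefix
+  matchesWordBoundary: boolean; // result of name ~* $wordBoundary
+  wordSimilarity: number;       // word_similarity($q, name), supplied by the caller
+}
+
+export interface Best {
+  organisationName: string;
+  score: number;
+  kind: Kind;
+  matchedName: string;
+}
+
+const PREVIOUS_PENALTY = 0.05;
+
+// Same CASE ladder as the SQL: prefix 2.0, word-boundary 1.0, else trigram only.
+export function score(c: Candidate): number {
+  const tier = c.matchesPrefix ? 2.0 : c.matchesWordBoundary ? 1.0 : 0;
+  return tier + c.wordSimilarity - (c.kind === 'previous' ? PREVIOUS_PENALTY : 0);
+}
+
+// DISTINCT ON (organisation_name) ... ORDER BY score DESC: keep the top name per org.
+export function bestPerOrg(candidates: Candidate[]): Best[] {
+  const best = new Map<string, Best>();
+  for (const c of candidates) {
+    const s = score(c);
+    const seen = best.get(c.organisationName);
+    if (!seen || s > seen.score) {
+      best.set(c.organisationName, {
+        organisationName: c.organisationName,
+        score: s,
+        kind: c.kind,
+        matchedName: c.name,
+      });
+    }
+  }
+  // Final ordering: score DESC, then organisation name ASC.
+  return Array.from(best.values()).sort(
+    (a, b) => b.score - a.score || a.organisationName.localeCompare(b.organisationName),
+  );
+}
+```
+
+With an exact current-name match for one company and an equally exact
+former-name match for another, the current-name match sorts first because of
+the `0.05` penalty.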
+
+- Keep the `< 3` chars early-return, `regexEscaped`, `$prefix = '^'+escaped`,
+  `$wordBoundary = '\m'+escaped`, and the `PAGE_SIZE + 1` / `hasMore` logic.
+- The join fans an org back out to its per-route rows — identical result shape to
+  today; `offset` paginates over those rows as before.
+- `-0.05` on previous matches keeps an exact current-name match ahead of an equal
+  former-name match of a *different* company. Tunable.
+- `pg_trgm` lowercases trigrams, so uppercase-stored former names match lowercase
+  queries; `~*` is explicitly case-insensitive. No casing work needed.
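+
+The parameter handling described above might look like the following. This is a
+sketch under assumptions: `escapeRegex` and `buildSearchPatterns` are
+hypothetical names (the doc only names `regexEscaped`), and trimming the query
+before the length check is assumed rather than confirmed.
+
+```ts
+// Hypothetical helpers for the $q / $prefix / $wordBoundary bindings.
+const MIN_QUERY_LENGTH = 3;
+
+export interface SearchPatterns {
+  q: string;            // bound as $q
+  prefix: string;       // bound as $prefix
+  wordBoundary: string; // bound as $wordBoundary
+}
+
+// Escape regex metacharacters so user input is matched literally by `~*`.
+export function escapeRegex(input: string): string {
+  return input.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
+}
+
+// Returns null for queries shorter than 3 characters (the early-return case).
+export function buildSearchPatterns(rawQuery: string): SearchPatterns | null {
+  const q = rawQuery.trim(); // assumption: whitespace does not count toward length
+  if (q.length < MIN_QUERY_LENGTH) return null;
+  const escaped = escapeRegex(q);
+  return {
+    q,
+    prefix: '^' + escaped,
+    wordBoundary: '\\m' + escaped, // \m is Postgres' start-of-word anchor
+  };
+}
+```
+
+The patterns are passed as bound parameters, not spliced into the SQL text, so
+escaping here only protects regex semantics.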
+
+## 4. API + type changes
+
+- `HmrcRow` gains `matchedFormerName: string | null`.
+- `searchHmrc` returns it per row (raw `matched_former_name` → camelCase).
+- Existing `slugId/organisationName/...` fields unchanged → home page + MCP keep working.
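+
+A sketch of the row mapping and `hasMore` handling, for illustration. The
+interfaces below are assumptions: field names follow the SQL columns above, but
+the field types, the `toPage` helper, and the `RawSearchRow` name are
+hypothetical and not taken from the repo.
+
+```ts
+// Hypothetical shapes; field types are assumptions, not copied from the repo.
+export interface RawSearchRow {
+  slug_id: string;
+  organisation_name: string;
+  name_slug: string;
+  town_city: string | null;
+  county: string | null;
+  type_rating: string;
+  route: string;
+  score: number;
+  matched_former_name: string | null;
+}
+
+export interface HmrcRow {
+  slugId: string;
+  organisationName: string;
+  nameSlug: string;
+  townCity: string | null;
+  county: string | null;
+  typeRating: string;
+  route: string;
+  matchedFormerName: string | null; // new field
+}
+
+// `rawRows` is the result of a query run with LIMIT pageSize + 1.
+export function toPage(rawRows: RawSearchRow[], pageSize: number) {
+  const hasMore = rawRows.length > pageSize; // the extra row signals another page
+  const rows: HmrcRow[] = rawRows.slice(0, pageSize).map((r) => ({
+    slugId: r.slug_id,
+    organisationName: r.organisation_name,
+    nameSlug: r.name_slug,
+    townCity: r.town_city,
+    county: r.county,
+    typeRating: r.type_rating,
+    route: r.route,
+    matchedFormerName: r.matched_former_name,
+  }));
+  return { rows, hasMore };
+}
+```
+
+Rows matched on their current name carry `matchedFormerName: null`, so existing
+consumers that ignore the field see no change.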
+
+## 5. UI: explain *why* a result matched
+
+When `matchedFormerName` is set, show a hint on the card so a result for
+`Motodynamics` reading "PhysicsX" isn't confusing.
+
+- [apps/web/src/components/HmrcCard.tsx](../apps/web/src/components/HmrcCard.tsx):
+  small muted line, e.g. `Formerly {titleCase(matchedFormerName)}`.
+- Thread `matchedFormerName` through
+  [HmrcResults.tsx](../apps/web/src/components/HmrcResults.tsx) (and its
+  `useCardMetrics` height config — a new line affects card height; see the
+  CLAUDE.md "Pretext virtual list sizing" notes).
+
+## Rollout order
+
+1. Add `searchNames` to schema → generate + apply migration (table + GIN trgm index).
+2. Add `build-search-names.ts`; run once to populate.
+3. Wire the rebuild into the post-ingestion step.
+4. Rewrite `searchHmrc` to scan `search_names` + join.
+5. Add `matchedFormerName` to `HmrcRow` and the return.
+6. `HmrcCard` "Formerly …" hint + thread through `HmrcResults` (+ card-height config).
+
+## Edge cases & risks
+
+- **Former == current name** → dropped by PK + insert order. ✓
+- **A former name shared by multiple companies** → each org keeps its own row; all
+  surface. ✓
+- **Multiple routes per org** → join fan-out, same as today. ✓
+- **Staleness** → previous-name search lags until the rebuild runs post-ingestion;
+  acceptable for v1. *Future:* incremental upserts when a profile's
+  `previous_company_names` changes, via the ch-stream / cache-invalidation pipeline.
+- **Card height** → adding the "Formerly" line must be reflected in the pretext
+  height config or the virtual list will mis-measure (CLAUDE.md).
+- **Perf** → GIN trgm over ~170k rows is well within budget; scoring runs only on
+  index-matched rows, comparable to today.
+
+## Verification
+
+- `Motodynamics Ltd` and `motodynamics` both surface PhysicsX with "Formerly
+  Motodynamics Ltd".
+- Current-name search results + ordering unchanged vs. today (same scoring math).
+- MCP search tool returns the new field without breaking.
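+- `EXPLAIN` confirms the GIN trgm index is used (no seq scan on `search_names`).
+  Only the `~*` branch and the `pg_trgm` operators (`<%`, `%`) are index-assisted;
+  if the function-form `word_similarity(...) > 0.6` / `similarity(...) > 0.5`
+  predicates force a seq scan, switch them to the operators with
+  `pg_trgm.word_similarity_threshold` / `pg_trgm.similarity_threshold` set to match.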
+
+## Effort
+
+~1–2 days: schema + migration (S) · build script + wiring (S) · `searchHmrc`
+rewrite (M) · type + card UI + height config (M) · verification (S).
+
+## Dependency note
+
+Independent of the `feat/previous-companies` PR: the build script reads
+`companies_house_profiles.previous_company_names` directly to populate the new
+`search_names` table, rather than going through the `getCompanyProfile` server fn.
+Branched off `main` so it merges cleanly as its own PR.