feat: add core tags infrastructure — TagType dataclass, CLI, and registry by Chisomnwa · Pull Request #11 · Open-Book-Genome-Project/tags

Chisomnwa · 2026-05-29T16:40:54Z

Core Infrastructure PR — Part of Issue #9

This introduces the shared core for the tags project, following the architecture outlined in Issue #9 and its Appendix.

This is PR #1 of the Issue #9 work breakdown. PR #2 will migrate the existing type folders into the tag_types/<name>/ structure.

What This PR Adds

New files:

| File | Purpose |
| --- | --- |
| `tags/tag_type.py` | `TagType` dataclass: the profile card for each tag type (name, directory, vocabulary, mappings, optional `classify_fn`) |
| `tags/classify.py` | `default_classify` function: normalized lookup against a type's mappings |
| `tags/__init__.py` | `load_all()`: reads `tag_types/registry.json`, loads each type's vocabulary and mappings, returns a sorted list of `TagType` instances |
| `tags/cli.py` | Single CLI entry point with `analyze` and `unmapped` subcommands |
| `tag_types/registry.json` | Explicit manifest of all registered tag types with load priorities |
| `pyproject.toml` | Makes the `tags` package installable, registers the `tags` CLI command |
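For reviewers who want a mental model before opening the diff, here is a minimal sketch of how the `TagType` dataclass and `load_all()` could fit together. It is not the code in this PR. The registry shape (`{"types": [...]}`), the per-type file names, the ascending sort by `priority`, and the `_read_json` helper are all assumptions made for illustration.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional


@dataclass
class TagType:
    """Profile card for one tag type (field names follow the PR description)."""
    name: str
    directory: Path
    vocabulary: list = field(default_factory=list)
    mappings: dict = field(default_factory=dict)
    priority: int = 0  # load priority from the registry
    classify_fn: Optional[Callable[..., Optional[str]]] = None


def _read_json(path: Path, default):
    """Hypothetical helper: parse a JSON file, or return default if it is absent."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else default


def load_all(root: Path) -> list:
    """Build one TagType per registry entry, sorted by priority.

    ASSUMED registry shape: {"types": [{"name": ..., "priority": ...}, ...]}
    ASSUMED per-type files: <root>/<name>/vocabulary.json and mappings.json
    """
    registry = json.loads((root / "registry.json").read_text(encoding="utf-8"))
    loaded = []
    for entry in registry["types"]:
        type_dir = root / entry["name"]
        loaded.append(TagType(
            name=entry["name"],
            directory=type_dir,
            vocabulary=_read_json(type_dir / "vocabulary.json", default=[]),
            mappings=_read_json(type_dir / "mappings.json", default={}),
            priority=entry.get("priority", 0),
        ))
    return sorted(loaded, key=lambda t: t.priority)


# Throwaway demo data in a temp directory (made up, not the real registry).
demo_root = Path(tempfile.mkdtemp())
(demo_root / "registry.json").write_text(json.dumps({
    "types": [{"name": "subgenres", "priority": 20},
              {"name": "genres", "priority": 10}],
}), encoding="utf-8")
(demo_root / "genres").mkdir()
(demo_root / "genres" / "mappings.json").write_text(
    json.dumps({"sci-fi": "science_fiction"}), encoding="utf-8")

tag_types = load_all(demo_root)
print([t.name for t in tag_types])  # lowest priority number comes first
```

Keeping the registry explicit means a type only loads when it is listed, so a stray folder under `tag_types/` cannot silently change results.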

Modified files:

| File | Change |
| --- | --- |
| `CONTRIBUTING.md` | Added a Data Contracts section: normalization rules, slug-based mapping values, multi-type output policy, slug stability with `old_slugs` aliases |
| `.gitignore` | Added `ol_tags.egg-info/` |

Old scripts left untouched (will be removed after CLI is verified):

scripts/analyze_genres.py
scripts/analyze_tags.py
scripts/analyze_unmapped.py

Data Contract Decisions

Settled with @mekarpeles in the May 26 1:1 and documented in CONTRIBUTING.md:

Normalization: lowercase + strip + NFC applied consistently
Mapping values: slugs, not display names
Multi-type output: pipeline collects matches from all types, no short-circuit
Slug stability: old_slugs alias list in vocabulary.json
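The four decisions above can be sketched in a few lines. This is an illustrative sketch, not the code in `tags/classify.py`: the function signatures, the `classify_all` and `resolve_slug` names, the vocabulary entry shape (`{"slug": ..., "old_slugs": [...]}`), and the sample data are assumptions.

```python
import unicodedata
from typing import Optional


def normalize(raw: str) -> str:
    """Strip, lowercase, then NFC-compose: the three rules in the data contract."""
    return unicodedata.normalize("NFC", raw.strip().lower())


def default_classify(subject: str, mappings: dict) -> Optional[str]:
    """Return the slug mapped to the normalized subject, or None if unmapped.

    Assumes the mapping keys were normalized the same way when they were written.
    """
    return mappings.get(normalize(subject))


def classify_all(subject: str, mappings_by_type: dict) -> dict:
    """Collect a match from every type that has one; never stop at the first hit."""
    matches = {}
    for type_name, mappings in mappings_by_type.items():
        slug = default_classify(subject, mappings)
        if slug is not None:
            matches[type_name] = slug
    return matches


def resolve_slug(slug: str, vocabulary: list) -> Optional[str]:
    """Map a current or retired slug to the current slug via old_slugs aliases."""
    for entry in vocabulary:
        if slug == entry["slug"] or slug in entry.get("old_slugs", []):
            return entry["slug"]
    return None


# Made-up sample data; mapping values are slugs, not display names.
mappings_by_type = {
    "genres": {"science fiction": "science_fiction"},
    "moods": {"science fiction": "cerebral"},
    "content_formats": {"comics": "comics"},
}
vocabulary = [{"slug": "science_fiction", "old_slugs": ["sci_fi"]}]

print(classify_all("  Science Fiction ", mappings_by_type))
print(resolve_slug("sci_fi", vocabulary))
```

Normalizing after lowercasing matters because lowercasing can produce decomposed characters, which NFC then recomposes so that visually identical subjects hit the same key.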

CLI Verification: Old Scripts vs New CLI

The analysis logic from scripts/analyze_tags.py has been absorbed into tags analyze. Both should produce identical coverage numbers.

Old script:

python scripts/analyze_tags.py --dump <path_to_dump> genres --limit 100000

New CLI:

tags analyze genres <path_to_dump> --limit 100000


The two screenshots above show identical coverage numbers when analyzing works with genre tags. This confirms the new core architecture produces the same results as the old per-type scripts — but with the key difference that all analysis tooling is now unified under a single tags command. Future analysis on any tag type (subgenres, moods, content_formats, etc.) will use this same shared core, with no new scripts needed per type.
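To illustrate the "single `tags` command" idea, here is a hedged sketch of a subcommand layout using `argparse`. The actual `tags/cli.py` may be organized differently. The subcommand names and the `--limit`, `--progress`, and `--top` flags come from the commands in this description; the argument names `tag_type` and `dump`, the defaults, and sharing `--progress` with `unmapped` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a `tags` CLI with `analyze` and `unmapped` subcommands."""
    parser = argparse.ArgumentParser(prog="tags")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # Arguments shared by both subcommands live on a parent parser,
    # so adding a new subcommand does not mean redeclaring them.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument("tag_type", help="registered tag type, e.g. genres")
    shared.add_argument("dump", help="path to the dump file")
    shared.add_argument("--limit", type=int, default=None,
                        help="stop after this many records (assumed meaning)")
    shared.add_argument("--progress", type=int, default=None,
                        help="report progress every N records (assumed meaning)")

    subcommands.add_parser("analyze", parents=[shared])
    unmapped = subcommands.add_parser("unmapped", parents=[shared])
    unmapped.add_argument("--top", type=int, default=50,
                          help="how many unmapped subjects to list")
    return parser


# Parse the same argument shape used in the walkthrough commands.
args = build_parser().parse_args(
    ["unmapped", "genres", "dump.txt", "--limit", "500000", "--top", "50"])
print(args.command, args.tag_type, args.limit, args.top)
```

Because the tag type is a positional argument rather than part of the script name, running the same analysis on another registered type only changes one word on the command line.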

Below is a walkthrough of the same analysis stages that were carried out with the old architecture.

Coverage Analysis: Before and After Mapping Expansion

To demonstrate the CLI works end-to-end, I swapped the old 77-entry mappings/genres.json with the current 158-entry version from PR #7 and ran coverage before and after.

Step 1 — Baseline coverage (old mappings, 77 entries):

tags analyze genres <path_to_dump> --progress 1000000

Step 2 — Find unmapped subjects (old mappings):

tags unmapped genres <path_to_dump> --limit 500000 --top 50

[Screenshot: step 2, top 50 unmapped subject strings (new script)]

Step 3 — Restore current mappings (swap back to the 158-entry mappings/genres.json from PR #7).

Step 4 — Expanded coverage (current mappings, 158 entries):

tags analyze genres <path_to_dump> --progress 1000000

Coverage improvement: 1.72% → 4.34% (about a 2.5x increase, roughly matching PR #7 results).
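For anyone checking the arithmetic, the two coverage figures can be compared both as a percentage-point gain and as a ratio. The snippet only uses the two numbers reported above; the variable names are illustrative.

```python
before_pct = 1.72  # coverage with the old 77-entry mappings
after_pct = 4.34   # coverage with the current 158-entry mappings

gain_points = after_pct - before_pct  # absolute gain in percentage points
ratio = after_pct / before_pct        # relative size versus the baseline

print(f"+{gain_points:.2f} percentage points, {ratio:.2f}x the baseline")
# prints: +2.62 percentage points, 2.52x the baseline
```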

Step 5 - Verify remaining unmapped subjects (current mappings):

tags unmapped genres <path_to_dump> --top 50

[Screenshot: step 5, top 50 unmapped subject strings after mapping expansion (new script)]
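The two measurements used in the steps above can be sketched as follows. This is an assumption-laden illustration, not the logic in `tags/cli.py`: it assumes each work is a dict with an optional `subjects` list, defines coverage as the share of works with at least one mapped subject, and uses made-up sample data. The names `coverage_pct`, `top_unmapped`, and `_norm` are hypothetical.

```python
import unicodedata
from collections import Counter


def _norm(subject: str) -> str:
    """Same normalization as the data contract: strip, lowercase, NFC."""
    return unicodedata.normalize("NFC", subject.strip().lower())


def coverage_pct(works: list, mappings: dict) -> float:
    """Percentage of works with at least one subject found in mappings (assumed definition)."""
    if not works:
        return 0.0
    mapped = sum(
        1 for work in works
        if any(_norm(s) in mappings for s in work.get("subjects", []))
    )
    return 100.0 * mapped / len(works)


def top_unmapped(works: list, mappings: dict, top: int = 50) -> list:
    """Most frequent normalized subject strings that have no mapping entry."""
    counts = Counter()
    for work in works:
        for subject in work.get("subjects", []):
            key = _norm(subject)
            if key and key not in mappings:
                counts[key] += 1
    return counts.most_common(top)


# Made-up records standing in for parsed dump rows.
works = [
    {"subjects": ["Fiction", "History"]},
    {"subjects": ["history ", "Poetry"]},
    {"subjects": ["fiction"]},
    {},  # a work with no subjects at all
]
mappings = {"fiction": "fiction"}

print(coverage_pct(works, mappings))         # 50.0
print(top_unmapped(works, mappings, top=2))  # [('history', 2), ('poetry', 1)]
```

Feeding the top unmapped strings back into the mappings file and rerunning the coverage step is the loop that steps 1 through 5 walk through.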

What's Next (Future PRs)

PR #2: Migrate existing type folders (genres/, subgenres/, etc.) into tag_types/<name>/ and move mappings/*.json into each type's directory
Future: tags classify, tags validate, tags types subcommands
Future: TagMatch dataclass with evidence tracking (from Issue #2 cross-reference)
Future: Testing suite (pytest)

Closes: None yet — tracking Issue #9.

Reviewers: @mekarpeles


Chisomnwa added 12 commits May 28, 2026 22:50

Edit CONTRIBUTING.md file to add Data Contract Decisions

1e17386

feat: add TagType dataclass, classify function, pyproject.toml

9d752ac

docs: fix typo in tag_type.py module docstring

eb565ed

feat: add tag type registry and load_all() function

698d943

chore: add newline to end of registry.json

be04298

fix: add priority attribute to TagType dataclass

27d1396

fix: add priority attribute to TagType dataclass

2dcbeda

feat: implement tags CLI with analyze and unmapped commands

7b40b07

feat: implement tags CLI with analyze and unmapped commands

5c98ac5

fix: correct pyproject.toml syntax - requires-python

c06fbc1

feat: finalize CLI implementation with full code, comments, and error handling

aa0d477

chore: add blank lines for readability between cli.py sections

464a23e

Chisomnwa wants to merge 12 commits into Open-Book-Genome-Project:main from Chisomnwa:core/cli-analysis-tools
