feat: add core tags infrastructure — TagType dataclass, CLI, and registry #11

Draft
Chisomnwa wants to merge 12 commits into Open-Book-Genome-Project:main from Chisomnwa:core/cli-analysis-tools

Chisomnwa commented May 29, 2026

Core Infrastructure PR — Part of Issue #9

This introduces the shared core for the tags project, following the architecture outlined in Issue #9 and its Appendix.

This is PR #1 of the Issue #9 work breakdown. PR #2 will migrate the existing type folders into the tag_types/<name>/ structure.


What This PR Adds

New files:

| File | Purpose |
| --- | --- |
| tags/tag_type.py | TagType dataclass: the profile card for each tag type (name, directory, vocabulary, mappings, optional classify_fn) |
| tags/classify.py | default_classify function: normalized lookup against a type's mappings |
| tags/__init__.py | load_all(): reads tag_types/registry.json, loads each type's vocabulary and mappings, returns a sorted list of TagType instances |
| tags/cli.py | Single CLI entry point with analyze and unmapped subcommands |
| tag_types/registry.json | Explicit manifest of all registered tag types with load priorities |
| pyproject.toml | Makes the tags package installable, registers the tags CLI command |
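
For reviewers who want a picture of how the files listed above fit together, here is a minimal sketch. It is not the code in this PR. The registry layout (`{"types": [{"name": ..., "priority": ...}]}`), the `mappings.json` file name, the `priority` field, and the `root` argument to `load_all` are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional


@dataclass
class TagType:
    """Profile card for one tag type (field names follow the PR description)."""
    name: str
    directory: Path
    vocabulary: dict = field(default_factory=dict)
    mappings: dict = field(default_factory=dict)
    classify_fn: Optional[Callable] = None
    priority: int = 0  # assumption: load priority copied from the registry


def _read_json(path: Path) -> dict:
    """Return the parsed JSON file, or an empty dict if the file is missing."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}


def load_all(root: Path) -> list:
    """Load every registered type and return them sorted by priority, then name."""
    registry = _read_json(root / "registry.json")
    types = []
    for entry in registry.get("types", []):  # assumed registry layout
        type_dir = root / entry["name"]
        types.append(TagType(
            name=entry["name"],
            directory=type_dir,
            vocabulary=_read_json(type_dir / "vocabulary.json"),
            mappings=_read_json(type_dir / "mappings.json"),  # assumed file name
            priority=entry.get("priority", 0),
        ))
    return sorted(types, key=lambda t: (t.priority, t.name))
```

Keeping the manifest explicit means a type is only loaded if it is listed in the registry, so a stray folder under tag_types/ is never picked up by accident.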

Modified files:

| File | Change |
| --- | --- |
| CONTRIBUTING.md | Added Data Contracts section: normalization rules, slug-based mapping values, multi-type output policy, slug stability with old_slugs aliases |
| .gitignore | Added ol_tags.egg-info/ |

Old scripts left untouched (will be removed after CLI is verified):

  • scripts/analyze_genres.py
  • scripts/analyze_tags.py
  • scripts/analyze_unmapped.py

Data Contract Decisions

Settled with @mekarpeles in the May 26 1:1 and documented in CONTRIBUTING.md:

  • Normalization: lowercase + strip + NFC applied consistently
  • Mapping values: slugs, not display names
  • Multi-type output: pipeline collects matches from all types, no short-circuit
  • Slug stability: old_slugs alias list in vocabulary.json
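
The four decisions above can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of tags/classify.py. It assumes mapping keys are already stored in normalized form and that vocabulary.json holds a list of entries with `slug` and `old_slugs` keys. `classify_all` and `resolve_slug` are hypothetical helper names.

```python
import unicodedata


def normalize(subject: str) -> str:
    """Lowercase + strip + NFC, applied in one place so every caller agrees."""
    return unicodedata.normalize("NFC", subject.strip().lower())


def default_classify(subject: str, mappings: dict):
    """Return the slug mapped to the normalized subject, or None if unmapped."""
    # Assumption: mapping keys are already normalized; values are slugs.
    return mappings.get(normalize(subject))


def classify_all(subject: str, mappings_by_type: dict) -> dict:
    """Collect a match from every type that maps the subject (no short-circuit)."""
    matches = {}
    for type_name, mappings in mappings_by_type.items():
        slug = default_classify(subject, mappings)
        if slug is not None:
            matches[type_name] = slug
    return matches


def resolve_slug(slug: str, vocabulary: list):
    """Translate a retired slug to its current one via old_slugs aliases."""
    for entry in vocabulary:  # assumed layout: [{"slug": ..., "old_slugs": [...]}]
        if slug == entry["slug"] or slug in entry.get("old_slugs", []):
            return entry["slug"]
    return None
```

For example, `classify_all("  Horror ", {"genres": {"horror": "horror"}, "moods": {"horror": "scary"}})` returns a match for both types, which is the multi-type behavior described above.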

CLI Verification: Old Scripts vs New CLI

The analysis logic from scripts/analyze_tags.py has been absorbed into tags analyze. Both should produce identical coverage numbers.

Old script:

python scripts/analyze_tags.py --dump <path_to_dump> genres --limit 100000

[Screenshot: general_coverage(old_script)]

New CLI:

tags analyze genres <path_to_dump> --limit 100000

[Screenshot: general_coverage(new_script)]

The two screenshots above show identical coverage numbers when analyzing works with genre tags. This confirms that the new core architecture produces the same results as the old per-type scripts, with the key difference that all analysis tooling is now unified under a single tags command. Future analysis of any tag type (subgenres, moods, content_formats, etc.) will use this shared core, with no new per-type scripts needed.
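
To make the "single tags command" idea concrete, here is a rough argparse sketch of a parser that accepts the invocations shown in this description. It is not the code in tags/cli.py. The argument names, the `--top` default, and `--progress` being available on `unmapped` are assumptions, and `dump.txt` is a placeholder path.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with analyze and unmapped subcommands."""
    parser = argparse.ArgumentParser(prog="tags")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # Arguments shared by both subcommands, so the tag type is just a value.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("tag_type", help="registered tag type, e.g. genres")
    common.add_argument("dump", help="path to the dump file")
    common.add_argument("--limit", type=int, default=None, help="stop after N records")
    common.add_argument("--progress", type=int, default=None, help="report every N records")

    subcommands.add_parser("analyze", parents=[common], help="report mapping coverage")
    unmapped = subcommands.add_parser(
        "unmapped", parents=[common], help="list the most frequent unmapped subjects"
    )
    unmapped.add_argument("--top", type=int, default=50, help="how many subjects to show")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["unmapped", "genres", "dump.txt", "--limit", "500000", "--top", "50"]
)
print(args.command, args.tag_type, args.limit, args.top)  # unmapped genres 500000 50
```

Because the tag type is a positional argument rather than a separate script, supporting a new type only requires registering it, not writing another entry point.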


Below is a walkthrough of the same analysis stages that were carried out with the old architecture, this time run through the new CLI.


Coverage Analysis: Before and After Mapping Expansion

To demonstrate that the CLI works end-to-end, I temporarily swapped the old 77-entry mappings/genres.json in for the current 158-entry version from PR #7 and ran coverage before and after restoring it.

Step 1 — Baseline coverage (old mappings, 77 entries):

tags analyze genres <path_to_dump> --progress 1000000

[Screenshot: step1_baseline_coverage(new_script)]

Step 2 — Find unmapped subjects (old mappings):

tags unmapped genres <path_to_dump> --limit 500000 --top 50

[Screenshot: step2_top50_unmapped_subject_strings(new_script)]
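
As a rough idea of what a step like this involves, the sketch below counts normalized subject strings that have no mapping and returns the most frequent ones. It is not the implementation behind tags unmapped. The `top_unmapped` name and the record shape (each record is a list of subject strings) are assumptions, since the dump format is not described here.

```python
import unicodedata
from collections import Counter
from itertools import islice


def top_unmapped(records, mappings, limit=None, top=50):
    """Return the `top` most frequent subjects that have no entry in `mappings`."""
    counts = Counter()
    # islice with limit=None reads every record; otherwise it stops after `limit`.
    for subjects in islice(records, limit):
        for subject in subjects:
            key = unicodedata.normalize("NFC", subject.strip().lower())
            if key not in mappings:
                counts[key] += 1
    return counts.most_common(top)


# Tiny made-up sample, only to show the call shape.
sample = [["Fiction", "Dragons"], ["dragons ", "Horror"], ["Dragons"]]
print(top_unmapped(sample, {"fiction": "fiction", "horror": "horror"}, top=1))
```

The highest-count strings in this list are the natural candidates for new mapping entries.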

Step 3 — Restore current mappings (swap back to the 158-entry mappings/genres.json from PR #7).

Step 4 — Expanded coverage (current mappings, 158 entries):

tags analyze genres <path_to_dump> --progress 1000000

[Screenshot: step4_analysis_aftyer_mapping]

Coverage improvement: 1.72% → 4.34% (about a 2.5x increase, roughly matching PR #7 results).
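
The ratio between the two reported figures can be checked directly. The `coverage` function below is only a guess at the metric, assuming coverage means the share of records with at least one mapped subject. The exact definition used by tags analyze is not stated in this description.

```python
def coverage(records, mappings):
    """Percentage of records with at least one subject found in `mappings`."""
    total = matched = 0
    for subjects in records:
        total += 1
        if any(s.strip().lower() in mappings for s in subjects):
            matched += 1
    return 100.0 * matched / total if total else 0.0


# Figures reported in Steps 1 and 4.
baseline, expanded = 1.72, 4.34
print(f"{expanded / baseline:.2f}x relative increase")          # 2.52x relative increase
print(f"{expanded - baseline:.2f} percentage points gained")   # 2.62 percentage points gained
```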

Step 5 - Remaining unmapped subjects (current mappings):

tags unmapped genres <path_to_dump> --top 50

[Screenshot: step5_top50_unmapped_subject_strings_after_mapping(new_script)]

What's Next (Future PRs)

  • PR #2: Migrate existing type folders (genres/, subgenres/, etc.) into tag_types/<name>/ and move mappings/*.json into each type's directory
  • Future: tags classify, tags validate, tags types subcommands
  • Future: TagMatch dataclass with evidence tracking (from Issue #2 cross-reference)
  • Future: Testing suite (pytest)

Closes: None yet — tracking Issue #9.

Reviewers: @mekarpeles
