feat: add core tags infrastructure — TagType dataclass, CLI, and registry#11
Draft
Chisomnwa wants to merge 12 commits into
Core Infrastructure PR — Part of Issue #9
This introduces the shared core for the tags project, following the architecture outlined in Issue #9 and its Appendix.
This is PR #1 of the Issue #9 work breakdown. PR #2 will migrate existing type folders into the tag_types/<name>/ structure.
What This PR Adds
New files:
- tags/tag_type.py: TagType dataclass, the profile card for each tag type (name, directory, vocabulary, mappings, optional classify_fn)
- tags/classify.py: default_classify function, a normalized lookup against a type's mappings
- tags/__init__.py: load_all() reads tag_types/registry.json, loads each type's vocabulary and mappings, and returns a sorted list of TagType instances
- tags/cli.py: analyze and unmapped subcommands
- tag_types/registry.json
- pyproject.toml: makes the tags package installable and registers the tags CLI command
Modified files:
- CONTRIBUTING.md: old_slugs aliases
- .gitignore: ol_tags.egg-info/
Old scripts left untouched (will be removed after CLI is verified):
- scripts/analyze_genres.py
- scripts/analyze_tags.py
- scripts/analyze_unmapped.py
Data Contract Decisions
Settled with @mekarpeles in the May 26 1:1 and documented in CONTRIBUTING.md:
- old_slugs alias list in vocabulary.json
CLI Verification: Old Scripts vs New CLI
The analysis logic from scripts/analyze_tags.py has been absorbed into tags analyze. Both should produce identical coverage numbers.
Old script:
    python scripts/analyze_tags.py --dump <path_to_dump> genres --limit 100000
New CLI:
    tags analyze genres <path_to_dump> --limit 100000
The two screenshots above show identical coverage numbers when analyzing works with genre tags. This confirms that the new core architecture produces the same results as the old per-type scripts, with the key difference that all analysis tooling is now unified under a single tags command. Future analysis on any tag type (subgenres, moods, content_formats, etc.) will use this same shared core, with no new scripts needed per type.
Below is a walkthrough of the same analysis stages that were carried out with the old architecture.
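To make the shared core concrete, here is a minimal sketch of how a TagType and default_classify could fit together. It is an illustration, not the code in this PR: the field types, the normalize helper, the classify method, and the sample mapping are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


def normalize(subject: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(subject.lower().split())


@dataclass
class TagType:
    """Profile card for one tag type (field types are assumptions)."""
    name: str
    directory: str
    vocabulary: list[str] = field(default_factory=list)
    mappings: dict[str, str] = field(default_factory=dict)
    # Optional per-type override; falls back to default_classify when None.
    classify_fn: Optional[Callable[["TagType", str], Optional[str]]] = None

    def classify(self, subject: str) -> Optional[str]:
        # Hypothetical convenience method, not confirmed by the PR.
        fn = self.classify_fn or default_classify
        return fn(self, subject)


def default_classify(tag_type: TagType, subject: str) -> Optional[str]:
    """Normalized lookup of a raw subject against the type's mappings."""
    lookup = {normalize(key): tag for key, tag in tag_type.mappings.items()}
    return lookup.get(normalize(subject))


# Hypothetical sample data for illustration only.
genres = TagType(
    name="genres",
    directory="tag_types/genres",
    vocabulary=["science_fiction"],
    mappings={"Science Fiction": "science_fiction"},
)

print(genres.classify("  science   FICTION "))  # science_fiction
print(genres.classify("Cooking"))               # None
```

Under this design, a new tag type only needs its own vocabulary and mappings; a type with unusual matching rules could pass its own classify_fn without touching the CLI.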
Coverage Analysis: Before and After Mapping Expansion
To demonstrate that the CLI works end-to-end, I temporarily replaced the current 158-entry mappings/genres.json from PR #7 with the old 77-entry version, then ran coverage before and after restoring it.
Step 1 — Baseline coverage (old mappings, 77 entries):
    tags analyze genres <path_to_dump> --progress 1000000
Step 2 — Find unmapped subjects (old mappings):
    tags unmapped genres <path_to_dump> --limit 500000 --top 50
Step 3 — Restore current mappings (swap back to the 158-entry mappings/genres.json from PR #7).
Step 4 — Expanded coverage (current mappings, 158 entries):
    tags analyze genres <path_to_dump> --progress 1000000
Coverage improvement: 1.72% → 4.34% (about a 2.5x increase, roughly matching PR #7 results).
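The arithmetic behind the coverage comparison can be checked with a short sketch. The coverage definition used here (matched works divided by total works) and both function names are assumptions for illustration; only the two percentages come from the runs above.

```python
def coverage_pct(matched: int, total: int) -> float:
    """Assumed definition: percentage of works with at least one mapped subject."""
    return 100.0 * matched / total if total else 0.0


def improvement_factor(before_pct: float, after_pct: float) -> float:
    """Ratio of expanded coverage to baseline coverage."""
    return after_pct / before_pct


before, after = 1.72, 4.34  # percentages reported in Steps 1 and 4

factor = improvement_factor(before, after)
gain = after - before

print(f"{factor:.2f}x")           # 2.52x
print(f"{gain:.2f} points")       # 2.62 points
```

Reporting both the ratio and the percentage-point gain avoids overstating the change when the baseline is small.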
Step 5 — Verify remaining unmapped subjects:
    tags unmapped genres <path_to_dump> --top 50
What's Next (Future PRs)
- Migrate existing type folders (genres/, subgenres/, etc.) into tag_types/<name>/ and move mappings/*.json into each type's directory
- tags classify, tags validate, tags types subcommands
- TagMatch dataclass with evidence tracking (from Issue #2 cross-reference)
Closes: None yet — tracking Issue #9.
Reviewers: @mekarpeles