Centralize sample-dataset metadata (load_sample / tests / docs / MANIFEST.in) #774

@SaguaroDev

Description

In #771 (review of #745), @henrydingliu observed that sample-dataset column names are duplicated across three places:

  • the load_sample branch in chainladder/utils/utility_functions.py
  • the test that exercises the load (e.g. test_load_sample_clrd2025)
  • the dataset table in docs/library/sample_data.md

He suggested storing the metadata of available sample datasets and fields in one place, and using it to:

  • run load_sample()
  • drive the tests
  • generate MANIFEST.in
  • generate sample_data.md
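
A minimal sketch of how the two generated artifacts in the list above could fall out of one metadata dict. Everything here is an assumption for illustration: the `SAMPLES` dict, the `sample_a` / `sample_b` entries and their field values, and the helper names `render_markdown_table` and `render_manifest_in` are hypothetical, and the table layout is not the one currently used in docs/library/sample_data.md.

```python
# Hypothetical metadata: the sample names and field values below are
# placeholders, not the real schemas of the datasets in chainladder/utils/data/.
SAMPLES = {
    "sample_a": {
        "file": "sample_a.csv",
        "origin": "AccidentYear",
        "development": "DevelopmentYear",
        "index": ["LOB"],
        "columns": ["Paid", "Incurred"],
        "cumulative": True,
    },
    "sample_b": {
        "file": "sample_b.csv",
        "origin": "origin",
        "development": "development",
        "index": [],
        "columns": ["values"],
        "cumulative": False,
    },
}

DATA_DIR = "chainladder/utils/data"


def render_markdown_table(samples):
    """Render one Markdown table row per sample, sorted by sample name."""
    header = ["Dataset", "Origin", "Development", "Index", "Columns", "Cumulative"]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + " --- |" * len(header),
    ]
    for name in sorted(samples):
        meta = samples[name]
        cells = [
            name,
            meta["origin"],
            meta["development"],
            ", ".join(meta["index"]) or "-",  # "-" when there is no index column
            ", ".join(meta["columns"]),
            "yes" if meta["cumulative"] else "no",
        ]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


def render_manifest_in(samples, data_dir=DATA_DIR):
    """Return one MANIFEST.in `include` line per listed data file."""
    return [f"include {data_dir}/{samples[name]['file']}" for name in sorted(samples)]


if __name__ == "__main__":
    print(render_markdown_table(SAMPLES))
    print("\n".join(render_manifest_in(SAMPLES)))
```

A regen script could write the first output into the docs and the second into MANIFEST.in, so neither file is edited by hand.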

This applies to every sample dataset in chainladder/utils/data/, not just clrd2025 — the same triple-duplication exists today for clrd, berqsherm, xyz, the friedland family, etc. The fix is general:

  1. Define one manifest file (YAML or a Python dict in chainladder/utils/data/_manifest.py) keyed by sample name with origin, development, index, columns, cumulative, and any per-sample flags.
  2. Refactor load_sample to look up its config from the manifest rather than the long if key.lower() == ... chain.
  3. Generate the sample_data.md table at docs-build time (or commit the generated file with a regen script under scripts/).
  4. Either generate the MANIFEST.in entries for the CSVs under chainladder/utils/data/ from the manifest's listed files, or just keep the existing wildcard, whichever the maintainers prefer.
  5. Update tests to iterate over the manifest, so adding a new sample is a one-line change.
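
Steps 1, 2, and 5 could look roughly like the sketch below, assuming the Python-dict option for the manifest. The `SampleSpec` dataclass, the `MANIFEST` entries, and the helpers `get_sample_spec` and `triangle_kwargs` are hypothetical names, and the entries are placeholders rather than real dataset schemas. The existing `load_sample` body is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleSpec:
    """Per-sample configuration, mirroring the fields proposed in step 1."""
    filename: str
    origin: str
    development: str
    index: tuple = ()
    columns: tuple = ()
    cumulative: bool = True


# Hypothetical entries; a real manifest would list every dataset in
# chainladder/utils/data/ with its actual column names.
MANIFEST = {
    "sample_a": SampleSpec(
        filename="sample_a.csv",
        origin="AccidentYear",
        development="DevelopmentYear",
        index=("LOB",),
        columns=("Paid", "Incurred"),
    ),
    "sample_b": SampleSpec(
        filename="sample_b.csv",
        origin="origin",
        development="development",
        columns=("values",),
        cumulative=False,
    ),
}


def get_sample_spec(key, manifest=MANIFEST):
    """Case-insensitive lookup that replaces an if/elif chain on key.lower()."""
    try:
        return manifest[key.lower()]
    except KeyError:
        known = ", ".join(sorted(manifest))
        raise ValueError(
            f"Unknown sample dataset {key!r}; available: {known}"
        ) from None


def triangle_kwargs(spec):
    """Translate a spec into keyword arguments for building the triangle."""
    return {
        "origin": spec.origin,
        "development": spec.development,
        "index": list(spec.index),
        "columns": list(spec.columns),
        "cumulative": spec.cumulative,
    }
```

With this shape, a test could be parametrized over the manifest keys (for example with `pytest.mark.parametrize("name", sorted(MANIFEST))`), so a new dataset only needs a new manifest entry.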

Is your feature request aligned with the scope of the package?

  • Yes, absolutely!

Describe the solution you'd like, or your current workaround.

See above. The current workaround is the existing pattern: every new sample dataset (including #745's clrd2025) requires updating three files by hand.

Do you have any additional supporting notes?

Filed at @henrydingliu's suggestion in #771 (comment). Keeping #771 / #745 scoped to landing the data; this refactor is its own piece of work.
