feat: backfill changelog records for existing datasets #1746

Open
cka-y wants to merge 8 commits into main from feat/1639
Conversation

@cka-y cka-y commented Jun 23, 2026

Contributor

Summary

  • Add a backfill_changelog task to tasks_executor that walks existing dataset history and dispatches Cloud Tasks to the gtfs-datasets-comparer for each consecutive (base, new) dataset pair missing a changelog record
  • Rename gtfs_dataset_changelog columns from (previous_dataset_id, current_dataset_id) to (base_dataset_id, new_dataset_id) via a Liquibase migration, aligning the DB schema with the comparer HTTP API naming
  • Add a dedicated Cloud Tasks queue (gtfs-datasets-comparer-backfill-queue) with rate limiting (1 req/s, 10 concurrent) and retry config for backfill dispatches
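The pairing described in the first bullet can be sketched roughly as follows. This is an illustration only, not the code in backfill_changelog.py: `DatasetRef`, `consecutive_pairs`, and the `downloaded_at` field are hypothetical names, and the sketch assumes a feed's datasets can be ordered by download time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical stand-in for a dataset row (not the real ORM model)."""
    stable_id: str
    downloaded_at: datetime


def consecutive_pairs(
    datasets: Sequence[DatasetRef], datasets_per_feed: int = 3
) -> List[Tuple[str, str]]:
    """Return (base, new) stable-id pairs for the most recent datasets of one feed."""
    if datasets_per_feed < 2:
        return []  # a pair needs at least two datasets
    # Keep only the N most recent datasets, oldest first.
    recent = sorted(datasets, key=lambda d: d.downloaded_at)[-datasets_per_feed:]
    # Pair each dataset with its immediate successor.
    return [(base.stable_id, new.stable_id) for base, new in zip(recent, recent[1:])]
```

With the default of 3 datasets per feed, a feed with four datasets yields two pairs, built from its three most recent datasets.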

Key design decisions

  • Idempotent & restartable: pairs with existing changelog rows are skipped (unless force=True), and dispatched tasks run with disallow_overwrite=True by default
  • Rate-limited: limit caps feeds per invocation, datasets_per_feed controls how deep into history we look (default: 3 most recent → 2 pairs), and the Cloud Tasks queue throttles comparer invocations
  • Filtering: supports stable_feed_ids to target specific feeds and feeds_not_updated_days to focus on stale feeds
  • Dry-run by default: dry_run=True enumerates pairs without dispatching, making it safe to preview before running
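A minimal sketch of how the skip, force, and dry-run rules above might combine. The function name `run_backfill`, the `dispatch` callback, and the returned field names are assumptions made for illustration; the actual task and its response fields are documented in the tasks_executor README.

```python
from typing import Callable, Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (base_dataset_id, new_dataset_id)


def run_backfill(
    pairs: List[Pair],
    existing: Set[Pair],
    dispatch: Callable[[str, str, bool], None],
    *,
    force: bool = False,
    dry_run: bool = True,
    disallow_overwrite: bool = True,
) -> Dict[str, object]:
    """Select pairs lacking a changelog row and optionally dispatch them."""
    # Idempotency: skip pairs that already have a changelog row unless forced.
    selected = [p for p in pairs if force or p not in existing]
    if not dry_run:
        for base, new in selected:
            # Hypothetical callback standing in for the Cloud Tasks enqueue.
            dispatch(base, new, disallow_overwrite)
    return {
        "dry_run": dry_run,
        "pairs": selected,
        "skipped_existing": len(pairs) - len(selected),
        "dispatched": 0 if dry_run else len(selected),
    }
```

Calling it with the defaults only reports which pairs would be sent; passing `dry_run=False` is required before anything is dispatched, which mirrors the preview-first workflow described above.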

Note about testing

Only 25 feeds in DEV have enough comparable datasets (with extracted files) to produce changelog pairs. Backfill ran successfully against all of them with no issues found.

Further testing in QA is impractical for this feature: QA mirrors the production database but does not have access to the production GCS bucket where extracted GTFS files are stored. Since the comparer reads those files to compute diffs, changelog generation fails without them. Meaningful end-to-end validation will need to happen in PROD (starting with a scoped dry-run, then a small stable_feed_ids batch).

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@cka-y cka-y linked an issue Jun 23, 2026 that may be closed by this pull request
@cka-y cka-y changed the title from "Feat/1639" to "feat: backfill changelog records for existing datasets" Jun 23, 2026
@cka-y cka-y marked this pull request as ready for review June 23, 2026 20:03

Copilot AI left a comment

Contributor

Pull request overview

Adds an operational backfill path for GTFS dataset changelog records by introducing a new tasks_executor task that enumerates missing consecutive dataset pairs and dispatches rate-limited Cloud Tasks to the existing gtfs-datasets-comparer function, while aligning DB schema/API vocabulary from (previous, current) to (base, new).

Changes:

  • Introduces backfill_changelog task + unit tests to enumerate/dispatch missing changelog pairs with idempotency controls (dry_run, force, filters, limits).
  • Renames gtfs_dataset_changelog columns/constraints/indexes to (base_dataset_id, new_dataset_id) via Liquibase and updates call sites accordingly.
  • Adds a dedicated Cloud Tasks queue for backfill dispatches and wires it into the tasks_executor function environment.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| liquibase/changes/feat_1639.sql | Renames changelog table columns + associated constraints/index to base/new vocabulary. |
| liquibase/changelog.xml | Registers the new Liquibase migration in the changelog sequence. |
| infra/functions-python/main.tf | Adds a backfill-specific Cloud Tasks queue and injects it into tasks_executor env vars. |
| functions-python/tasks_executor/src/tasks/changelog/backfill_changelog.py | Implements the backfill enumeration, idempotency checks, filtering, and task dispatch. |
| functions-python/tasks_executor/tests/tasks/changelog/test_backfill_changelog.py | Adds unit tests covering dry-run, dispatch, force, filtering, and edge cases. |
| functions-python/tasks_executor/src/main.py | Registers the new backfill_changelog task in the tasks map. |
| functions-python/tasks_executor/README.md | Documents the new task, parameters, env vars, and response fields. |
| functions-python/helpers/utils.py | Extends comparer Cloud Task helper to support disallow_overwrite flag. |
| functions-python/gtfs_datasets_comparer/src/main.py | Renames internal variables and DB upsert fields to base/new; updates conflict constraint name. |
| functions-python/gtfs_datasets_comparer/tests/test_main.py | Updates tests to match base/new naming and error messages. |
| functions-python/batch_process_dataset/src/pipeline_tasks.py | Updates changelog existence check to use base/new columns. |
| api/src/shared/database/database.py | Updates mapper cascade relationship comments to reflect base/new terminology. |

Comment thread functions-python/tasks_executor/src/main.py
-- Rename gtfs_dataset_changelog dataset columns from (previous, current) to (base, new)
-- so the database matches the comparer HTTP API naming (base_dataset_stable_id /
-- new_dataset_stable_id) and a single, consistent vocabulary is used everywhere
ALTER TABLE gtfs_dataset_changelog RENAME COLUMN previous_dataset_id TO base_dataset_id;

Member

👍

Development

Successfully merging this pull request may close these issues.

[Backend] Backfill changelog records from existing datasets

3 participants