Skip to content

Fix duplicate TaxaList names causing MultipleObjectsReturned#1108

Merged
mihow merged 5 commits intomainfrom
fix/duplicate-taxalist-names
Apr 15, 2026
Merged

Fix duplicate TaxaList names causing MultipleObjectsReturned#1108
mihow merged 5 commits intomainfrom
fix/duplicate-taxalist-names

Conversation

@mihow
Copy link
Copy Markdown
Collaborator

@mihow mihow commented Jan 31, 2026

Summary

  • Adds TaxaListQuerySet.get_or_create_for_project(name, project=None) to scope TaxaList lookups by project and survive pre-existing duplicates.
  • Updates the three existing callers (pipeline.py, import_taxa, update_taxa) to use the new method (all with project=None — see regression note below).
  • Adds migration 0083_dedupe_taxalist_names that renames existing duplicate rows so (name, scope) is unique going forward.

Problem

TaxaList.name has no unique constraint. Code paths using get_or_create(name=...) raise MultipleObjectsReturned in prod when duplicate-named lists exist, crashing ML pipeline runs.

Solution

New query method (ami/main/models.py)

get_or_create_for_project(name, project=None, **defaults):

  • project=None → scope = "global" (lists with zero project associations)
  • project=<Project> → scope = that project (via projects__ M2M lookup)
  • On DoesNotExist: creates inside transaction.atomic() so create + projects.add are a unit
  • On MultipleObjectsReturned: returns the oldest match instead of raising (backstop for the data migration and for the race where two concurrent callers both create)

Data migration (ami/main/migrations/0083_dedupe_taxalist_names.py)

For each scope group (same name, same scope) with >1 row, keep the oldest and rename the rest with a globally-unique " (duplicate N)" suffix. Does not merge taxa — operators can review and clean up manually. Idempotent.

About uniqueness at the DB level

Django UniqueConstraint cannot target M2M fields, so we can't directly express unique(name, project) on this schema. Three possible paths, not in scope for this PR:

  • Swap projects M2M for a nullable project FK on TaxaList — native UniqueConstraint(['name', 'project']) works; changes semantics to one-list-per-project. Cleanest long-term option.
  • Custom through model with denormalized name — drift-prone.
  • Postgres trigger / check constraint — complex.

This PR relies on app-level uniqueness (via get_or_create_for_project) plus the dedupe migration. A follow-up issue should decide whether to commit to option 1.

Prod dupe-count

Checked prod (2026-04-15): 28 total TaxaLists, 0 global dupes, 0 per-project dupes. The MultipleObjectsReturned crash that inspired this PR must have come from a different environment (staging/demo/dev). The dedupe migration is a no-op against prod but a useful safety net for other maintainers' envs.

Known regression risk

The three existing call sites now pass project=None, meaning they only match TaxaLists with zero project associations. If any env has a "Taxa returned by <algorithm>" list that got attached to a project historically, pipeline.py:605 will no longer find it and will create a fresh global row alongside. The migration's dedupe does not re-merge these — it only renames same-scope duplicates. If --project support lands in the callers later, this scoping choice should be revisited.

Tradeoff in pipeline.py

Algorithm taxa lists are now declared "global" by convention. That means two projects sharing an algorithm share one taxa list, and taxa added by one project are visible to the other. If that's wrong, the fix is scoping algorithm lists per-project (requires the --project follow-up).

Test plan

  • Unit tests for get_or_create_for_project — 8 tests covering global/project creation & retrieval, same-name collision resistance across scopes, existing-duplicate handling, and defaults-only-on-create semantics
  • Unit tests for 0083_dedupe_taxalist_names — 8 tests covering global dupes, project-scoped dupes, unique-name no-ops, cross-scope non-collision, multi-project overlap, post-migration get_or_create_for_project behavior, and idempotency
  • makemigrations --check clean; migrate applies cleanly against a fresh DB
  • Seed-then-migrate integration run: seeded 9 rows (global dupes, project-scoped dupes, cross-scope same-name, multi-project overlap), ran manage.py migrate main 0083, confirmed correct renames + idempotency on re-run
  • Run import_taxa / update_taxa against a local DB to confirm end-to-end
  • Deploy to staging and verify no more MultipleObjectsReturned in ML pipeline logs
  • Check prod TaxaList counts pre-deploy to size the migration's blast radius (0 dupes on prod, 28 total rows)

Follow-ups (new issues)

  • Decide on M2M → FK schema change for DB-level uniqueness
  • Add --project <slug> flags to import_taxa / update_taxa so operators can create project-scoped lists
  • If algorithm taxa lists should be per-project, revisit pipeline.py:605 once --project lands

🤖 Drafted with Claude Code

Summary by CodeRabbit

  • Refactor
    • Taxa list management now supports both global and project-specific lists, with improved duplicate handling and predictable creation behavior.
  • User-facing Commands
    • Import and update commands now default to globally scoped taxa lists, preserve creation messages, and note a future option to target a project.
  • Migrations
    • Added a deduplication migration that renames duplicate taxa lists to enforce unique global vs project scopes.
  • Tests
    • Added tests for creation, retrieval, duplicate resolution, scope isolation, and migration idempotency.

…ate names

The TaxaList model allows multiple lists with the same name, but several
places in the codebase use get_or_create(name=...) which fails with
MultipleObjectsReturned when duplicates exist.

This adds a new get_or_create_for_project() method that:
- Scopes lookups to a specific project (or global lists if project=None)
- Handles existing duplicates gracefully by returning the oldest one
- Updates all callers (pipeline.py, import_taxa, update_taxa) to use it

Also adds TODO comments to management commands about adding --project
parameter support in the future.

Co-Authored-By: Claude <noreply@anthropic.com>
@netlify
Copy link
Copy Markdown

netlify bot commented Jan 31, 2026

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit d0deb0e
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/69e00bc1fb3f3d0008d3bf03

@netlify
Copy link
Copy Markdown

netlify bot commented Jan 31, 2026

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit d0deb0e
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/69e00bbfce5f3800086a3c74

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Jan 31, 2026

📝 Walkthrough

Walkthrough

Introduces a queryset-backed TaxaList.objects.get_or_create_for_project(name, project=None) for global vs. project-scoped taxa lists; updates management commands and ML pipeline to call it with project=None; adds deduplication migration and tests covering new lookup/creation semantics and migration behavior.

Changes

Cohort / File(s) Summary
Core Model Updates
ami/main/models.py
Adds TaxaListQuerySet.get_or_create_for_project(name, project=None, **defaults), TaxaListManager (queryset-backed), and sets TaxaList.objects = TaxaListManager() with logic for global vs project-scoped lookup, creation, and duplicate resolution.
Management Commands
ami/main/management/commands/import_taxa.py, ami/main/management/commands/update_taxa.py
Replaces TaxaList.objects.get_or_create(...) with TaxaList.objects.get_or_create_for_project(..., project=None), adds TODO notes about a future --project option, and adjusts messaging for created vs existing lists.
ML Pipeline
ami/ml/models/pipeline.py
Updates get_or_create_taxon_for_classification to call TaxaList.objects.get_or_create_for_project(..., project=None) for algorithm taxa lists (explicit global scope).
Tests
ami/main/tests.py
Adds TaxaListGetOrCreateForProjectTestCase and TaxaListDedupeMigrationTestCase covering creation/retrieval for global and project-scoped lists, M2M membership, collision avoidance, legacy-duplicates handling (oldest wins), description defaults, migration behavior and idempotence.
Migration
ami/main/migrations/0083_dedupe_taxalist_names.py
New migration dedupe_taxa_list_names that groups TaxaList by (name, scope) (global vs project), keeps oldest name, renames later duplicates using " (duplicate N)" suffix selecting smallest unused N, exits early if no duplicates; reverse is a no-op.

Sequence Diagram

sequenceDiagram
    participant Caller as Command / ML Code
    participant Manager as TaxaListManager / QuerySet
    participant DB as Database

    Caller->>Manager: get_or_create_for_project(name, project=None)
    alt project is None (global)
        Manager->>DB: SELECT TaxaList WHERE name=? AND project_count=0
        DB-->>Manager: rows or none
        alt match found
            Manager-->>Caller: (existing_list, False)
        else no match
            Manager->>DB: INSERT TaxaList(name=...), created_at=...
            DB-->>Manager: new record
            Manager-->>Caller: (new_list, True)
        end
    else project provided
        Manager->>DB: SELECT TaxaList JOIN projects WHERE name=? AND project=?
        DB-->>Manager: rows or none
        alt match found
            Manager-->>Caller: (existing_list, False)
        else no match
            Manager->>DB: INSERT TaxaList(name=...), then INSERT M2M(project)
            DB-->>Manager: new record
            Manager-->>Caller: (new_list, True)
        end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through code to mend the list’s song,
Global names kept steady, duplicates moved along.
Commands call me global with project=None bright,
Migrations tidy rows, the oldest wins the fight.
Tests clap their paws — everything's working right.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely identifies the main problem being fixed: duplicate TaxaList names causing MultipleObjectsReturned errors.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering the problem, solution, migration strategy, deployment considerations, and thorough test plan with clear justification for design decisions.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/duplicate-taxalist-names

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI review requested due to automatic review settings February 27, 2026 09:30
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a critical bug where duplicate TaxaList names were causing MultipleObjectsReturned exceptions. Since the TaxaList.name field lacks a unique constraint, the code's use of get_or_create(name=...) would fail when duplicates existed.

Changes:

  • Introduces TaxaListQuerySet.get_or_create_for_project() method to scope TaxaList lookups by project (or globally for project=None)
  • Updates all three callers (pipeline.py, import_taxa, update_taxa) to use the new method
  • Handles existing duplicates gracefully by returning the oldest matching list

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
ami/main/models.py Adds TaxaListQuerySet with get_or_create_for_project() method and TaxaListManager to support project-scoped taxa list creation and retrieval
ami/ml/models/pipeline.py Updates algorithm taxa list creation to use new method with explicit project=None for global lists
ami/main/management/commands/update_taxa.py Updates taxa list lookup to use new method with project=None and adds TODO comment for future project parameter support
ami/main/management/commands/import_taxa.py Updates taxa list lookup to use new method with project=None and adds TODO comment for future project parameter support

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ami/main/models.py
Comment thread ami/main/models.py
Comment thread ami/main/models.py
mihow and others added 2 commits April 15, 2026 13:27
Covers global lists, project-scoped lists, same-name collisions across
scopes, legacy duplicate handling, and defaults-on-create semantics.

Also wraps create+projects.add in transaction.atomic() so a failed
projects.add rolls back the new row, and clarifies the docstring
around the defaults kwarg and the MultipleObjectsReturned race path.

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
ami/main/tests.py (1)

3699-3706: Optional: add duplicate fallback test for project-scoped duplicates.

You already validate oldest-record fallback for project=None; adding the same check for duplicates tied to one project would fully lock in the scoped contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/main/tests.py` around lines 3699 - 3706, Add a new test mirroring
test_handles_existing_duplicates_by_returning_oldest but scoped to a Project:
create a Project instance, create three TaxaList rows with the same name and
project set to that Project (saving the first as "first"), call
TaxaList.objects.get_or_create_for_project(name="Duplicate Name",
project=project_instance), then assert created is False and that the returned
taxa_list.pk equals first.pk; name the test something like
test_handles_existing_duplicates_for_project_scoped_returning_oldest and
reference TaxaList and get_or_create_for_project in the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@ami/main/tests.py`:
- Around line 3699-3706: Add a new test mirroring
test_handles_existing_duplicates_by_returning_oldest but scoped to a Project:
create a Project instance, create three TaxaList rows with the same name and
project set to that Project (saving the first as "first"), call
TaxaList.objects.get_or_create_for_project(name="Duplicate Name",
project=project_instance), then assert created is False and that the returned
taxa_list.pk equals first.pk; name the test something like
test_handles_existing_duplicates_for_project_scoped_returning_oldest and
reference TaxaList and get_or_create_for_project in the assertion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 08ee3a61-b656-46b2-b8e4-c6e3e6930827

📥 Commits

Reviewing files that changed from the base of the PR and between 04f60b6 and 759e592.

📒 Files selected for processing (3)
  • ami/main/models.py
  • ami/main/tests.py
  • ami/ml/models/pipeline.py

Rename duplicate TaxaList rows so (name, scope) is unique within each
scope (global = no projects, or per-project). Oldest row in each group
keeps its name; later rows get a " (duplicate N)" suffix that is
globally unique.

The migration does not merge taxa from duplicate lists — it only
renames them so the MultipleObjectsReturned paths in
get_or_create_for_project stop firing. Operators can review renamed
rows and merge/delete manually.

Covered by TaxaListDedupeMigrationTestCase (8 tests).

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow merged commit 5090057 into main Apr 15, 2026
7 checks passed
@mihow mihow deleted the fix/duplicate-taxalist-names branch April 15, 2026 22:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants