Fix Small Dataset Clustering Issue in Adaptive Density-Aware Clustering #1330

Open
rohan-pandeyy wants to merge 2 commits into AOSSIE-Org:main from rohan-pandeyy:fix/thresholds-in-face-clustering

Conversation

@rohan-pandeyy rohan-pandeyy commented Jun 20, 2026

Member

Fixes a bug for #1271

Description

This PR fixes an edge case in adaptive face clustering where the estimated DBSCAN eps value could exceed the maximum distance derived from the configured face similarity threshold.

In small or sparse datasets, the adaptive eps estimation could return values greater than 1.0. Since distances above the similarity threshold are intentionally set to 1.0 to represent clearly different identities, an oversized eps would cause DBSCAN to treat those pairs as neighbors, resulting in unrelated faces being merged into the same cluster.
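
To make the failure mode concrete, here is a minimal sketch rather than the project's code. It assumes cosine distance (1 - similarity), pins every pair beyond max_distance to 1.0 as described above, and counts a pair as neighbors when its distance is at most eps, the usual DBSCAN neighborhood rule. The helper names and the toy 2-D "embeddings" are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def thresholded_distance(a, b, similarity_threshold):
    """Cosine distance, pinned to 1.0 once it exceeds max_distance (assumed scheme)."""
    distance = 1.0 - cosine_similarity(a, b)
    max_distance = 1.0 - similarity_threshold
    return distance if distance <= max_distance else 1.0


def neighbor_pairs(embeddings, eps, similarity_threshold):
    """Index pairs that an eps-neighborhood would treat as neighbors."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = thresholded_distance(embeddings[i], embeddings[j], similarity_threshold)
            if d <= eps:
                pairs.append((i, j))
    return pairs


# Faces 0 and 1 are near-duplicates; face 2 is an unrelated identity.
faces = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
threshold = 0.85  # max_distance is roughly 0.15

print(neighbor_pairs(faces, eps=0.15, similarity_threshold=threshold))  # [(0, 1)]
print(neighbor_pairs(faces, eps=1.2, similarity_threshold=threshold))   # [(0, 1), (0, 2), (1, 2)]
```

With the oversized eps, the pairs that were deliberately pushed out to 1.0 become neighbors again, which is how unrelated faces end up in one cluster.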

To prevent this, the estimated eps is now clamped to max_distance (1 - similarity_threshold) before being passed to DBSCAN.
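
The clamp itself can be sketched in a few lines. This is an illustration under assumptions, not the code in backend/app/utils/face_clusters.py: the function name `clamp_eps` and the log messages are hypothetical, and only the upper bound described above is shown.

```python
import logging

logger = logging.getLogger(__name__)


def clamp_eps(estimated_eps: float, similarity_threshold: float) -> float:
    """Cap an adaptive eps estimate at max_distance = 1 - similarity_threshold."""
    max_distance = 1.0 - similarity_threshold
    if estimated_eps > max_distance:
        # Oversized estimate: warn and fall back to the configured limit.
        logger.warning(
            "Estimated eps %.4f exceeds max_distance %.4f; clamping.",
            estimated_eps,
            max_distance,
        )
        return max_distance
    # Valid estimate: keep the adaptive value unchanged.
    logger.info(
        "Using estimated eps %.4f (max_distance %.4f).", estimated_eps, max_distance
    )
    return estimated_eps
```

For example, `clamp_eps(1.2, 0.85)` returns a value of about 0.15, while `clamp_eps(0.08, 0.85)` returns 0.08 untouched, so valid estimates keep their adaptive behavior.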

Changes

  • Clamp adaptive eps to max_distance before clustering
  • Add warning logs when the estimated value exceeds the allowed maximum
  • Preserve existing adaptive clustering behavior for valid estimates
  • Add a regression test covering the small-dataset scenario that previously caused incorrect identity merging

Testing

  • Added regression test for adaptive eps values exceeding the similarity threshold distance
  • Verified that distinct identities are no longer merged when estimate_eps() returns large values
  • Confirmed all existing tests continue to pass

Result

The adaptive clustering pipeline now respects the configured similarity threshold in all cases, preventing unrelated identities from being grouped together due to oversized eps estimates.

AI Usage Disclosure:

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.

Check one of the checkboxes below:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: Claude, Gemini

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

  • Bug Fixes
    • Improved face clustering robustness by clamping adaptive distance calculations to respect the configured similarity threshold and preventing overly small epsilon values.
  • Chores
    • Updated the default similarity threshold used for clustering behavior.
    • Added a regression test to ensure separate identities stay in distinct clusters and singleton embeddings don’t cause unintended merges.
  • Documentation
    • Refreshed OpenAPI document formatting for improved readability (no functional API changes).

coderabbitai Bot commented Jun 20, 2026

Contributor

Review Change Stack

Walkthrough

Lowers the default PICTO_CLUSTERING_SIMILARITY_THRESHOLD from 0.85 to 0.65. In cluster_util_cluster_all_face_embeddings, the adaptively estimated DBSCAN eps is now clamped to max_distance, with a warning log when the clamp applies and an info log otherwise. A regression test is added to verify that two tight identity clusters remain distinct when singleton embeddings are also present. The OpenAPI JSON file is reformatted without functional changes.

Changes

Clustering eps clamping and threshold default

| Layer / File(s) | Summary |
| --- | --- |
| Threshold default and eps clamping logic (backend/app/config/settings.py, backend/app/utils/face_clusters.py) | Lowers PICTO_CLUSTERING_SIMILARITY_THRESHOLD default from 0.85 to 0.65. Updates the cluster_util_cluster_all_face_embeddings docstring to document the new threshold range. Adds an upper-bound clamp of estimated_eps to max_distance with a warning log when clamped and an info log otherwise, plus a minimum lower bound of 1e-6. |
| Regression test for eps clamping (backend/tests/test_face_clusters.py) | Adds test_adaptive_eps_clamping_regression with synthetic two-cluster plus singleton embeddings to assert identities stay separate; renumbers test_quality_gate docstring from Test 4 to Test 5. |

OpenAPI JSON reformatting

| Layer / File(s) | Summary |
| --- | --- |
| OpenAPI JSON array reformatting (docs/backend/backend_python/openapi.json) | Expands tags and required arrays throughout the document from compact single-line to multi-line JSON format; no functional content changes. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • AOSSIE-Org/PictoPy#771: Directly modifies the same cluster_util_cluster_all_face_embeddings function in backend/app/utils/face_clusters.py with similarity-threshold-based clustering logic that this PR builds on.

Suggested labels

Python, Documentation

Suggested reviewers

  • rahulharpal1603

Poem

🐇 Hop hop, the eps won't roam too far,
Clamped to max_distance like a guiding star.
Two clusters stay apart, neat and true,
The singletons won't muddle what we knew.
Default threshold lowered, tests all green —
The cleanest clustering this warren's seen! 🌟

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title directly describes the main change: fixing a clustering issue in the adaptive density-aware clustering algorithm by clamping the adaptive eps parameter. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🧹 Nitpick comments (1)
backend/tests/test_face_clusters.py (1)

554-601: ⚡ Quick win

Make the regression deterministically exercise the clamping branch.

The current setup depends on generated geometry to produce a large adaptive eps. Mock estimate_eps to a value above max_distance so this test always validates the clamp path explicitly.

Proposed refactor
-    @patch("app.utils.face_clusters.db_get_all_faces_with_cluster_names")
-    def test_adaptive_eps_clamping_regression(self, mock_db_get):
+    @patch("app.utils.face_clusters.estimate_eps", return_value=1.2)
+    @patch("app.utils.face_clusters.db_get_all_faces_with_cluster_names")
+    def test_adaptive_eps_clamping_regression(self, mock_db_get, mock_estimate_eps):
@@
         results, _ = cluster_util_cluster_all_face_embeddings(
             min_samples=2, similarity_threshold=0.85
         )
+        mock_estimate_eps.assert_called_once()

As per coding guidelines, “Ensure that test code is automated, comprehensive, and follows testing best practices” and “Verify that all critical functionality is covered by tests.”
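
The idea behind the proposed refactor can be tried in isolation. The sketch below is self-contained and uses a hypothetical stand-in namespace (`clustering`) and helper (`choose_eps`) instead of the real app.utils.face_clusters module. It only illustrates how `unittest.mock.patch` forces an oversized estimate so the clamp branch always runs.

```python
from types import SimpleNamespace
from unittest.mock import patch

# Hypothetical stand-in for the clustering module; not the real project code.
clustering = SimpleNamespace(estimate_eps=lambda distances: 0.05)


def choose_eps(distances, similarity_threshold):
    """Look up estimate_eps at call time so a patch takes effect, then clamp."""
    max_distance = 1.0 - similarity_threshold
    return min(clustering.estimate_eps(distances), max_distance)


def run_clamp_check():
    # Force an estimate above max_distance so the clamp path is always exercised.
    with patch.object(clustering, "estimate_eps", return_value=1.2) as mock_estimate:
        eps = choose_eps(distances=[[0.0]], similarity_threshold=0.85)
    # Confirm the patched estimator was the one actually consulted.
    mock_estimate.assert_called_once()
    return eps


print(run_clamp_check())
```

In the real test, the patch target has to be the name as looked up inside the module under test (here, `app.utils.face_clusters.estimate_eps`, as in the diff above). Patching the function where it is defined would have no effect if the module imported it by name.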

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/tests/test_face_clusters.py` around lines 554 - 601, The test
test_adaptive_eps_clamping_regression currently relies on randomly generated
geometry to produce a large adaptive eps value that triggers the clamping logic.
This makes the test non-deterministic and may fail to exercise the clamping
branch in some runs. Add a patch decorator to mock the estimate_eps function
(the function that calculates adaptive eps during clustering) and set its return
value to a value explicitly greater than max_distance (0.15 in this case). This
ensures the clamping branch in cluster_util_cluster_all_face_embeddings is
always tested deterministically, regardless of the random embedding geometry.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app/config/settings.py`:
- Line 125: The default value for PICTO_CLUSTERING_SIMILARITY_THRESHOLD has been
changed to 0.65 with a valid range of 0.0 to 1.0, but the docstring in the
cluster_util_cluster_all_face_embeddings function still documents the old
default of 0.85 and the outdated range of 0.75-0.90. Update the docstring for
the similarity_threshold parameter in the
cluster_util_cluster_all_face_embeddings function to reflect the new default
value of 0.65 and the correct valid range of 0.0 to 1.0 to keep the
documentation in sync with the actual configuration.

In `@backend/app/utils/face_clusters.py`:
- Around line 289-298: The clamped_eps value can become 0.0 when
similarity_threshold is 1.0 (causing max_distance to be 0.0), but DBSCAN
requires eps to be strictly positive. After line 289 where clamped_eps is
calculated, add a guard to enforce a positive floor for clamped_eps (such as
using machine epsilon or a minimum threshold value) to ensure it never reaches
zero. Alternatively, validate that similarity_threshold is less than 1.0 before
performing the eps estimation to prevent max_distance from becoming zero in the
first place.
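
A minimal sketch of the positive-floor guard described in the second inline comment, under stated assumptions: the name `bounded_eps` is hypothetical, the 1e-6 floor is the value mentioned in the walkthrough summary, and the range check is one possible way to validate the threshold, not necessarily what the PR does.

```python
MIN_EPS = 1e-6  # floor value quoted in the walkthrough; DBSCAN needs eps > 0


def bounded_eps(estimated_eps: float, similarity_threshold: float) -> float:
    """Clamp eps to [MIN_EPS, max_distance], keeping it strictly positive."""
    if not 0.0 <= similarity_threshold <= 1.0:
        raise ValueError("similarity_threshold must be between 0.0 and 1.0")
    max_distance = 1.0 - similarity_threshold
    # The upper clamp runs first; the floor then covers max_distance == 0.0.
    return max(min(estimated_eps, max_distance), MIN_EPS)


print(bounded_eps(0.3, 1.0))   # threshold of 1.0 gives max_distance 0.0, floor applies
print(bounded_eps(0.3, 0.65))  # estimate is below max_distance, returned unchanged
```

Applying the floor last means a threshold of exactly 1.0 still yields a usable eps, at the cost of eps slightly exceeding max_distance in that one degenerate case.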

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c79ac217-accd-44eb-9f47-0ec0d52227a7

📥 Commits

Reviewing files that changed from the base of the PR and between 584f333 and f4c36a4.

📒 Files selected for processing (4)
  • backend/app/config/settings.py
  • backend/app/utils/face_clusters.py
  • backend/tests/test_face_clusters.py
  • docs/backend/backend_python/openapi.json

rohan-pandeyy changed the title from "fix(face-clustering): clamp adaptive eps to never exceed max_distance" to "Fix Small Dataset Clustering Issue in Adaptive Density-Aware Clustering" on Jun 20, 2026