Fix Small Dataset Clustering Issue in Adaptive Density-Aware Clustering#1330
Fix Small Dataset Clustering Issue in Adaptive Density-Aware Clustering#1330rohan-pandeyy wants to merge 2 commits into
Conversation
WalkthroughLowers the default ChangesClustering eps clamping and threshold default
OpenAPI JSON reformatting
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
backend/tests/test_face_clusters.py (1)
554-601: ⚡ Quick winMake the regression deterministically exercise the clamping branch.
The current setup depends on generated geometry to produce a large adaptive
eps. Mockestimate_epsto a value abovemax_distanceso this test always validates the clamp path explicitly.Proposed refactor
- `@patch`("app.utils.face_clusters.db_get_all_faces_with_cluster_names") - def test_adaptive_eps_clamping_regression(self, mock_db_get): + `@patch`("app.utils.face_clusters.estimate_eps", return_value=1.2) + `@patch`("app.utils.face_clusters.db_get_all_faces_with_cluster_names") + def test_adaptive_eps_clamping_regression(self, mock_db_get, mock_estimate_eps): @@ results, _ = cluster_util_cluster_all_face_embeddings( min_samples=2, similarity_threshold=0.85 ) + mock_estimate_eps.assert_called_once()As per coding guidelines, “Ensure that test code is automated, comprehensive, and follows testing best practices” and “Verify that all critical functionality is covered by tests.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/tests/test_face_clusters.py` around lines 554 - 601, The test test_adaptive_eps_clamping_regression currently relies on randomly generated geometry to produce a large adaptive eps value that triggers the clamping logic. This makes the test non-deterministic and may fail to exercise the clamping branch in some runs. Add a patch decorator to mock the estimate_eps function (the function that calculates adaptive eps during clustering) and set its return value to a value explicitly greater than max_distance (0.15 in this case). This ensures the clamping branch in cluster_util_cluster_all_face_embeddings is always tested deterministically, regardless of the random embedding geometry.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/app/config/settings.py`:
- Line 125: The default value for PICTO_CLUSTERING_SIMILARITY_THRESHOLD has been
changed to 0.65 with a valid range of 0.0 to 1.0, but the docstring in the
cluster_util_cluster_all_face_embeddings function still documents the old
default of 0.85 and the outdated range of 0.75-0.90. Update the docstring for
the similarity_threshold parameter in the
cluster_util_cluster_all_face_embeddings function to reflect the new default
value of 0.65 and the correct valid range of 0.0 to 1.0 to keep the
documentation in sync with the actual configuration.
In `@backend/app/utils/face_clusters.py`:
- Around line 289-298: The clamped_eps value can become 0.0 when
similarity_threshold is 1.0 (causing max_distance to be 0.0), but DBSCAN
requires eps to be strictly positive. After line 289 where clamped_eps is
calculated, add a guard to enforce a positive floor for clamped_eps (such as
using machine epsilon or a minimum threshold value) to ensure it never reaches
zero. Alternatively, validate that similarity_threshold is less than 1.0 before
performing the eps estimation to prevent max_distance from becoming zero in the
first place.
---
Nitpick comments:
In `@backend/tests/test_face_clusters.py`:
- Around line 554-601: The test test_adaptive_eps_clamping_regression currently
relies on randomly generated geometry to produce a large adaptive eps value that
triggers the clamping logic. This makes the test non-deterministic and may fail
to exercise the clamping branch in some runs. Add a patch decorator to mock the
estimate_eps function (the function that calculates adaptive eps during
clustering) and set its return value to a value explicitly greater than
max_distance (0.15 in this case). This ensures the clamping branch in
cluster_util_cluster_all_face_embeddings is always tested deterministically,
regardless of the random embedding geometry.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c79ac217-accd-44eb-9f47-0ec0d52227a7
📒 Files selected for processing (4)
backend/app/config/settings.pybackend/app/utils/face_clusters.pybackend/tests/test_face_clusters.pydocs/backend/backend_python/openapi.json
Fixes a bug for #1271
Description
This PR fixes an edge case in adaptive face clustering where the estimated DBSCAN
epsvalue could exceed the maximum distance derived from the configured face similarity threshold.In small or sparse datasets, the adaptive
epsestimation could return values greater than1.0. Since distances above the similarity threshold are intentionally set to1.0to represent clearly different identities, an oversizedepswould cause DBSCAN to treat those pairs as neighbors, resulting in unrelated faces being merged into the same cluster.To prevent this, the estimated
epsis now clamped tomax_distance(1 - similarity_threshold) before being passed to DBSCAN.Changes
epstomax_distancebefore clusteringTesting
epsvalues exceeding the similarity threshold distanceestimate_eps()returns large valuesResult
The adaptive clustering pipeline now respects the configured similarity threshold in all cases, preventing unrelated identities from being grouped together due to oversized
epsestimates.AI Usage Disclosure:
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.
Check one of the checkboxes below:
I have used the following AI models and tools: Claude, Gemini
Checklist
Summary by CodeRabbit