feat(graph): add suggest-collection command for collection optimization #1030
smoparth wants to merge 1 commit into python-wheel-build:main
Conversation
📝 Walkthrough: Adds a new CLI command.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/fromager/commands/graph.py`:
- Around line 1031-1033: The current check in the onboarding output path always
prints a human message when toplevel_nodes is empty; update the block that
references toplevel_nodes so that when output_format == "json" it emits a valid
JSON empty list (e.g., using click.echo("[]")) and returns, otherwise keep the
human-friendly message "No top-level packages found in onboarding graph." This
change touches the conditional that checks toplevel_nodes in the function
handling the onboarding graph output and should reference the output_format
variable to decide which output to produce.
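A minimal sketch of the suggested conditional (the function and parameter names here are hypothetical stand-ins for the code in graph.py, and `print` stands in for `click.echo` to keep the sketch self-contained):

```python
import json


def emit_empty_result(output_format: str) -> None:
    """Emit the no-results case; JSON mode must still produce valid JSON."""
    if output_format == "json":
        # An empty list is a valid JSON document that downstream parsers
        # can consume, unlike a human-readable message on stdout.
        print(json.dumps([]))
    else:
        print("No top-level packages found in onboarding graph.")
```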
- Around line 916-918: The current sort used for selecting best_score
(scores.sort(...); best_score = scores[0]) only considers new_packages and
coverage_percentage, which leaves true ties to fall back to the original
collection_graphs order; make tie-breaking deterministic by adding a stable
filename-derived key as a tertiary sort field (e.g., change the sort key to
(score.new_packages, -score.coverage_percentage, score.filename) or, if Score
objects don’t carry filenames, derive the filename from the corresponding
collection_graphs entry when building scores) so best_score / best_fit is chosen
predictably by filename when the first two metrics tie.
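The deterministic tie-break can be sketched as follows (the `Score` shape here is hypothetical; the real object lives in graph.py):

```python
from dataclasses import dataclass


@dataclass
class Score:
    # Hypothetical stand-in for the real Score object in graph.py.
    collection: str
    filename: str
    new_packages: int
    coverage_percentage: float


def pick_best(scores: list[Score]) -> Score:
    """Rank by fewest new packages, then highest coverage, then filename.

    The filename key makes ties deterministic instead of depending on the
    order in which the collection graphs were supplied on the command line.
    """
    scores.sort(
        key=lambda s: (s.new_packages, -s.coverage_percentage, s.filename)
    )
    return scores[0]
```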
In `@tests/test_suggest_collection.py`:
- Line 5: Replace the fragile regex-based JSON extraction in
tests/test_suggest_collection.py with direct JSON parsing of the entire CLI
output: remove the re import and any use of re.search that extracts a bracketed
block, and instead call json.loads(...) on the full output string produced by
the CLI (e.g., the variable holding stdout or `cli_output`/`result.stdout`) in
each affected test (the blocks around lines 24-29, 294-295, 327 etc.); ensure
the tests assert against the parsed dict/list returned by json.loads so the test
fails if extra bytes surround the JSON payload.
ℹ️ Review info
⚙️ Configuration used: .coderabbit.yaml | Review profile: CHILL | Plan: Pro
Run ID: 13fe9c39-2bda-4331-802b-d350ab77b34e
📒 Files selected for processing (2)
- src/fromager/commands/graph.py
- tests/test_suggest_collection.py
    import json
    import pathlib
    import re
Parse the full CLI payload in the JSON tests.
The regex extraction makes these tests pass even if the command writes extra bytes before or after the JSON. That would break automation while the suite stays green. Parse the exact CLI output directly instead of pulling out the first bracketed block. As per coding guidelines tests/**: Verify test actually tests the intended behavior.
Suggested fix:

    -import re
    -
    -def _extract_json_from_output(output: str) -> str:
    -    """Extract JSON array from output that may contain leading log lines."""
    -    json_match = re.search(r"\[.*\]", output, re.DOTALL)
    -    if json_match:
    -        return json_match.group(0)
    -    return "[]"

    -    data = json.loads(_extract_json_from_output(result.output))
    +    data = json.loads(result.output)

Also applies to: 24-29, 294-295, 327-327
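A small, self-contained illustration of the failure mode (the helper below mirrors the `_extract_json_from_output` shown in the suggested fix):

```python
import json
import re


def _extract_json_from_output(output: str) -> str:
    """Regex-based extraction: silently tolerates bytes around the JSON."""
    json_match = re.search(r"\[.*\]", output, re.DOTALL)
    if json_match:
        return json_match.group(0)
    return "[]"


noisy = 'WARNING: something happened\n["pkg-a", "pkg-b"]\ntrailing junk'

# Regex extraction "succeeds" even though the payload is not clean JSON...
extracted = json.loads(_extract_json_from_output(noisy))

# ...while strict parsing of the full output correctly fails.
try:
    json.loads(noisy)
    strict_parse_ok = True
except json.JSONDecodeError:
    strict_parse_ok = False
```

This is why the test would stay green while automation consuming the CLI output breaks.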
Force-pushed 16bfc24 to 55faca2 (Compare)
Add fromager graph suggest-collection that analyzes dependency overlap between onboarding packages and existing collections to recommend the best-fit collection for each package. Helps pipeline maintainers make data-driven decisions when assigning new packages to permanent collections.

- Iterative DFS traversal over all edge types for full dependency closure
- Ranks collections by fewest new packages, then highest coverage
- Rich table (default) and JSON output formats
- Error handling for bad graph files and duplicate collection names
- Tests covering helpers, CLI, and analysis logic

Co-Authored-By: Claude-4.6-opus-high <claude@anthropic.com>
Closes: python-wheel-build#971
Signed-off-by: Shanmukh Pawan <smoparth@redhat.com>
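The "iterative DFS over all edge types" in the commit message can be sketched like this (the graph shape and names here are hypothetical, not fromager's actual `DependencyGraph` API):

```python
def dependency_closure(graph: dict[str, list[str]], root: str) -> set[str]:
    """Iterative DFS collecting every package reachable from root.

    `graph` maps a package name to its dependencies across all edge
    types (install, build-system, build-backend, ...), flattened into
    one adjacency list. An explicit stack avoids recursion limits on
    deep dependency chains.
    """
    seen: set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen
```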
Force-pushed 55faca2 to 1836b93 (Compare)
🧹 Nitpick comments (1)
src/fromager/commands/graph.py (1)
1048-1067: Consider loading collection graphs in parallel for large inputs.

For many collection graphs, sequential loading could be slow. This is fine for typical usage but worth noting if performance becomes a concern with many large graphs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/fromager/commands/graph.py` around lines 1048 - 1067, Parallelize loading by extracting collection names first (call extract_collection_name for each graph_path, check for duplicates) and then use a concurrent.futures.ThreadPoolExecutor to call DependencyGraph.from_file(graph_path) for each unique path in parallel; for each future, on success call get_package_names(collection_graph) and populate collection_packages[collection_name], and on exception wrap and raise a click.ClickException that includes the graph_path and original error (preserve the existing logger.debug call per collection_name). Reference symbols: collection_graphs, extract_collection_name, DependencyGraph.from_file, get_package_names, collection_packages.
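A sketch of the parallel-loading idea with `concurrent.futures` (here `json.load` of a list of package names stands in for fromager's `DependencyGraph.from_file`, and the filename stem stands in for `extract_collection_name`; all names are illustrative):

```python
import concurrent.futures
import json
import pathlib


def load_collections(graph_paths: list[str]) -> dict[str, set[str]]:
    """Load collection graphs in parallel, failing fast on duplicates."""
    # Resolve collection names first so duplicate names error out before
    # any file I/O is spent.
    names: dict[str, str] = {}
    for path in graph_paths:
        name = pathlib.Path(path).stem
        if name in names:
            raise ValueError(f"duplicate collection name: {name}")
        names[name] = path

    collection_packages: dict[str, set[str]] = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {
            pool.submit(lambda p: json.loads(pathlib.Path(p).read_text()), path): name
            for name, path in names.items()
        }
        for future in concurrent.futures.as_completed(futures):
            name = futures[future]
            try:
                graph = future.result()
            except Exception as err:
                # The real code would raise click.ClickException here.
                raise RuntimeError(f"failed to load {names[name]}: {err}") from err
            collection_packages[name] = set(graph)
    return collection_packages
```

Threads are enough here because the work is I/O-bound file reading; the GIL is released during reads.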
ℹ️ Review info
⚙️ Configuration used: .coderabbit.yaml | Review profile: CHILL | Plan: Pro
Run ID: 6b23b6d9-1f42-421f-9b08-b2e1bdfe4769
📒 Files selected for processing (2)
- src/fromager/commands/graph.py
- tests/test_suggest_collection.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_suggest_collection.py
jlarkin09 left a comment:

LGTM other than the one suggestion by coderabbit for `json.loads()`.
I'm not convinced that it makes sense to add this feature to Fromager upstream. Collections are a downstream concept, not a Fromager feature.

@tiran, I had the same thought when I picked this.

I want this feature here. Let's think of a better name for it.
    print(f"\nBuilding {len(graph)} packages in {rounds} rounds.")
    ...
    def _traverse_edges(

This should probably be a method of `DependencyNode`.
    }
    ...
    def extract_collection_name(graph_path: str) -> str:

This encodes some very opinionated naming conventions in the filename. How about if we just use the filename as the identifier? Then users could name the files in ways that are meaningful to them.
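The reviewer's suggestion could look like this (a sketch; the function name is hypothetical):

```python
import pathlib


def collection_name_from_path(graph_path: str) -> str:
    """Use the bare filename (minus extension) as the collection
    identifier, instead of parsing an opinionated naming convention
    out of it. Users can then name the files however they like."""
    return pathlib.Path(graph_path).stem
```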
| "best_fit": best_score.collection if best_score else "none", | ||
| "new_packages": best_score.new_packages if best_score else 0, | ||
| "existing_packages": best_score.existing_packages if best_score else 0, | ||
| "coverage_percentage": round(best_score.coverage_percentage, 1) |
There was a problem hiding this comment.
Splitting this expression over multiple lines without any () made it a bit confusing to read.
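One way to address the readability concern is to parenthesize the conditional so the continuation is explicit (a sketch; the `else 0.0` branch of the last entry is an assumption, since the diff context is truncated):

```python
def summarize(best_score) -> dict:
    """Build the per-package summary with explicit parentheses.

    best_score may be None when no collection matched the package.
    """
    return {
        "best_fit": best_score.collection if best_score else "none",
        "new_packages": best_score.new_packages if best_score else 0,
        "existing_packages": best_score.existing_packages if best_score else 0,
        "coverage_percentage": (
            round(best_score.coverage_percentage, 1) if best_score else 0.0
        ),
    }
```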
    Collections are ranked by fewest new packages required, then by highest
    dependency coverage.

    \b

Does this \b do something when printed as part of the command help?
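For context on the question: in Click, a line containing only `\b` in a command docstring marks the following paragraph as pre-formatted, so the help formatter does not rewrap it (the marker itself is stripped from the output). A minimal demonstration, assuming click is available:

```python
import click
from click.testing import CliRunner


@click.command()
def suggest():
    """Rank collections for each package.

    \b
    line one
    line two
    """


result = CliRunner().invoke(suggest, ["--help"])
# With the \b marker, "line one" and "line two" stay on separate lines
# in the help output; without it, Click would rewrap them into a single
# paragraph.
```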
    if output_format == "json":
        click.echo("[]")
    else:
        click.echo("No top-level packages found in onboarding graph.")

Can we always send this message to stderr and then always print the data structure the same way, instead of checking the output format here?
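A sketch of that suggestion: route the human-readable note to stderr unconditionally and keep stdout machine-readable in every case (here `print` stands in for `click.echo(..., err=True)` to keep the sketch self-contained):

```python
import json
import sys


def emit(toplevel_nodes: list) -> None:
    """Always print the data structure to stdout; the human-friendly
    note goes to stderr so JSON consumers never see it, and the output
    path no longer branches on the format."""
    if not toplevel_nodes:
        print("No top-level packages found in onboarding graph.",
              file=sys.stderr)
    print(json.dumps(toplevel_nodes))
```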
Pull Request Description
What
Add fromager graph suggest-collection that analyzes dependency overlap between onboarding packages and existing collections to recommend the best-fit collection for each package. Helps pipeline maintainers make data-driven decisions when assigning new packages to permanent collections.
Why
Closes: #971