Skip to content

feat(graph): add suggest-collection command for collection optimization#1030

Open
smoparth wants to merge 1 commit intopython-wheel-build:mainfrom
smoparth:feat/suggest-collection
Open

feat(graph): add suggest-collection command for collection optimization#1030
smoparth wants to merge 1 commit intopython-wheel-build:mainfrom
smoparth:feat/suggest-collection

Conversation

@smoparth
Copy link
Copy Markdown
Contributor

@smoparth smoparth commented Apr 6, 2026

Pull Request Description

What

Add fromager graph suggest-collection that analyzes dependency overlap between onboarding packages and existing collections to recommend the best-fit collection for each package. Helps pipeline maintainers make data-driven decisions when assigning new packages to permanent collections.

  • Iterative DFS traversal over all edge types for full dependency closure
  • Ranks collections by fewest new packages, then highest coverage
  • Rich table (default) and JSON output formats
  • Error handling for bad graph files and duplicate collection names
  • Tests covering helpers, CLI, and analysis logic

Why

Closes: #971

@smoparth smoparth requested a review from a team as a code owner April 6, 2026 19:26
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 6, 2026

📝 Walkthrough

Walkthrough

Adds a new CLI command fromager graph suggest-collection that compares an onboarding dependency graph against one or more collection graphs. For each onboarding top-level package it computes the transitive dependency closure (iterative DFS with cycle protection), canonicalizes package names, and scores each collection by existing vs new packages and coverage percentage. Results are ranked (fewest new packages, then highest coverage) and emitted as either JSON or a Rich table. New helper functions and unit/integration tests for closure, name extraction, scoring, and output formats are included.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding a new suggest-collection command for optimizing collection organization by analyzing dependency overlap.
Description check ✅ Passed The description is clearly related to the changeset, explaining the feature, implementation approach, and closing issue #971.
Linked Issues check ✅ Passed The PR implementation meets all coding requirements from issue #971: CLI subcommand with required arguments, graph loading, top-level package identification, transitive closure computation with cycle handling, collection scoring with ranking, table/JSON output formats, helper functions, and comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #971: the new CLI command, helper functions, tests, and imports required for the suggest-collection feature. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@mergify mergify bot added the ci label Apr 6, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/fromager/commands/graph.py`:
- Around line 1031-1033: The current check in the onboarding output path always
prints a human message when toplevel_nodes is empty; update the block that
references toplevel_nodes so that when output_format == "json" it emits a valid
JSON empty list (e.g., using click.echo("[]")) and returns, otherwise keep the
human-friendly message "No top-level packages found in onboarding graph." This
change touches the conditional that checks toplevel_nodes in the function
handling the onboarding graph output and should reference the output_format
variable to decide which output to produce.
- Around line 916-918: The current sort used for selecting best_score
(scores.sort(...); best_score = scores[0]) only considers new_packages and
coverage_percentage, which leaves true ties to fall back to the original
collection_graphs order; make tie-breaking deterministic by adding a stable
filename-derived key as a tertiary sort field (e.g., change the sort key to
(score.new_packages, -score.coverage_percentage, score.filename) or, if Score
objects don’t carry filenames, derive the filename from the corresponding
collection_graphs entry when building scores) so best_score / best_fit is chosen
predictably by filename when the first two metrics tie.

In `@tests/test_suggest_collection.py`:
- Line 5: Replace the fragile regex-based JSON extraction in
tests/test_suggest_collection.py with direct JSON parsing of the entire CLI
output: remove the re import and any use of re.search that extracts a bracketed
block, and instead call json.loads(...) on the full output string produced by
the CLI (e.g., the variable holding stdout or `cli_output`/`result.stdout`) in
each affected test (the blocks around lines 24-29, 294-295, 327 etc.); ensure
the tests assert against the parsed dict/list returned by json.loads so the test
fails if extra bytes surround the JSON payload.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 13fe9c39-2bda-4331-802b-d350ab77b34e

📥 Commits

Reviewing files that changed from the base of the PR and between d86f938 and 16bfc24.

📒 Files selected for processing (2)
  • src/fromager/commands/graph.py
  • tests/test_suggest_collection.py


import json
import pathlib
import re
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Parse the full CLI payload in the JSON tests.

The regex extraction makes these tests pass even if the command writes extra bytes before or after the JSON. That would break automation while the suite stays green. Parse the exact CLI output directly instead of pulling out the first bracketed block. As per coding guidelines tests/**: Verify test actually tests the intended behavior.

Suggested fix
-import re
-
-def _extract_json_from_output(output: str) -> str:
-    """Extract JSON array from output that may contain leading log lines."""
-    json_match = re.search(r"\[.*\]", output, re.DOTALL)
-    if json_match:
-        return json_match.group(0)
-    return "[]"
-        data = json.loads(_extract_json_from_output(result.output))
+        data = json.loads(result.output)
-        data = json.loads(_extract_json_from_output(result.output))
+        data = json.loads(result.output)

Also applies to: 24-29, 294-295, 327-327

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_suggest_collection.py` at line 5, Replace the fragile regex-based
JSON extraction in tests/test_suggest_collection.py with direct JSON parsing of
the entire CLI output: remove the re import and any use of re.search that
extracts a bracketed block, and instead call json.loads(...) on the full output
string produced by the CLI (e.g., the variable holding stdout or
`cli_output`/`result.stdout`) in each affected test (the blocks around lines
24-29, 294-295, 327 etc.); ensure the tests assert against the parsed dict/list
returned by json.loads so the test fails if extra bytes surround the JSON
payload.

@smoparth smoparth force-pushed the feat/suggest-collection branch from 16bfc24 to 55faca2 Compare April 6, 2026 20:00
Add fromager graph suggest-collection that analyzes dependency overlap
between onboarding packages and existing collections to recommend the
best-fit collection for each package. Helps pipeline maintainers make
data-driven decisions when assigning new packages to permanent collections.

- Iterative DFS traversal over all edge types for full dependency closure
- Ranks collections by fewest new packages, then highest coverage
- Rich table (default) and JSON output formats
- Error handling for bad graph files and duplicate collection names
- Tests covering helpers, CLI, and analysis logic

Co-Authored-By: Claude-4.6-opus-high <claude@anthropic.com>
Closes: python-wheel-build#971
Signed-off-by: Shanmukh Pawan <smoparth@redhat.com>
@smoparth smoparth force-pushed the feat/suggest-collection branch from 55faca2 to 1836b93 Compare April 6, 2026 20:01
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/fromager/commands/graph.py (1)

1048-1067: Consider loading collection graphs in parallel for large inputs.

For many collection graphs, sequential loading could be slow. This is fine for typical usage but worth noting if performance becomes a concern with many large graphs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fromager/commands/graph.py` around lines 1048 - 1067, Parallelize loading
by extracting collection names first (call extract_collection_name for each
graph_path, check for duplicates) and then use a
concurrent.futures.ThreadPoolExecutor to call
DependencyGraph.from_file(graph_path) for each unique path in parallel; for each
future, on success call get_package_names(collection_graph) and populate
collection_packages[collection_name], and on exception wrap and raise a
click.ClickException that includes the graph_path and original error (preserve
the existing logger.debug call per collection_name). Reference symbols:
collection_graphs, extract_collection_name, DependencyGraph.from_file,
get_package_names, collection_packages.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/fromager/commands/graph.py`:
- Around line 1048-1067: Parallelize loading by extracting collection names
first (call extract_collection_name for each graph_path, check for duplicates)
and then use a concurrent.futures.ThreadPoolExecutor to call
DependencyGraph.from_file(graph_path) for each unique path in parallel; for each
future, on success call get_package_names(collection_graph) and populate
collection_packages[collection_name], and on exception wrap and raise a
click.ClickException that includes the graph_path and original error (preserve
the existing logger.debug call per collection_name). Reference symbols:
collection_graphs, extract_collection_name, DependencyGraph.from_file,
get_package_names, collection_packages.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b23b6d9-1f42-421f-9b08-b2e1bdfe4769

📥 Commits

Reviewing files that changed from the base of the PR and between 16bfc24 and 1836b93.

📒 Files selected for processing (2)
  • src/fromager/commands/graph.py
  • tests/test_suggest_collection.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_suggest_collection.py

Copy link
Copy Markdown

@jlarkin09 jlarkin09 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than the one suggestion by coderabbit for json.loads()

@tiran
Copy link
Copy Markdown
Collaborator

tiran commented Apr 7, 2026

I'm not convinced that it makes sense to add this feature to Fromager upstream. Collections are a downstream concept, not a Fromager feature.

@smoparth
Copy link
Copy Markdown
Contributor Author

smoparth commented Apr 7, 2026

I'm not convinced that it makes sense to add this feature to Fromager upstream. Collections are a downstream concept, not a Fromager feature.

@tiran, I had the same thought when I picked this.
@dhellmann, would love to hear your take on right place this feature could land.

@dhellmann
Copy link
Copy Markdown
Member

I'm not convinced that it makes sense to add this feature to Fromager upstream. Collections are a downstream concept, not a Fromager feature.

I want this feature here. Let's think of a better name for it.

print(f"\nBuilding {len(graph)} packages in {rounds} rounds.")


def _traverse_edges(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be a method of DependencyNode.

}


def extract_collection_name(graph_path: str) -> str:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This encodes some very opinionated naming conventions in the filename. How about if we just use the filename as the identifier? Then users could name the files in ways that are meaningful to them.

"best_fit": best_score.collection if best_score else "none",
"new_packages": best_score.new_packages if best_score else 0,
"existing_packages": best_score.existing_packages if best_score else 0,
"coverage_percentage": round(best_score.coverage_percentage, 1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting this expression over multiple lines without any () made it a bit confusing to read.

Collections are ranked by fewest new packages required, then by highest
dependency coverage.

\b
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this \b do something when printed as part of the command help?

if output_format == "json":
click.echo("[]")
else:
click.echo("No top-level packages found in onboarding graph.")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we always send this message to stderr and then always print the data structure the same way, instead of checking the output format here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add graph suggest-collection command to optimize collection organization

4 participants