📝 Walkthrough
Refactors KB origin enrichment to a bulk scan-job flow: adds a KB client that submits file hashes and polls for results, updates SourceItem to accept a kb_origin_urls mapping and apply origin info, and modifies CLI to collect hashes and fetch/apply origins in bulk.
Changes
KB Scan Job Integration
Sequence Diagram(s)
sequenceDiagram
participant CLI as CLI:merge_results
participant Collector as _collect_kb_file_hashes
participant KBClient as fetch_origin_urls_via_scan_job
participant KBServer as KB Scan Job API
participant Item as SourceItem.set_oss_item
CLI->>Collector: walk files, compute MD5 hashes
Collector-->>CLI: hashes + extra candidates
CLI->>KBClient: submit hashes for bulk lookup
KBClient->>KBServer: POST /scan/jobs (hash list)
KBServer-->>KBClient: job_id
KBClient->>KBServer: poll GET /scan/jobs/{job_id}
KBServer-->>KBClient: completed results (rows)
KBClient-->>CLI: file_hash -> origin_url mapping
CLI->>Item: set_oss_item(kb_origin_urls=mapping)
Item->>Item: _apply_kb_origin_url for each matched hash
Item-->>CLI: enriched OssItem instances
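The submit-then-poll portion of the diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `poll_job`, the `get_status` callable, and the `status`, `rows`, `file_hash` and `origin_url` keys are hypothetical names chosen to mirror the diagram's "completed results (rows)" and "file_hash -> origin_url mapping" steps.

```python
import time
from typing import Callable


def poll_job(
    get_status: Callable[[], dict],
    timeout_sec: float,
    interval_sec: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll get_status() until the job completes or the deadline passes.

    Returns a file_hash -> origin_url mapping, or {} on timeout or failure.
    All response keys used here are hypothetical.
    """
    deadline = clock() + timeout_sec
    while clock() < deadline:
        job = get_status()
        status = job.get("status")
        if status == "completed":
            # Keep only rows that carry both a hash and an origin URL.
            return {
                row["file_hash"]: row["origin_url"]
                for row in job.get("rows", [])
                if row.get("file_hash") and row.get("origin_url")
            }
        if status == "failed":
            return {}
        sleep(interval_sec)
    return {}
```

Passing `sleep` and `clock` as parameters is only a testing convenience; it lets the loop be exercised without real waiting.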
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/fosslight_source/_kb_client.py (1)
73-75: 💤 Low value
Consider catching more specific exceptions instead of bare `Exception`.
The `(json.JSONDecodeError, Exception)` catch effectively catches everything. This is intentional for network resilience, but could mask unexpected programming errors. Consider catching `OSError` (covers socket/connection issues) alongside `json.JSONDecodeError` for clearer intent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/fosslight_source/_kb_client.py` around lines 73 - 75, Replace the overly-broad except (json.JSONDecodeError, Exception) with a narrower exception tuple to avoid masking programming errors: catch json.JSONDecodeError and OSError (or OSError plus any specific network library exceptions you use) instead, i.e. change the except to except (json.JSONDecodeError, OSError) as e:, keep the existing logger.warning(f"KB scan job create failed: {e}") and return {} so only JSON and OS/network issues are swallowed while other exceptions propagate.
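A minimal sketch of the narrower catch, assuming the job-creation request is wrapped in a callable. `create_job_safely` and `send` are hypothetical names used only for illustration; the log message and the empty-dict fallback follow the comment above.

```python
import json
import logging
import urllib.error

logger = logging.getLogger(__name__)


def create_job_safely(send) -> dict:
    """Call send() and swallow only JSON and OS/network errors.

    `send` is a hypothetical zero-argument callable standing in for the
    request that creates the scan job.
    """
    try:
        return send()
    except (json.JSONDecodeError, OSError) as e:
        # urllib.error.URLError and HTTPError subclass OSError, as do
        # TimeoutError and ConnectionError, so network failures land here.
        logger.warning(f"KB scan job create failed: {e}")
        return {}
```

With this shape, a `KeyError` or `AttributeError` caused by a programming mistake propagates instead of being logged as a network failure.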
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/fosslight_source/cli.py`:
- Around line 440-446: The scancode_paths set is unused dead code: remove the
creation and any updates to scancode_paths in the block that checks run_kb
(i.e., delete the line creating scancode_paths and the scancode_paths.add(...)
call) and keep the existing logic that iterates extra_candidates and appends
extra_item to scancode_result when extra_item.download_location is present; note
that duplicate filtering is already handled by _collect_kb_file_hashes, so no
replacement logic is necessary.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d9e026aa-46e8-4156-9780-2acc625081c3
📒 Files selected for processing (3)
- src/fosslight_source/_kb_client.py
- src/fosslight_source/_scan_item.py
- src/fosslight_source/cli.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 82-92: The code calls _estimate_job_wait_timeout(int(accepted))
but accepted can be None or non-numeric; validate and coerce accepted before
using it. Replace direct int(accepted) with a small coercion block: read
accepted = created.get("accepted", created.get("total", len(unique_hashes))); if
accepted is not an int, try to convert with int(...) inside a try/except
catching ValueError/TypeError and fall back to created.get("total",
len(unique_hashes)) or 0; use that safe integer (e.g., accepted_count) when
calling _estimate_job_wait_timeout, and optionally log a warning if coercion
failed. Ensure this logic is applied where accepted is used (the logger, skipped
handling, and deadline calculation) so _estimate_job_wait_timeout receives a
valid integer.
- Around line 24-43: The _kb_request function currently builds a URL from
caller-controlled kb_url and calls urllib.request.urlopen without validating the
scheme; parse kb_url (use urllib.parse.urlparse) and enforce scheme in
("http","https") before creating the Request or calling urlopen, raising a
ValueError (or returning an error) for non-HTTP(S) schemes. Also update
fetch_origin_urls_via_scan_job to guard the int(accepted) conversion by
validating accepted is not None and numeric (wrap conversion in try/except
catching TypeError/ValueError) and handle invalid values gracefully
(log/raise/return early) instead of allowing an uncaught exception to abort
polling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fdd9a941-d65c-4c99-8eea-2dbfed3537b6
📒 Files selected for processing (3)
- src/fosslight_source/_kb_client.py
- src/fosslight_source/_scan_item.py
- src/fosslight_source/cli.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/fosslight_source/cli.py
- src/fosslight_source/_scan_item.py
def _kb_request(
    kb_url: str,
    path: str,
    *,
    method: str = "GET",
    payload: dict | None = None,
    kb_token: str = "",
    timeout: int = _SCAN_JOB_REQUEST_TIMEOUT_SEC,
) -> dict:
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(f"{kb_url.rstrip('/')}/{path.lstrip('/')}", data=data, method=method)
    request.add_header("Accept", "application/json")
    if payload is not None:
        request.add_header("Content-Type", "application/json")
    if kb_token:
        request.add_header("Authorization", f"Bearer {kb_token}")

    with urllib.request.urlopen(request, timeout=timeout) as response:
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify where kb_url is sourced/validated and where urlopen is used.
rg -nP --type=py -C3 '\bkb_url\b'
rg -nP --type=py -C3 '\burlopen\s*\(' src/fosslight_source
rg -nP --type=py -C3 '\burlparse\s*\(' src/fosslight_source
Repository: fosslight/fosslight_source_scanner
Length of output: 12364
🏁 Script executed:
#!/bin/bash
# Inspect _kb_client.py around URL construction / urlopen and around accepted parsing.
sed -n '1,160p' src/fosslight_source/_kb_client.py
Repository: fosslight/fosslight_source_scanner
Length of output: 5016
🏁 Script executed:
#!/bin/bash
sed -n '1,120p' src/fosslight_source/_scan_item.py
Repository: fosslight/fosslight_source_scanner
Length of output: 4661
Restrict `kb_url` to HTTP(S) in `_kb_request` before `urllib.request.urlopen`.
`src/fosslight_source/_kb_client.py` builds the request URL from caller-controlled `kb_url` and passes it directly to `urllib.request.urlopen`, which accepts non-HTTP(S) schemes (e.g., `file:`).
`fetch_origin_urls_via_scan_job()` does `int(accepted)` without guarding against `None`/unexpected types, which can raise and abort the polling flow.
🔒 Suggested fix
 import json
 import logging
 import time
+import urllib.parse
 import urllib.error
 import urllib.request
@@
 def _kb_request(
@@
 ) -> dict:
+    parsed = urllib.parse.urlparse(kb_url)
+    if parsed.scheme not in {"http", "https"}:
+        raise ValueError(f"Unsupported KB URL scheme: {parsed.scheme}")
+
     data = None
     if payload is not None:
         data = json.dumps(payload).encode("utf-8")
🧰 Tools
🪛 Ruff (0.15.14)
[error] 36-36: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
[error] 43-43: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
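The scheme check flagged by S310 can be tried in isolation with a small helper like the one below. This is a sketch, not the project's code: `build_kb_url` is a hypothetical name, and the extra empty-host check goes beyond what the suggested fix proposes.

```python
import urllib.parse


def build_kb_url(kb_url: str, path: str) -> str:
    """Join kb_url and path, rejecting anything that is not HTTP(S)."""
    parsed = urllib.parse.urlparse(kb_url)
    if parsed.scheme not in {"http", "https"}:
        raise ValueError(f"Unsupported KB URL scheme: {parsed.scheme!r}")
    # Assumption beyond the review's suggestion: also require a host part.
    if not parsed.netloc:
        raise ValueError("KB URL has no host")
    return f"{kb_url.rstrip('/')}/{path.lstrip('/')}"
```

Validating before the `Request` object is built means a `file:` or scheme-less value fails fast with a `ValueError` rather than reaching `urlopen`.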
accepted = created.get("accepted", created.get("total", len(unique_hashes)))
skipped = created.get("skipped", 0)
logger.info(
    f"KB scan job created: job_id={job_id}, total={created.get('total', len(unique_hashes))}, "
    f"accepted={accepted}, skipped={skipped}"
)
if skipped:
    logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")

deadline = time.monotonic() + _estimate_job_wait_timeout(int(accepted))
interval = _SCAN_JOB_POLL_INTERVAL_SEC
Harden accepted parsing before timeout computation.
Line 91 does int(accepted) without guarding type/shape. If the API returns null or a non-numeric value, this raises and aborts KB enrichment.
🛡️ Suggested fix
-    accepted = created.get("accepted", created.get("total", len(unique_hashes)))
+    accepted = created.get("accepted", created.get("total", len(unique_hashes)))
@@
-    deadline = time.monotonic() + _estimate_job_wait_timeout(int(accepted))
+    try:
+        accepted_count = max(0, int(accepted))
+    except (TypeError, ValueError):
+        accepted_count = len(unique_hashes)
+
+    deadline = time.monotonic() + _estimate_job_wait_timeout(accepted_count)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
accepted = created.get("accepted", created.get("total", len(unique_hashes)))
skipped = created.get("skipped", 0)
logger.info(
    f"KB scan job created: job_id={job_id}, total={created.get('total', len(unique_hashes))}, "
    f"accepted={accepted}, skipped={skipped}"
)
if skipped:
    logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")
try:
    accepted_count = max(0, int(accepted))
except (TypeError, ValueError):
    accepted_count = len(unique_hashes)
deadline = time.monotonic() + _estimate_job_wait_timeout(accepted_count)
interval = _SCAN_JOB_POLL_INTERVAL_SEC
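The coercion can also be factored into a small helper, which makes the fallback behaviour easy to check on its own. `coerce_count` is a hypothetical name, and catching `OverflowError` (raised by `int(float("inf"))`) is an addition not present in the suggestion above.

```python
from typing import Any


def coerce_count(value: Any, fallback: int) -> int:
    """Return value as a non-negative int, or fallback if it cannot be converted."""
    try:
        return max(0, int(value))
    except (TypeError, ValueError, OverflowError):
        # TypeError: None, lists, dicts. ValueError: non-numeric strings, NaN.
        # OverflowError: infinite floats (an extra case assumed here).
        return fallback
```

A call such as `coerce_count(created.get("accepted"), len(unique_hashes))` would then always hand `_estimate_job_wait_timeout` a usable integer.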
Description
Summary by CodeRabbit
New Features
Refactor