Skip to content

Switch KB lookup from query to scan jobs polling#274

Open
soimkim wants to merge 1 commit into
mainfrom
prj
Open

Switch KB lookup from query to scan jobs polling#274
soimkim wants to merge 1 commit into
mainfrom
prj

Conversation

@soimkim
Copy link
Copy Markdown
Contributor

@soimkim soimkim commented May 28, 2026

Description

Summary by CodeRabbit

  • New Features

    • Added batch KB origin-URL retrieval via scan-job with automatic polling, timeout handling, and partial-result support.
  • Refactor

    • Replaced per-file KB lookups with a bulk enrichment flow that applies returned origin URLs to items, uses cached file hashes, and processes discovered extra files more efficiently.

Review Change Stack

@soimkim soimkim self-assigned this May 28, 2026
@soimkim soimkim added the enhancement [PR/Issue] New feature or request label May 28, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

📝 Walkthrough

Walkthrough

Refactors KB origin enrichment to a bulk scan-job flow: adds a KB client that submits file hashes and polls for results, updates SourceItem to accept a kb_origin_urls mapping and apply origin info, and modifies CLI to collect hashes and fetch/apply origins in bulk.

Changes

KB Scan Job Integration

Layer / File(s) Summary
KB Scan Job API Client
src/fosslight_source/_kb_client.py
New HTTP client module with _kb_request helper, polling/timeout constants, and fetch_origin_urls_via_scan_job that creates a remote scan job, polls for completion, and returns a file-hash→origin-url mapping.
SourceItem KB Origin URL Application
src/fosslight_source/_scan_item.py
SourceItem gains kb_origin_url state and _apply_kb_origin_url(); imports hashlib; set_oss_item() signature now accepts `kb_origin_urls: dict[str,str]
CLI Hash Collection and Bulk Origin Lookup
src/fosslight_source/cli.py
Adds _collect_kb_file_hashes() to compute/cache MD5s for scancode items and extra files, imports the KB client, calls fetch_origin_urls_via_scan_job() once when KB is enabled, and applies the returned mapping to items and extra candidates, replacing the prior per-file KB enrichment loop.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI:merge_results
  participant Collector as _collect_kb_file_hashes
  participant KBClient as fetch_origin_urls_via_scan_job
  participant KBServer as KB Scan Job API
  participant Item as SourceItem.set_oss_item

  CLI->>Collector: walk files, compute MD5 hashes
  Collector-->>CLI: hashes + extra candidates
  CLI->>KBClient: submit hashes for bulk lookup
  KBClient->>KBServer: POST /scan/jobs (hash list)
  KBServer-->>KBClient: job_id
  KBClient->>KBServer: poll GET /scan/jobs/{job_id}
  KBServer-->>KBClient: completed results (rows)
  KBClient-->>CLI: file_hash -> origin_url mapping
  CLI->>Item: set_oss_item(kb_origin_urls=mapping)
  Item->>Item: _apply_kb_origin_url for each matched hash
  Item-->>CLI: enriched OssItem instances
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: replacing query-based KB lookup with scan job polling. It accurately reflects the primary architectural shift across the modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch prj

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/fosslight_source/_kb_client.py (1)

73-75: 💤 Low value

Consider catching more specific exceptions instead of bare Exception.

The (json.JSONDecodeError, Exception) catch effectively catches everything. This is intentional for network resilience, but could mask unexpected programming errors. Consider catching OSError (covers socket/connection issues) alongside json.JSONDecodeError for clearer intent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fosslight_source/_kb_client.py` around lines 73 - 75, Replace the
overly-broad except (json.JSONDecodeError, Exception) with a narrower exception
tuple to avoid masking programming errors: catch json.JSONDecodeError and
OSError (or OSError plus any specific network library exceptions you use)
instead, i.e. change the except to except (json.JSONDecodeError, OSError) as e:,
keep the existing logger.warning(f"KB scan job create failed: {e}") and return
{} so only JSON and OS/network issues are swallowed while other exceptions
propagate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/fosslight_source/cli.py`:
- Around line 440-446: The scancode_paths set is unused dead code: remove the
creation and any updates to scancode_paths in the block that checks run_kb
(i.e., delete the line creating scancode_paths and the scancode_paths.add(...)
call) and keep the existing logic that iterates extra_candidates and appends
extra_item to scancode_result when extra_item.download_location is present; note
that duplicate filtering is already handled by _collect_kb_file_hashes, so no
replacement logic is necessary.

---

Nitpick comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 73-75: Replace the overly-broad except (json.JSONDecodeError,
Exception) with a narrower exception tuple to avoid masking programming errors:
catch json.JSONDecodeError and OSError (or OSError plus any specific network
library exceptions you use) instead, i.e. change the except to except
(json.JSONDecodeError, OSError) as e:, keep the existing logger.warning(f"KB
scan job create failed: {e}") and return {} so only JSON and OS/network issues
are swallowed while other exceptions propagate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d9e026aa-46e8-4156-9780-2acc625081c3

📥 Commits

Reviewing files that changed from the base of the PR and between 62f153e and 255fa59.

📒 Files selected for processing (3)
  • src/fosslight_source/_kb_client.py
  • src/fosslight_source/_scan_item.py
  • src/fosslight_source/cli.py

Comment thread src/fosslight_source/cli.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 82-92: The code calls _estimate_job_wait_timeout(int(accepted))
but accepted can be None or non-numeric; validate and coerce accepted before
using it. Replace direct int(accepted) with a small coercion block: read
accepted = created.get("accepted", created.get("total", len(unique_hashes))); if
accepted is not an int, try to convert with int(...) inside a try/except
catching ValueError/TypeError and fall back to created.get("total",
len(unique_hashes)) or 0; use that safe integer (e.g., accepted_count) when
calling _estimate_job_wait_timeout, and optionally log a warning if coercion
failed. Ensure this logic is applied where accepted is used (the logger, skipped
handling, and deadline calculation) so _estimate_job_wait_timeout receives a
valid integer.
- Around line 24-43: The _kb_request function currently builds a URL from
caller-controlled kb_url and calls urllib.request.urlopen without validating the
scheme; parse kb_url (use urllib.parse.urlparse) and enforce scheme in
("http","https") before creating the Request or calling urlopen, raising a
ValueError (or returning an error) for non-HTTP(S) schemes. Also update
fetch_origin_urls_via_scan_job to guard the int(accepted) conversion by
validating accepted is not None and numeric (wrap conversion in try/except
catching TypeError/ValueError) and handle invalid values gracefully
(log/raise/return early) instead of allowing an uncaught exception to abort
polling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fdd9a941-d65c-4c99-8eea-2dbfed3537b6

📥 Commits

Reviewing files that changed from the base of the PR and between 255fa59 and 8af5be9.

📒 Files selected for processing (3)
  • src/fosslight_source/_kb_client.py
  • src/fosslight_source/_scan_item.py
  • src/fosslight_source/cli.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/fosslight_source/cli.py
  • src/fosslight_source/_scan_item.py

Comment on lines +24 to +43
def _kb_request(
kb_url: str,
path: str,
*,
method: str = "GET",
payload: dict | None = None,
kb_token: str = "",
timeout: int = _SCAN_JOB_REQUEST_TIMEOUT_SEC,
) -> dict:
data = None
if payload is not None:
data = json.dumps(payload).encode("utf-8")
request = urllib.request.Request(f"{kb_url.rstrip('/')}/{path.lstrip('/')}", data=data, method=method)
request.add_header("Accept", "application/json")
if payload is not None:
request.add_header("Content-Type", "application/json")
if kb_token:
request.add_header("Authorization", f"Bearer {kb_token}")

with urllib.request.urlopen(request, timeout=timeout) as response:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify where kb_url is sourced/validated and where urlopen is used.
rg -nP --type=py -C3 '\bkb_url\b'
rg -nP --type=py -C3 '\burlopen\s*\(' src/fosslight_source
rg -nP --type=py -C3 '\burlparse\s*\(' src/fosslight_source

Repository: fosslight/fosslight_source_scanner

Length of output: 12364


🏁 Script executed:

#!/bin/bash
# Inspect _kb_client.py around URL construction / urlopen and around accepted parsing.
sed -n '1,160p' src/fosslight_source/_kb_client.py

Repository: fosslight/fosslight_source_scanner

Length of output: 5016


🏁 Script executed:

#!/bin/bash
sed -n '1,120p' src/fosslight_source/_scan_item.py

Repository: fosslight/fosslight_source_scanner

Length of output: 4661


Restrict kb_url to HTTP(S) in _kb_request before urllib.request.urlopen.

  • src/fosslight_source/_kb_client.py builds the request URL from caller-controlled kb_url and passes it directly to urllib.request.urlopen, which accepts non-HTTP(S) schemes (e.g., file:).
  • fetch_origin_urls_via_scan_job() does int(accepted) without guarding against None/unexpected types, which can raise and abort the polling flow.
🔒 Suggested fix
 import json
 import logging
 import time
+import urllib.parse
 import urllib.error
 import urllib.request
@@
 def _kb_request(
@@
 ) -> dict:
+    parsed = urllib.parse.urlparse(kb_url)
+    if parsed.scheme not in {"http", "https"}:
+        raise ValueError(f"Unsupported KB URL scheme: {parsed.scheme}")
+
     data = None
     if payload is not None:
         data = json.dumps(payload).encode("utf-8")
🧰 Tools
🪛 Ruff (0.15.14)

[error] 36-36: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


[error] 43-43: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fosslight_source/_kb_client.py` around lines 24 - 43, The _kb_request
function currently builds a URL from caller-controlled kb_url and calls
urllib.request.urlopen without validating the scheme; parse kb_url (use
urllib.parse.urlparse) and enforce scheme in ("http","https") before creating
the Request or calling urlopen, raising a ValueError (or returning an error) for
non-HTTP(S) schemes. Also update fetch_origin_urls_via_scan_job to guard the
int(accepted) conversion by validating accepted is not None and numeric (wrap
conversion in try/except catching TypeError/ValueError) and handle invalid
values gracefully (log/raise/return early) instead of allowing an uncaught
exception to abort polling.

Comment on lines +82 to +92
accepted = created.get("accepted", created.get("total", len(unique_hashes)))
skipped = created.get("skipped", 0)
logger.info(
f"KB scan job created: job_id={job_id}, total={created.get('total', len(unique_hashes))}, "
f"accepted={accepted}, skipped={skipped}"
)
if skipped:
logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")

deadline = time.monotonic() + _estimate_job_wait_timeout(int(accepted))
interval = _SCAN_JOB_POLL_INTERVAL_SEC
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden accepted parsing before timeout computation.

Line 91 does int(accepted) without guarding type/shape. If the API returns null or a non-numeric value, this raises and aborts KB enrichment.

🛡️ Suggested fix
-    accepted = created.get("accepted", created.get("total", len(unique_hashes)))
+    accepted = created.get("accepted", created.get("total", len(unique_hashes)))
@@
-    deadline = time.monotonic() + _estimate_job_wait_timeout(int(accepted))
+    try:
+        accepted_count = max(0, int(accepted))
+    except (TypeError, ValueError):
+        accepted_count = len(unique_hashes)
+
+    deadline = time.monotonic() + _estimate_job_wait_timeout(accepted_count)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
accepted = created.get("accepted", created.get("total", len(unique_hashes)))
skipped = created.get("skipped", 0)
logger.info(
f"KB scan job created: job_id={job_id}, total={created.get('total', len(unique_hashes))}, "
f"accepted={accepted}, skipped={skipped}"
)
if skipped:
logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")
deadline = time.monotonic() + _estimate_job_wait_timeout(int(accepted))
interval = _SCAN_JOB_POLL_INTERVAL_SEC
accepted = created.get("accepted", created.get("total", len(unique_hashes)))
skipped = created.get("skipped", 0)
logger.info(
f"KB scan job created: job_id={job_id}, total={created.get('total', len(unique_hashes))}, "
f"accepted={accepted}, skipped={skipped}"
)
if skipped:
logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")
try:
accepted_count = max(0, int(accepted))
except (TypeError, ValueError):
accepted_count = len(unique_hashes)
deadline = time.monotonic() + _estimate_job_wait_timeout(accepted_count)
interval = _SCAN_JOB_POLL_INTERVAL_SEC
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fosslight_source/_kb_client.py` around lines 82 - 92, The code calls
_estimate_job_wait_timeout(int(accepted)) but accepted can be None or
non-numeric; validate and coerce accepted before using it. Replace direct
int(accepted) with a small coercion block: read accepted =
created.get("accepted", created.get("total", len(unique_hashes))); if accepted
is not an int, try to convert with int(...) inside a try/except catching
ValueError/TypeError and fall back to created.get("total", len(unique_hashes))
or 0; use that safe integer (e.g., accepted_count) when calling
_estimate_job_wait_timeout, and optionally log a warning if coercion failed.
Ensure this logic is applied where accepted is used (the logger, skipped
handling, and deadline calculation) so _estimate_job_wait_timeout receives a
valid integer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement [PR/Issue] New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant