Switch KB lookup from query to scan jobs polling by soimkim · Pull Request #274 · fosslight/fosslight_source_scanner

soimkim · 2026-05-28T03:07:08Z

Description

Summary by CodeRabbit

New Features
- Added batch KB origin-URL retrieval via scan-job with automatic polling, timeout handling, and partial-result support.
Refactor
- Replaced per-file KB lookups with a bulk enrichment flow that applies returned origin URLs to items, uses cached file hashes, and processes discovered extra files more efficiently.

coderabbitai · 2026-05-28T03:09:34Z

📝 Walkthrough

Walkthrough

Refactors KB origin enrichment to a bulk scan-job flow: adds a KB client that submits file hashes and polls for results, updates SourceItem to accept a kb_origin_urls mapping and apply origin info, and modifies CLI to collect hashes and fetch/apply origins in bulk.

Changes

KB Scan Job Integration

Layer / File(s)	Summary
KB Scan Job API Client `src/fosslight_source/_kb_client.py`	New HTTP client module with `_kb_request` helper, polling/timeout constants, and `fetch_origin_urls_via_scan_job` that creates a remote scan job, polls for completion, and returns a file-hash→origin-url mapping.
SourceItem KB Origin URL Application `src/fosslight_source/_scan_item.py`	`SourceItem` gains `kb_origin_url` state and `_apply_kb_origin_url()`; imports `hashlib`; `set_oss_item()` signature now accepts `kb_origin_urls: dict[str,str]
CLI Hash Collection and Bulk Origin Lookup `src/fosslight_source/cli.py`	Adds `_collect_kb_file_hashes()` to compute/cache MD5s for scancode items and extra files, imports the KB client, calls `fetch_origin_urls_via_scan_job()` once when KB is enabled, and applies the returned mapping to items and extra candidates, replacing the prior per-file KB enrichment loop.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI:merge_results
  participant Collector as _collect_kb_file_hashes
  participant KBClient as fetch_origin_urls_via_scan_job
  participant KBServer as KB Scan Job API
  participant Item as SourceItem.set_oss_item

  CLI->>Collector: walk files, compute MD5 hashes
  Collector-->>CLI: hashes + extra candidates
  CLI->>KBClient: submit hashes for bulk lookup
  KBClient->>KBServer: POST /scan/jobs (hash list)
  KBServer-->>KBClient: job_id
  KBClient->>KBServer: poll GET /scan/jobs/{job_id}
  KBServer-->>KBClient: completed results (rows)
  KBClient-->>CLI: file_hash -> origin_url mapping
  CLI->>Item: set_oss_item(kb_origin_urls=mapping)
  Item->>Item: _apply_kb_origin_url for each matched hash
  Item-->>CLI: enriched OssItem instances

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

fosslight/fosslight_source_scanner#273: Modifies KB integration plumbing in _scan_item.py and cli.py, updating how KB credentials are passed through the call chain.
fosslight/fosslight_source_scanner#254: Also modifies CLI KB processing control and when the KB scanner runs, which affects whether the new bulk KB merge path executes.

Suggested labels

enhancement

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main change: replacing query-based KB lookup with scan job polling. It accurately reflects the primary architectural shift across the modified files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch prj

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/fosslight_source/_kb_client.py (1)
73-75: 💤 Low value

Consider catching more specific exceptions instead of bare Exception.

The (json.JSONDecodeError, Exception) catch effectively catches everything. This is intentional for network resilience, but could mask unexpected programming errors. Consider catching OSError (covers socket/connection issues) alongside json.JSONDecodeError for clearer intent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fosslight_source/_kb_client.py` around lines 73 - 75, Replace the
overly-broad except (json.JSONDecodeError, Exception) with a narrower exception
tuple to avoid masking programming errors: catch json.JSONDecodeError and
OSError (or OSError plus any specific network library exceptions you use)
instead, i.e. change the except to except (json.JSONDecodeError, OSError) as e:,
keep the existing logger.warning(f"KB scan job create failed: {e}") and return
{} so only JSON and OS/network issues are swallowed while other exceptions
propagate.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/fosslight_source/cli.py`:
- Around line 440-446: The scancode_paths set is unused dead code: remove the
creation and any updates to scancode_paths in the block that checks run_kb
(i.e., delete the line creating scancode_paths and the scancode_paths.add(...)
call) and keep the existing logic that iterates extra_candidates and appends
extra_item to scancode_result when extra_item.download_location is present; note
that duplicate filtering is already handled by _collect_kb_file_hashes, so no
replacement logic is necessary.

---

Nitpick comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 73-75: Replace the overly-broad except (json.JSONDecodeError,
Exception) with a narrower exception tuple to avoid masking programming errors:
catch json.JSONDecodeError and OSError (or OSError plus any specific network
library exceptions you use) instead, i.e. change the except to except
(json.JSONDecodeError, OSError) as e:, keep the existing logger.warning(f"KB
scan job create failed: {e}") and return {} so only JSON and OS/network issues
are swallowed while other exceptions propagate.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d9e026aa-46e8-4156-9780-2acc625081c3

📥 Commits

Reviewing files that changed from the base of the PR and between 62f153e and 255fa59.

📒 Files selected for processing (3)

src/fosslight_source/_kb_client.py
src/fosslight_source/_scan_item.py
src/fosslight_source/cli.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 82-92: The code calls _estimate_job_wait_timeout(int(accepted))
but accepted can be None or non-numeric; validate and coerce accepted before
using it. Replace direct int(accepted) with a small coercion block: read
accepted = created.get("accepted", created.get("total", len(unique_hashes))); if
accepted is not an int, try to convert with int(...) inside a try/except
catching ValueError/TypeError and fall back to created.get("total",
len(unique_hashes)) or 0; use that safe integer (e.g., accepted_count) when
calling _estimate_job_wait_timeout, and optionally log a warning if coercion
failed. Ensure this logic is applied where accepted is used (the logger, skipped
handling, and deadline calculation) so _estimate_job_wait_timeout receives a
valid integer.
- Around line 24-43: The _kb_request function currently builds a URL from
caller-controlled kb_url and calls urllib.request.urlopen without validating the
scheme; parse kb_url (use urllib.parse.urlparse) and enforce scheme in
("http","https") before creating the Request or calling urlopen, raising a
ValueError (or returning an error) for non-HTTP(S) schemes. Also update
fetch_origin_urls_via_scan_job to guard the int(accepted) conversion by
validating accepted is not None and numeric (wrap conversion in try/except
catching TypeError/ValueError) and handle invalid values gracefully
(log/raise/return early) instead of allowing an uncaught exception to abort
polling.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fdd9a941-d65c-4c99-8eea-2dbfed3537b6

📥 Commits

Reviewing files that changed from the base of the PR and between 255fa59 and 8af5be9.

📒 Files selected for processing (3)

src/fosslight_source/_kb_client.py
src/fosslight_source/_scan_item.py
src/fosslight_source/cli.py

🚧 Files skipped from review as they are similar to previous changes (2)

src/fosslight_source/cli.py
src/fosslight_source/_scan_item.py

coderabbitai · 2026-05-28T04:32:34Z

+def _kb_request(
+    kb_url: str,
+    path: str,
+    *,
+    method: str = "GET",
+    payload: dict | None = None,
+    kb_token: str = "",
+    timeout: int = _SCAN_JOB_REQUEST_TIMEOUT_SEC,
+) -> dict:
+    data = None
+    if payload is not None:
+        data = json.dumps(payload).encode("utf-8")
+    request = urllib.request.Request(f"{kb_url.rstrip('/')}/{path.lstrip('/')}", data=data, method=method)
+    request.add_header("Accept", "application/json")
+    if payload is not None:
+        request.add_header("Content-Type", "application/json")
+    if kb_token:
+        request.add_header("Authorization", f"Bearer {kb_token}")
+
+    with urllib.request.urlopen(request, timeout=timeout) as response:


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify where kb_url is sourced/validated and where urlopen is used. rg -nP --type=py -C3 '\bkb_url\b' rg -nP --type=py -C3 '\burlopen\s*\(' src/fosslight_source rg -nP --type=py -C3 '\burlparse\s*\(' src/fosslight_source

Repository: fosslight/fosslight_source_scanner

Length of output: 12364

🏁 Script executed:

#!/bin/bash # Inspect _kb_client.py around URL construction / urlopen and around accepted parsing. sed -n '1,160p' src/fosslight_source/_kb_client.py

Repository: fosslight/fosslight_source_scanner

Length of output: 5016

🏁 Script executed:

#!/bin/bash sed -n '1,120p' src/fosslight_source/_scan_item.py

Repository: fosslight/fosslight_source_scanner

Length of output: 4661

Restrict kb_url to HTTP(S) in _kb_request before urllib.request.urlopen.

src/fosslight_source/_kb_client.py builds the request URL from caller-controlled kb_url and passes it directly to urllib.request.urlopen, which accepts non-HTTP(S) schemes (e.g., file:).

fetch_origin_urls_via_scan_job() does int(accepted) without guarding against None/unexpected types, which can raise and abort the polling flow.

🔒 Suggested fix

import json import logging import time +import urllib.parse import urllib.error import urllib.request @@ def _kb_request( @@ ) -> dict: + parsed = urllib.parse.urlparse(kb_url) + if parsed.scheme not in {"http", "https"}: + raise ValueError(f"Unsupported KB URL scheme: {parsed.scheme}") + data = None if payload is not None: data = json.dumps(payload).encode("utf-8")

🧰 Tools

🪛 Ruff (0.15.14)

[error] 36-36: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

[error] 43-43: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/fosslight_source/_kb_client.py` around lines 24 - 43, The _kb_request function currently builds a URL from caller-controlled kb_url and calls urllib.request.urlopen without validating the scheme; parse kb_url (use urllib.parse.urlparse) and enforce scheme in ("http","https") before creating the Request or calling urlopen, raising a ValueError (or returning an error) for non-HTTP(S) schemes. Also update fetch_origin_urls_via_scan_job to guard the int(accepted) conversion by validating accepted is not None and numeric (wrap conversion in try/except catching TypeError/ValueError) and handle invalid values gracefully (log/raise/return early) instead of allowing an uncaught exception to abort polling.

coderabbitai · 2026-05-28T04:32:34Z

+    accepted = created.get("accepted", created.get("total", len(unique_hashes)))
+    skipped = created.get("skipped", 0)
+    logger.info(
+        f"KB scan job created: job_id={job_id}, total={created.get('total', len(unique_hashes))}, "
+        f"accepted={accepted}, skipped={skipped}"
+    )
+    if skipped:
+        logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")
+
+    deadline = time.monotonic() + _estimate_job_wait_timeout(int(accepted))
+    interval = _SCAN_JOB_POLL_INTERVAL_SEC


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden accepted parsing before timeout computation.

Line 91 does int(accepted) without guarding type/shape. If the API returns null or a non-numeric value, this raises and aborts KB enrichment.

🛡️ Suggested fix

- accepted = created.get("accepted", created.get("total", len(unique_hashes))) + accepted = created.get("accepted", created.get("total", len(unique_hashes))) @@ - deadline = time.monotonic() + _estimate_job_wait_timeout(int(accepted)) + try: + accepted_count = max(0, int(accepted)) + except (TypeError, ValueError): + accepted_count = len(unique_hashes) + + deadline = time.monotonic() + _estimate_job_wait_timeout(accepted_count)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

accepted = created.get("accepted", created.get("total", len(unique_hashes)))

skipped = created.get("skipped", 0)

logger.info(

f"KB scan job created: job_id={job_id}, total={created.get('total', len(unique_hashes))}, "

f"accepted={accepted}, skipped={skipped}"

)

if skipped:

logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")

deadline = time.monotonic() + _estimate_job_wait_timeout(int(accepted))

interval = _SCAN_JOB_POLL_INTERVAL_SEC

accepted = created.get("accepted", created.get("total", len(unique_hashes)))

skipped = created.get("skipped", 0)

logger.info(

f"KB scan job created: job_id={job_id}, total={created.get('total', len(unique_hashes))}, "

f"accepted={accepted}, skipped={skipped}"

)

if skipped:

logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")

try:

accepted_count = max(0, int(accepted))

except (TypeError, ValueError):

accepted_count = len(unique_hashes)

deadline = time.monotonic() + _estimate_job_wait_timeout(accepted_count)

interval = _SCAN_JOB_POLL_INTERVAL_SEC

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/fosslight_source/_kb_client.py` around lines 82 - 92, The code calls _estimate_job_wait_timeout(int(accepted)) but accepted can be None or non-numeric; validate and coerce accepted before using it. Replace direct int(accepted) with a small coercion block: read accepted = created.get("accepted", created.get("total", len(unique_hashes))); if accepted is not an int, try to convert with int(...) inside a try/except catching ValueError/TypeError and fall back to created.get("total", len(unique_hashes)) or 0; use that safe integer (e.g., accepted_count) when calling _estimate_job_wait_timeout, and optionally log a warning if coercion failed. Ensure this logic is applied where accepted is used (the logger, skipped handling, and deadline calculation) so _estimate_job_wait_timeout receives a valid integer.

soimkim self-assigned this May 28, 2026

soimkim added the enhancement [PR/Issue] New feature or request label May 28, 2026

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Comment thread src/fosslight_source/cli.py Outdated

feat: switch KB lookup from query to scan jobs polling

8af5be9

soimkim force-pushed the prj branch from 255fa59 to 8af5be9 Compare May 28, 2026 04:28

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch KB lookup from query to scan jobs polling#274

Switch KB lookup from query to scan jobs polling#274
soimkim wants to merge 1 commit into
mainfrom
prj

soimkim commented May 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 28, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 28, 2026

Uh oh!

coderabbitai Bot May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

soimkim commented May 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

soimkim commented May 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading