Retry with browser and curl-like UA on mirror blocks by soimkim · Pull Request #266 · fosslight/fosslight_util

soimkim · 2026-04-21T23:09:43Z

Description

Retry with browser and curl-like UA on mirror blocks.

- Enhanced download reliability with multi-attempt fallback logic for problematic servers
- Improved filename extraction from server response headers
- Better detection and handling of invalid downloads

coderabbitai · 2026-04-21T23:09:55Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d777d6c7-8cbc-46aa-92ec-8f2e37f516ed

📥 Commits

Reviewing files that changed from the base of the PR and between 5e403f3 and 2cfd7ee.

📒 Files selected for processing (1)

src/fosslight_util/download.py

📝 Walkthrough

The download.py module now includes retry logic with multiple HTTP User-Agent header configurations. is_downloadable() checks downloadability across header attempts, continuing only for archive extensions on HTTP 403/HTML responses. download_file() refactored to support per-attempt downloads with Content-Disposition filename extraction and retry behavior on HTTP 403 or HTML-for-archive scenarios.
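The retry condition described in the walkthrough can be pictured with a minimal sketch. The suffix list, the User-Agent strings, and the names `looks_like_archive` and `should_retry` are assumptions made for illustration; the PR's actual `_url_looks_like_binary_archive()` and `_download_http_header_attempts()` bodies are not shown in this conversation.

```python
from urllib.parse import urlparse

# Hypothetical suffix list; the real one in download.py may differ.
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip", ".jar", ".whl")

# Hypothetical header sets, ordered from library default to curl-like.
HEADER_ATTEMPTS = [
    {},  # whatever the HTTP library sends by default
    {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"},
    {"User-Agent": "curl/8.5.0", "Accept": "*/*"},
]


def looks_like_archive(url: str) -> bool:
    """Return True when the URL path ends with a known archive suffix."""
    path = urlparse(url).path.lower()  # ignores query string and fragment
    return path.endswith(ARCHIVE_SUFFIXES)


def should_retry(url: str, status: int, content_type: str) -> bool:
    """Retry only for archive URLs that got HTTP 403 or an HTML body."""
    if not looks_like_archive(url):
        return False
    return status == 403 or "text/html" in content_type.lower()
```

Restricting retries to archive-looking URLs keeps ordinary HTML pages, which legitimately return `text/html`, from triggering extra requests.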

Changes

Cohort / File(s)	Summary
HTTP Download Retry & Header Logic `src/fosslight_util/download.py`	Added retry mechanism with multiple User-Agent header configurations for `download_file()` and enhanced `is_downloadable()` to conditionally retry on HTTP 403 and HTML responses based on URL archive patterns. Introduced `_download_file_once()` helper for per-attempt operations, `_url_looks_like_binary_archive()` for extension detection, `_download_http_header_attempts()` for header building, and internal `_HtmlWhenBinaryExpected` exception for archive-returns-HTML scenarios. Filename derivation now prioritizes Content-Disposition header with URL basename fallback.

Sequence Diagram

sequenceDiagram
    actor Caller as Caller
    participant DL as download_file()
    participant HTTP as HTTP Server
    participant Helper as _download_file_once()

    Caller->>DL: download_file(url, target_dir)
    DL->>DL: _download_http_header_attempts()
    
    loop For each header attempt
        DL->>Helper: Call _download_file_once()
        Helper->>HTTP: HEAD request (with headers)
        HTTP-->>Helper: Response (status, Content-Disposition)
        
        alt 200 OK
            Helper->>HTTP: GET request
            HTTP-->>Helper: Binary data
            Helper->>Helper: Extract filename from Content-Disposition
            Helper-->>DL: Success (filepath)
            DL-->>Caller: Return filepath
        else 403 Forbidden
            Helper-->>DL: _HtmlWhenBinaryExpected or 403
            Note over DL: Continue to next header attempt
        else HTML when archive expected
            Helper-->>DL: _HtmlWhenBinaryExpected
            Note over DL: Retry next header attempt
        else Other status
            Helper-->>DL: Return error
            DL-->>Caller: Return None after all attempts exhausted
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: implementing retry logic with alternate User-Agent headers when downloads are blocked.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


- Set oss_version from tarball/archive filename after wget for clarified_version

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

src/fosslight_util/download.py (1)
87-112: Minor: broad except defeats UA retries on first-attempt network errors.

On line 109, any exception during the first attempt returns False immediately, bypassing the UA retry loop. If the intent is only to avoid retrying genuine network/TLS errors, this is fine; but if a mirror drops the connection specifically on the default UA (a common anti-bot behavior adjacent to the 403 case this PR addresses), retries never happen. Consider continue-ing (with a warning) when i < last_i and only returning False after the last attempt.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_util/download.py` around lines 87 - 112, In is_downloadable,
the broad except currently returns False on any exception and prevents later
User-Agent retries; change the exception handling so that on failure you log the
warning (including the exception) and if i < last_i simply continue to the next
headers attempt, only returning False after the final attempt (i == last_i),
ensuring attempts (from _download_http_header_attempts) are honored; update the
except block in is_downloadable to implement this conditional continue/return
behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d4a28e32-2ca4-4504-a041-207f49483ea9

📥 Commits

Reviewing files that changed from the base of the PR and between c9d8671 and 5e403f3.

📒 Files selected for processing (1)

src/fosslight_util/download.py

coderabbitai · 2026-04-21T23:12:41Z

+        filename = ""
+        cd = r.headers.get("Content-Disposition") or head_headers.get(
+            "Content-Disposition"
+        )
+        if cd:
+            m_star = re.search(r"filename\*=(?:UTF-8'')?([^;\r\n]+)", cd)
+            if m_star:
+                filename = urllib.parse.unquote(m_star.group(1).strip('"\''))
            else:
-                local_path = target_dir
+                m = re.search(r"filename=([^;\r\n]+)", cd)
+                if m:
+                    filename = m.group(1).strip('"\'')
+        if not filename:
+            final_for_name = r.url or final_url
+            filename = os.path.basename(urllib.parse.urlparse(final_for_name).path)
+            if not filename:
+                filename = "downloaded_file"
+        if os.path.isdir(target_dir):
+            local_path = os.path.join(target_dir, filename)
+        else:
+            local_path = target_dir


⚠️ Potential issue | 🟠 Major

Path traversal risk: sanitize filenames from Content-Disposition.

The filename parsed from Content-Disposition is passed directly to os.path.join(target_dir, filename) with no sanitization. A hostile or misconfigured mirror (exactly the scenario this PR is hardening against) can respond with e.g. Content-Disposition: attachment; filename="../../../evil.sh" and cause the download to be written outside target_dir. Note the URL-path fallback already applies os.path.basename, but the header branch does not.

Strip directory components (and ideally NUL / path separators) before joining.

🛡️ Proposed fix

         if cd:
             m_star = re.search(r"filename\*=(?:UTF-8'')?([^;\r\n]+)", cd)
             if m_star:
                 filename = urllib.parse.unquote(m_star.group(1).strip('"\''))
             else:
                 m = re.search(r"filename=([^;\r\n]+)", cd)
                 if m:
                     filename = m.group(1).strip('"\'')
+        # Never trust server-provided filenames: strip any path components.
+        if filename:
+            filename = os.path.basename(filename.replace("\\", "/").strip()).lstrip(".") or ""
         if not filename:
             final_for_name = r.url or final_url
             filename = os.path.basename(urllib.parse.urlparse(final_for_name).path)
             if not filename:
                 filename = "downloaded_file"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/fosslight_util/download.py` around lines 688 - 708, The parsed filename from the Content-Disposition branch (the variable filename used before computing local_path and joining with target_dir) must be sanitized to prevent path traversal: after extracting filename in the code that handles Content-Disposition (the block that sets filename via m_star or m), strip any directory components (e.g., run os.path.basename on filename), remove or replace any NULs and path separators (both os.sep and os.altsep, forward/back slashes) and collapse any .. sequences so the final filename is a simple basename-safe token before using os.path.join(target_dir, filename) to produce local_path; apply the same sanitization helper to any header-derived filename to match the URL-path fallback behavior.
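
One way to package the sanitization as a reusable helper is sketched below. The name `sanitize_header_filename` and the `default` parameter are hypothetical; the `"downloaded_file"` fallback mirrors the one in the diff. Like the proposed fix, it strips leading dots, so a name such as `.bashrc` would become `bashrc`.

```python
import os


def sanitize_header_filename(raw: str, default: str = "downloaded_file") -> str:
    """Reduce a server-supplied filename to a bare, non-hidden basename."""
    # Drop NULs, normalise Windows separators, and trim quotes/whitespace.
    cleaned = raw.replace("\x00", "").replace("\\", "/").strip().strip("\"'")
    # Keep only the last path component.
    name = os.path.basename(cleaned)
    # Reject "." and ".." and strip leading dots from anything else.
    name = name.lstrip(".").strip()
    return name or default


target_dir = "/tmp/downloads"
local_path = os.path.join(target_dir, sanitize_header_filename("../../../evil.sh"))
```

After sanitization the joined path stays directly under `target_dir`, which is the property the URL-path fallback already has through `os.path.basename`.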

coderabbitai · 2026-04-21T23:12:41Z

+            m_star = re.search(r"filename\*=(?:UTF-8'')?([^;\r\n]+)", cd)
+            if m_star:
+                filename = urllib.parse.unquote(m_star.group(1).strip('"\''))


⚠️ Potential issue | 🟡 Minor

RFC 5987 filename* regex misses the language tag form.

Per RFC 5987, the filename* value is charset'lang'percent-encoded-value, and lang is frequently empty but can be non-empty (e.g. UTF-8'en'file.txt). The current pattern filename\*=(?:UTF-8'')?([^;\r\n]+) only strips the literal UTF-8'' prefix, so a value like filename*=UTF-8'en'file.txt is captured verbatim and urllib.parse.unquote leaves the UTF-8'en' prefix embedded in the filename. Also, non-UTF-8 charsets (e.g. ISO-8859-1''...) aren’t handled.

♻️ Proposed fix

-            m_star = re.search(r"filename\*=(?:UTF-8'')?([^;\r\n]+)", cd)
-            if m_star:
-                filename = urllib.parse.unquote(m_star.group(1).strip('"\''))
+            m_star = re.search(
+                r"filename\*=(?:[\w-]+)?'[^']*'([^;\r\n]+)", cd, re.IGNORECASE
+            )
+            if m_star:
+                filename = urllib.parse.unquote(m_star.group(1).strip('"\''))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/fosslight_util/download.py` around lines 693 - 695, The regex and handling around m_star in download.py incorrectly assume a literal UTF-8'' prefix; update the detection and parsing for filename* to capture the charset, optional language tag, and the percent-encoded value (e.g. match filename\*=(?P<charset>[^']*)'(?:[^']*)'(?P<enc>[^;\r\n]+)), then replace the current urllib.parse.unquote call: extract charset (default to 'utf-8' if missing), percent-decode to bytes (urllib.parse.unquote_to_bytes) and decode with the captured charset (fall back safely if decoding fails), assigning the result back to filename; keep using the existing m_star and filename symbols so the rest of the function continues to work.
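
The charset-aware decoding described in the prompt could be sketched as follows. The helper name `parse_filename_star` and the exact pattern are assumptions, not the code that was merged; defaulting an empty charset to UTF-8 follows the suggestion above rather than the RFC, which requires a charset.

```python
import re
import urllib.parse
from typing import Optional

# ext-value layout: charset ' [language] ' percent-encoded-value
_EXT_VALUE = re.compile(
    r"filename\*\s*=\s*(?P<charset>[^';\s]*)'(?P<lang>[^']*)'(?P<enc>[^;\r\n]+)",
    re.IGNORECASE,
)


def parse_filename_star(content_disposition: str) -> Optional[str]:
    """Decode a filename* parameter, or return None if it is absent."""
    m = _EXT_VALUE.search(content_disposition)
    if not m:
        return None
    charset = m.group("charset") or "utf-8"
    # Percent-decode to raw bytes first, then apply the declared charset.
    raw = urllib.parse.unquote_to_bytes(m.group("enc").strip())
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or mismatched charset: fall back to lenient UTF-8.
        return raw.decode("utf-8", errors="replace")
```

The decoded value is still untrusted input and should be reduced to a bare basename before it is joined with `target_dir`.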

fix(download): retry with browser and curl-like UA on mirror blocks

5e403f3

soimkim self-assigned this Apr 21, 2026

soimkim added the bug fix [PR] Fix the bug label Apr 21, 2026

feat(download): improve mirror downloads and wget archive version hints

2cfd7ee

- Set oss_version from tarball/archive filename after wget for clarified_version

coderabbitai Bot reviewed Apr 21, 2026

soimkim added enhancement [PR/Issue] New feature or request and removed bug fix [PR] Fix the bug labels Apr 21, 2026

soimkim merged commit 37e2027 into main Apr 21, 2026
7 of 8 checks passed

coderabbitai Bot mentioned this pull request Apr 27, 2026

Improve HTTP mirror and direct archive handling #271

Merged

soimkim merged 2 commits into main from link
