Retry with browser and curl-like UA on mirror blocks #266

Merged
soimkim merged 2 commits into main from link
Apr 21, 2026

Conversation

@soimkim
Contributor

@soimkim soimkim commented Apr 21, 2026

Description

Retry with browser and curl-like UA on mirror blocks.

  • Enhanced download reliability with multi-attempt fallback logic for problematic servers
  • Improved filename extraction from server response headers
  • Better detection and handling of invalid downloads

@soimkim soimkim self-assigned this Apr 21, 2026
@coderabbitai

coderabbitai Bot commented Apr 21, 2026

Warning

Rate limit exceeded

@soimkim has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 56 minutes and 59 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 56 minutes and 59 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d777d6c7-8cbc-46aa-92ec-8f2e37f516ed

📥 Commits

Reviewing files that changed from the base of the PR and between 5e403f3 and 2cfd7ee.

📒 Files selected for processing (1)
  • src/fosslight_util/download.py
📝 Walkthrough

Walkthrough

The download.py module now retries requests with several HTTP User-Agent header configurations. is_downloadable() checks downloadability across the header attempts and moves on to the next attempt only when the URL has an archive extension and the server answered with HTTP 403 or an HTML response. download_file() was refactored to download once per attempt, derive the filename from the Content-Disposition header, and retry on HTTP 403 or when HTML is returned for an archive URL.
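As a rough illustration of the two helper ideas mentioned in the walkthrough (archive detection by extension and a list of header sets to try), here is a minimal sketch. The extension list, the User-Agent strings, and the function names are assumptions made for the example; the values actually used in download.py are not shown in this review.

```python
from urllib.parse import urlparse

# Hypothetical extension list; the real set in download.py is not shown here.
ARCHIVE_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip", ".jar", ".whl")


def url_looks_like_binary_archive(url: str) -> bool:
    """Return True when the URL path (query string ignored) ends with an archive extension."""
    path = urlparse(url).path.lower()
    return path.endswith(ARCHIVE_EXTENSIONS)


def header_attempts() -> list:
    """Build the header sets to try in order: defaults, browser-like, curl-like."""
    return [
        {},  # first attempt: whatever the HTTP library sends by default
        {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)", "Accept": "*/*"},  # placeholder UA
        {"User-Agent": "curl/8.0.0", "Accept": "*/*"},  # placeholder UA
    ]
```

Keeping the attempts in one ordered list means a caller can iterate over it and stop at the first header set that the mirror accepts.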

Changes

| Cohort / File(s) | Summary |
|---|---|
| HTTP Download Retry & Header Logic: src/fosslight_util/download.py | Added retry mechanism with multiple User-Agent header configurations for download_file() and enhanced is_downloadable() to conditionally retry on HTTP 403 and HTML responses based on URL archive patterns. Introduced _download_file_once() helper for per-attempt operations, _url_looks_like_binary_archive() for extension detection, _download_http_header_attempts() for header building, and internal _HtmlWhenBinaryExpected exception for archive-returns-HTML scenarios. Filename derivation now prioritizes Content-Disposition header with URL basename fallback. |

Sequence Diagram

sequenceDiagram
    actor Caller as Caller
    participant DL as download_file()
    participant HTTP as HTTP Server
    participant Helper as _download_file_once()

    Caller->>DL: download_file(url, target_dir)
    DL->>DL: _download_http_header_attempts()
    
    loop For each header attempt
        DL->>Helper: Call _download_file_once()
        Helper->>HTTP: HEAD request (with headers)
        HTTP-->>Helper: Response (status, Content-Disposition)
        
        alt 200 OK
            Helper->>HTTP: GET request
            HTTP-->>Helper: Binary data
            Helper->>Helper: Extract filename from Content-Disposition
            Helper-->>DL: Success (filepath)
            DL-->>Caller: Return filepath
        else 403 Forbidden
            Helper-->>DL: _HtmlWhenBinaryExpected or 403
            Note over DL: Continue to next header attempt
        else HTML when archive expected
            Helper-->>DL: _HtmlWhenBinaryExpected
            Note over DL: Retry next header attempt
        else Other status
            Helper-->>DL: Return error
            DL-->>Caller: Return None after all attempts exhausted
        end
    end
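The control flow in the diagram can be sketched as a loop over header sets with the single-attempt download injected as a callable, so the logic can be exercised without a network. The exception classes and `download_with_retries` are hypothetical stand-ins, not the names used in download.py, and treating any other error as final is an assumption about the "Other status" branch.

```python
from typing import Callable, Optional, Sequence


class HtmlWhenBinaryExpected(Exception):
    """Stand-in: an archive URL answered with an HTML page."""


class Forbidden(Exception):
    """Stand-in: the server answered HTTP 403."""


def download_with_retries(
    url: str,
    attempts: Sequence[dict],
    fetch_once: Callable[[str, dict], str],
) -> Optional[str]:
    """Try each header set in turn; return the saved path, or None on failure."""
    for headers in attempts:
        try:
            return fetch_once(url, headers)  # success ends the loop immediately
        except (Forbidden, HtmlWhenBinaryExpected):
            continue  # blocked with this header set: try the next one
        except Exception:
            return None  # assumption: any other failure is not retried
    return None  # every header set was rejected
```

A caller would pass the real per-attempt download function as `fetch_once`; a test can pass a fake that raises `Forbidden` for the default headers and succeeds for the browser-like ones.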

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: implementing retry logic with alternate User-Agent headers when downloads are blocked. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@soimkim soimkim added the bug fix [PR] Fix the bug label Apr 21, 2026
- Set oss_version from tarball/archive filename after wget for clarified_version

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/fosslight_util/download.py (1)

87-112: Minor: broad except defeats UA retries on first-attempt network errors.

On line 109, any exception during the first attempt returns False immediately, bypassing the UA retry loop. If the intent is only to avoid retrying genuine network/TLS errors, this is fine; but if a mirror drops the connection specifically on the default UA (a common anti-bot behavior adjacent to the 403 case this PR addresses), retries never happen. Consider continue-ing (with a warning) when i < last_i and only returning False after the last attempt.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_util/download.py` around lines 87 - 112, In is_downloadable,
the broad except currently returns False on any exception and prevents later
User-Agent retries; change the exception handling so that on failure you log the
warning (including the exception) and if i < last_i simply continue to the next
headers attempt, only returning False after the final attempt (i == last_i),
ensuring attempts (from _download_http_header_attempts) are honored; update the
except block in is_downloadable to implement this conditional continue/return
behavior.
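The suggested change to the except block can be sketched as follows. This is a simplified model, not the code in download.py: the per-attempt request is injected as a hypothetical `probe` callable, and the sketch simply moves on when the probe returns False, which glosses over the 403/HTML conditions of the real function.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def is_downloadable_sketch(
    url: str,
    attempts: Sequence[dict],
    probe: Callable[[str, dict], bool],
) -> bool:
    """Return True as soon as one header set yields a downloadable response."""
    last_i = len(attempts) - 1
    for i, headers in enumerate(attempts):
        try:
            if probe(url, headers):
                return True
        except Exception as ex:
            logger.warning("probe failed for %s (attempt %d): %s", url, i + 1, ex)
            if i < last_i:
                continue  # give the remaining header sets a chance
            return False  # only give up after the final attempt
    return False
```

With this shape, a mirror that resets the connection for the default User-Agent no longer prevents the browser-like and curl-like attempts from running.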

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d4a28e32-2ca4-4504-a041-207f49483ea9

📥 Commits

Reviewing files that changed from the base of the PR and between c9d8671 and 5e403f3.

📒 Files selected for processing (1)
  • src/fosslight_util/download.py

Comment on lines +688 to +708
        filename = ""
        cd = r.headers.get("Content-Disposition") or head_headers.get(
            "Content-Disposition"
        )
        if cd:
            m_star = re.search(r"filename\*=(?:UTF-8'')?([^;\r\n]+)", cd)
            if m_star:
                filename = urllib.parse.unquote(m_star.group(1).strip('"\''))
            else:
                m = re.search(r"filename=([^;\r\n]+)", cd)
                if m:
                    filename = m.group(1).strip('"\'')
        if not filename:
            final_for_name = r.url or final_url
            filename = os.path.basename(urllib.parse.urlparse(final_for_name).path)
            if not filename:
                filename = "downloaded_file"
        if os.path.isdir(target_dir):
            local_path = os.path.join(target_dir, filename)
        else:
            local_path = target_dir

⚠️ Potential issue | 🟠 Major

Path traversal risk: sanitize filenames from Content-Disposition.

The filename parsed from Content-Disposition is passed directly to os.path.join(target_dir, filename) with no sanitization. A hostile or misconfigured mirror (exactly the scenario this PR is hardening against) can respond with e.g. Content-Disposition: attachment; filename="../../../evil.sh" and cause the download to be written outside target_dir. Note the URL-path fallback already applies os.path.basename, but the header branch does not.

Strip directory components (and ideally NUL / path separators) before joining.

🛡️ Proposed fix
         if cd:
             m_star = re.search(r"filename\*=(?:UTF-8'')?([^;\r\n]+)", cd)
             if m_star:
                 filename = urllib.parse.unquote(m_star.group(1).strip('"\''))
             else:
                 m = re.search(r"filename=([^;\r\n]+)", cd)
                 if m:
                     filename = m.group(1).strip('"\'')
+        # Never trust server-provided filenames: strip any path components.
+        if filename:
+            filename = os.path.basename(filename.replace("\\", "/").strip()).lstrip(".") or ""
         if not filename:
             final_for_name = r.url or final_url
             filename = os.path.basename(urllib.parse.urlparse(final_for_name).path)
             if not filename:
                 filename = "downloaded_file"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_util/download.py` around lines 688 - 708, The parsed filename
from the Content-Disposition branch (the variable filename used before computing
local_path and joining with target_dir) must be sanitized to prevent path
traversal: after extracting filename in the code that handles
Content-Disposition (the block that sets filename via m_star or m), strip any
directory components (e.g., run os.path.basename on filename), remove or replace
any NULs and path separators (both os.sep and os.altsep, forward/back slashes)
and collapse any .. sequences so the final filename is a simple basename-safe
token before using os.path.join(target_dir, filename) to produce local_path;
apply the same sanitization helper to any header-derived filename to match the
URL-path fallback behavior.
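One way to implement the sanitization described above is a small helper that reduces the header value to a bare basename, plus a join that checks the result stays inside the target directory. Both function names are hypothetical; the `"downloaded_file"` fallback mirrors the one in the quoted code. This is a sketch under those assumptions, not the fix that was merged.

```python
import os


def sanitize_header_filename(name: str) -> str:
    """Reduce a server-supplied filename to a bare basename without path components."""
    # Drop NUL bytes and normalise Windows separators before taking the basename.
    cleaned = name.replace("\x00", "").replace("\\", "/").strip()
    base = os.path.basename(cleaned)
    # "." and ".." are not usable filenames; return "" so the caller falls back.
    if base in ("", ".", ".."):
        return ""
    return base


def safe_join(target_dir: str, name: str, fallback: str = "downloaded_file") -> str:
    """Join target_dir with the sanitized name and verify the result stays inside it."""
    filename = sanitize_header_filename(name) or fallback
    root = os.path.realpath(target_dir)
    path = os.path.realpath(os.path.join(root, filename))
    if os.path.commonpath([root, path]) != root:
        raise ValueError(f"refusing to write outside {target_dir!r}")
    return path
```

The containment check in `safe_join` is a second line of defence: even if the basename step were bypassed, for example through a symlink already present in the target directory, the resolved path is still compared against the resolved root.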

Comment on lines +693 to +695
            m_star = re.search(r"filename\*=(?:UTF-8'')?([^;\r\n]+)", cd)
            if m_star:
                filename = urllib.parse.unquote(m_star.group(1).strip('"\''))

⚠️ Potential issue | 🟡 Minor

RFC 5987 filename* regex misses the language tag form.

Per RFC 5987, the filename* value is charset'lang'percent-encoded-value, and lang is frequently empty but can be non-empty (e.g. UTF-8'en'file.txt). The current pattern filename\*=(?:UTF-8'')?([^;\r\n]+) only strips the literal UTF-8'' prefix, so a value like filename*=UTF-8'en'file.txt is captured verbatim and urllib.parse.unquote leaves the UTF-8'en' prefix embedded in the filename. Also, non-UTF-8 charsets (e.g. ISO-8859-1''...) aren’t handled.

♻️ Proposed fix
-            m_star = re.search(r"filename\*=(?:UTF-8'')?([^;\r\n]+)", cd)
-            if m_star:
-                filename = urllib.parse.unquote(m_star.group(1).strip('"\''))
+            m_star = re.search(
+                r"filename\*=(?:[\w-]+)?'[^']*'([^;\r\n]+)", cd, re.IGNORECASE
+            )
+            if m_star:
+                filename = urllib.parse.unquote(m_star.group(1).strip('"\''))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_util/download.py` around lines 693 - 695, The regex and
handling around m_star in download.py incorrectly assume a literal UTF-8''
prefix; update the detection and parsing for filename* to capture the charset,
optional language tag, and the percent-encoded value (e.g. match
filename\*=(?P<charset>[^']*)'(?:[^']*)'(?P<enc>[^;\r\n]+)), then replace the
current urllib.parse.unquote call: extract charset (default to 'utf-8' if
missing), percent-decode to bytes (urllib.parse.unquote_to_bytes) and decode
with the captured charset (fall back safely if decoding fails), assigning the
result back to filename; keep using the existing m_star and filename symbols so
the rest of the function continues to work.
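The charset-aware parsing described in the prompt above could look roughly like this. `parse_filename_star` is a hypothetical helper, and the fallback to a lossy UTF-8 decode is one possible reading of "fall back safely"; only `re` and `urllib.parse.unquote_to_bytes` from the standard library are used.

```python
import re
from urllib.parse import unquote_to_bytes

# ext-value = charset "'" [ language ] "'" percent-encoded-value
_FILENAME_STAR = re.compile(
    r"filename\*\s*=\s*(?P<charset>[^';\s]*)'(?P<lang>[^']*)'(?P<enc>[^;\r\n]+)",
    re.IGNORECASE,
)


def parse_filename_star(content_disposition: str) -> str:
    """Return the decoded filename* value, or "" when the parameter is absent."""
    m = _FILENAME_STAR.search(content_disposition)
    if not m:
        return ""
    charset = m.group("charset") or "utf-8"  # default when the charset is empty
    raw = unquote_to_bytes(m.group("enc").strip().strip('"'))
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset or invalid bytes: fall back to a lossy UTF-8 decode.
        return raw.decode("utf-8", errors="replace")
```

The decoded value is still untrusted input, so it would need the same path sanitization as any other header-derived filename before being joined with the target directory.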

@soimkim soimkim added enhancement [PR/Issue] New feature or request and removed bug fix [PR] Fix the bug labels Apr 21, 2026
@soimkim soimkim merged commit 37e2027 into main Apr 21, 2026
7 of 8 checks passed
