Add HTTP timeout to read_file network fetch #539

Open
Spectual wants to merge 1 commit into openai:main from Spectual:fix/load-add-http-timeout

Spectual commented May 9, 2026

Problem

read_file in tiktoken/load.py calls requests.get(blobpath) without a timeout= argument when fetching tokenizer data over HTTP/HTTPS:

# tiktoken/load.py (current main)
resp = requests.get(blobpath)
resp.raise_for_status()
return resp.content

requests documents this as a footgun in production code:

Without a timeout, your code may hang for minutes or more.
https://requests.readthedocs.io/en/latest/user/advanced/#timeouts

In tiktoken's case, the failure mode is encoding_for_model("gpt-4o") (or any other not-yet-cached model) silently hanging on first use whenever the network path to openaipublic.blob.core.windows.net is unhealthy — DNS resolution stalls, SYN black-holes, captive portals, broken corporate proxies, mid-flight TCP resets. The user sees a wedged process with no traceback and no way to interrupt short of Ctrl-C/SIGKILL.

This is independent of #514 (TIKTOKEN_OFFLINE): that PR short-circuits before any network call when the user opts in to offline mode. This PR fixes the case where the network call does happen but the peer doesn't respond.

Fix

Pass an explicit timeout (default 60 s) to requests.get, configurable via the TIKTOKEN_HTTP_TIMEOUT environment variable:

try:
    timeout: float | None = float(os.environ.get("TIKTOKEN_HTTP_TIMEOUT", "60"))
except ValueError:
    timeout = 60.0
resp = requests.get(blobpath, timeout=timeout)

Falls back to the default if the env var can't be parsed as a float, so a malformed value can't itself crash the tokenizer download.
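One gap worth noting: values such as `0`, `-1`, `nan` or `inf` do parse as floats, so the snippet above would pass them through to `requests.get` unchanged. A minimal sketch of a stricter variant follows, assuming that silently falling back to the default is also the desired behavior for those values. The helper name `resolve_http_timeout` and the `env` parameter are hypothetical and not part of this PR.

```python
import math
import os

DEFAULT_TIMEOUT = 60.0  # seconds; the default proposed in this PR


def resolve_http_timeout(env=None) -> float:
    """Hypothetical helper: read TIKTOKEN_HTTP_TIMEOUT, falling back to 60 s."""
    # `env` is an illustrative injection point so the helper can be tested
    # without touching the real process environment.
    env = os.environ if env is None else env
    raw = env.get("TIKTOKEN_HTTP_TIMEOUT", "")
    try:
        value = float(raw)
    except ValueError:
        # Unset, empty, or unparseable: same fallback as the PR's snippet.
        return DEFAULT_TIMEOUT
    # Assumption beyond the PR: also reject NaN, infinity, zero and negatives.
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_TIMEOUT
    return value
```

With this variant the call site would reduce to `requests.get(blobpath, timeout=resolve_http_timeout())`. Whether non-positive values should fall back quietly or raise is a design choice for the maintainers.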

Test plan

  • tests/test_load.py (new) covers three regression cases — default 60s, env override, unparseable env fallback — by patching requests via sys.modules so no real network traffic is generated.
  • Confirm existing tests still pass (CI).
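  • Confirm encoding_for_model("gpt-4o") against a sinkhole IP (e.g. 127.0.0.2) with TIKTOKEN_HTTP_TIMEOUT=2 now raises a requests.exceptions.Timeout after about 2 s instead of hanging (ConnectTimeout if the connection is never established, ReadTimeout if the peer accepts but sends nothing).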
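The `sys.modules` patching approach from the first bullet can be sketched roughly as follows. This is an illustration of the technique, not the contents of `tests/test_load.py`: `fetch_blob` is a hypothetical stand-in for the HTTP branch of `read_file`, and `observed_timeout` is a hypothetical test helper. It relies on the function importing `requests` at call time, so the import resolves to whatever is registered in `sys.modules`.

```python
import os
import sys
import types
from unittest import mock


def fetch_blob(blobpath: str) -> bytes:
    """Stand-in for the patched HTTP branch of read_file (not tiktoken's code)."""
    import requests  # resolved through sys.modules at call time

    try:
        timeout = float(os.environ.get("TIKTOKEN_HTTP_TIMEOUT", "60"))
    except ValueError:
        timeout = 60.0
    resp = requests.get(blobpath, timeout=timeout)
    resp.raise_for_status()
    return resp.content


def observed_timeout(env: dict) -> float:
    """Run fetch_blob against a fake `requests` module; return the timeout it got."""
    fake_requests = types.ModuleType("requests")
    fake_requests.get = mock.Mock(return_value=mock.Mock(content=b"data"))
    # Both patches are undone when the block exits, so no state leaks out.
    with mock.patch.dict(sys.modules, {"requests": fake_requests}), \
            mock.patch.dict(os.environ, env, clear=True):
        assert fetch_blob("https://example.invalid/blob") == b"data"
    return fake_requests.get.call_args.kwargs["timeout"]
```

Calling `observed_timeout({})`, `observed_timeout({"TIKTOKEN_HTTP_TIMEOUT": "2.5"})` and `observed_timeout({"TIKTOKEN_HTTP_TIMEOUT": "not-a-number"})` exercises the three cases listed above without opening a socket.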

`read_file` calls `requests.get(blobpath)` without an explicit timeout
when fetching tokenizer data over HTTP/HTTPS. Without a timeout, the
request blocks indefinitely on DNS failures, SYN black-holes, TCP
resets, or unresponsive proxies — silently hanging
`encoding_for_model` on first use with no way to interrupt short of
killing the process.

Pass a default 60-second timeout, configurable via the
`TIKTOKEN_HTTP_TIMEOUT` environment variable. Falls back to the
default if the env var can't be parsed as a float, so a malformed
value can't crash the tokenizer download.

Add `tests/test_load.py` with three regression cases:
default, env override, and unparseable env fallback. Each replaces
`requests.get` via `sys.modules` so no real network traffic is made.