Add HTTP timeout to read_file network fetch#539
Open
Spectual wants to merge 1 commit into
Open
Conversation
`read_file` calls `requests.get(blobpath)` without an explicit timeout when fetching tokenizer data over HTTP/HTTPS. Without a timeout, the request blocks indefinitely on DNS failures, SYN black-holes, TCP resets, or unresponsive proxies — silently hanging `encoding_for_model` on first use with no way to interrupt short of killing the process. Pass a default 60-second timeout, configurable via the `TIKTOKEN_HTTP_TIMEOUT` environment variable. Falls back to the default if the env var can't be parsed as a float, so a malformed value can't crash the tokenizer download. Add `tests/test_load.py` with three regression cases: default, env override, and unparseable env fallback. Each replaces `requests.get` via `sys.modules` so no real network traffic is made.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
read_fileintiktoken/load.pycallsrequests.get(blobpath)without atimeout=argument when fetching tokenizer data over HTTP/HTTPS:requestsdocuments this as a footgun in production code:In tiktoken's case, the failure mode is
encoding_for_model("gpt-4o")(or any other not-yet-cached model) silently hanging on first use whenever the network path toopenaipublic.blob.core.windows.netis unhealthy — DNS resolution stalls, SYN black-holes, captive portals, broken corporate proxies, mid-flight TCP resets. The user sees a wedged process with no traceback and no way to interrupt short ofCtrl-C/SIGKILL.This is independent of #514 (
TIKTOKEN_OFFLINE): that PR short-circuits before any network call when the user opts in to offline mode. This PR fixes the case where the network call does happen but the peer doesn't respond.Fix
Pass an explicit timeout (default 60 s) to
requests.get, configurable via theTIKTOKEN_HTTP_TIMEOUTenvironment variable:Falls back to the default if the env var can't be parsed as a float, so a malformed value can't itself crash the tokenizer download.
Test plan
tests/test_load.py(new) covers three regression cases — default 60s, env override, unparseable env fallback — by patchingrequestsviasys.modulesso no real network traffic is generated.encoding_for_model("gpt-4o")against a sinkhole IP (e.g.127.0.0.2) now raisesrequests.exceptions.ReadTimeoutafterTIKTOKEN_HTTP_TIMEOUT=2instead of hanging.