How to handle rate limits when you can't change the rate-limit in the external model provider ? #10780

manuel-koch · 2026-02-24T11:23:13Z

I just started using Claude models with Continue and immediately hit rate limit errors like the following when asking a question in the chat that involves multiple tool calls (e.g. searching the code base for a topic or issue).

{"type":"rate_limit_error","message":"This request would exceed your organization's rate limit of 10,000 input tokens per minute (org: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, model: claude-sonnet-4-6). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}

Is there a way to "slow down" Continue when it sends requests to the model provider?
I can't change the per-minute rate limit on the Claude account, but getting results more slowly would be fine for me.

Looking at recent issues, I saw many about rate limits, so I assume this is a general problem for lots of users.
How do you work around such external limitations?
Is there a configuration that stays below a given rate limit while still allowing a chat message that triggers multiple tool calls and automatic follow-up actions by the model?

omni-front · 2026-05-16T07:50:39Z

This is a bigger topic but the short version is to use a queue system to throttle requests or configure maxTokens in your config.yaml to reduce token usage per request.

s-a-s-k-i-a · 2026-05-17T05:34:06Z

Worth being precise about which rate-limit you're hitting — your error says 10,000 input tokens per minute, which is the input counter, and that's not what maxTokens controls. maxTokens caps output (response) tokens; reducing it won't lower your input token usage at all. So the levers that actually move the needle here are different.

What Continue exposes today (in config.yaml, under each model's defaultCompletionOptions and requestOptions):

models:
  - name: Claude Sonnet 4.6
    provider: anthropic
    model: claude-sonnet-4-6
    defaultCompletionOptions:
      contextLength: 100000   # ← this directly reduces input tokens per request
      maxTokens: 8000         # output cap (doesn't affect input rate)
    requestOptions:
      timeout: 600            # seconds; gives retries time on slow paths

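contextLength is the input lever. Lower it and Continue trims the prompt window before sending, so each request consumes fewer input tokens against your 10k/min budget. The value only helps once it sits below that budget: the 100000 above would still allow a single request far larger than a 10,000 tokens-per-minute cap, so for this limit it needs to be well under 10,000. Tool-call cascades amplify the issue (each step re-sends the running context), so a tighter contextLength also caps the per-step cost.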
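To make the cascade effect concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simple model: every step re-sends the running context, the context grows by a fixed amount per step, and trimming behaves like a hard cap at contextLength. The token figures and both function names are made up for illustration, and Continue's real pruning logic is not modeled.

```python
from itertools import accumulate


def cascade_costs(base_prompt, growth_per_step, steps, context_length):
    """Input tokens sent at each step of a tool-call cascade.

    Step i re-sends the running context (base prompt plus i rounds of
    tool output), capped at context_length. Hypothetical model only.
    """
    return [min(base_prompt + i * growth_per_step, context_length) for i in range(steps)]


def steps_within_budget(costs, budget):
    """How many consecutive steps fit into one minute's input budget."""
    return sum(1 for total in accumulate(costs) if total <= budget)


BUDGET = 10_000  # input tokens per minute, taken from the error message

# Illustrative numbers: 2,000-token initial prompt, 1,500 tokens added per step.
uncapped = cascade_costs(2_000, 1_500, steps=5, context_length=100_000)
capped = cascade_costs(2_000, 1_500, steps=5, context_length=3_000)

print(sum(uncapped), steps_within_budget(uncapped, BUDGET))  # 25000 2
print(sum(capped), steps_within_budget(capped, BUDGET))      # 14000 3
```

Under these assumptions the tighter cap cuts the five-step total from 25,000 to 14,000 input tokens, but the cascade still exceeds 10,000 tokens if all five steps land in the same minute. A lower contextLength reduces the pressure without removing it.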

The cost lever (and, depending on the model, also a rate-limit lever): Continue does support Anthropic prompt caching. packages/openai-adapters/src/apis/AnthropicCachingStrategies.ts injects cache_control: { type: 'ephemeral' } on appropriate message parts. Cache reads are billed at ~10% of normal input cost. Whether they also count toward your input-tokens-per-minute limit depends on the model: Anthropic's rate-limit documentation says cache reads are excluded from that limit for most current models but still counted for some older ones, so check the entry for your model. Uncached input and cache writes always count. Where cache reads are counted, caching alone won't pull you under the 10k cap, but it makes it much cheaper to move to a higher tier, which is the actual fix for "I'm hitting tier-1 limits with real work."

Model swap as a tactical fix: Anthropic sets rate limits per model within each usage tier. claude-haiku-4-5 typically has higher per-minute limits than claude-sonnet-4-6 at the same tier (the limits page in your Anthropic Console shows the exact figures for your organization). For tool-heavy Continue workflows where you don't need the strongest model on every step, configuring Haiku as a secondary model and selecting it for routine tool calls takes load off your Sonnet limit.

What Continue doesn't have: a client-side request throttle / queue. There's no requestsPerMinute knob you can set in config.yaml today. The earlier comment about a "queue system" is right in spirit but not something Continue ships; you'd have to put it in front of Continue (e.g. a reverse proxy using nginx's limit_req, or a custom OpenAI-compatible shim that throttles). Note that limit_req limits the request rate, not tokens, so it only approximates a tokens-per-minute cap. For most users, contextLength + tier upgrade is the simpler path; the proxy approach only pays off for shared / team setups.
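For anyone who does go the shim route, the core of such a throttle could look roughly like the sketch below. It is a minimal token-bucket budget over input tokens, not anything Continue or Anthropic ships. The class name TokenBudget, its methods, and the idea that your shim estimates the input token count before forwarding a request are all assumptions for illustration.

```python
import time


class TokenBudget:
    """Token bucket over input tokens: refills continuously up to a per-minute cap."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0  # tokens regained per second
        self.available = self.capacity
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Credit the tokens regained since the last check, never above capacity.
        now = self.clock()
        self.available = min(self.capacity, self.available + (now - self.last) * self.rate)
        self.last = now

    def wait_time(self, tokens):
        """Seconds to wait before a request with `tokens` input tokens fits."""
        if tokens > self.capacity:
            raise ValueError("request is larger than the whole per-minute budget")
        self._refill()
        return max(0.0, (tokens - self.available) / self.rate)

    def acquire(self, tokens, sleep=time.sleep):
        """Block until the request fits, then deduct it from the budget."""
        delay = self.wait_time(tokens)
        if delay > 0:
            sleep(delay)
        self._refill()
        self.available -= tokens
```

A shim would call budget.acquire(estimated_input_tokens) before forwarding each request upstream. With a 10,000 tokens-per-minute budget, two 4,000-token requests pass immediately and a third waits about 12 seconds. The estimate has to come from somewhere (a tokenizer or a character-count heuristic), and that part is left out here.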

Rate-limit headers from Anthropic come back in every response (anthropic-ratelimit-input-tokens-remaining, etc.). If you can see them while working (for example by tailing Continue's logs), you can tell how close each request runs to the cap, which takes much of the guesswork out of calibrating contextLength.
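If you capture those headers yourself (say, in a proxy or a small script), a sketch like the following turns them into a wait time. It assumes the headers arrive as a plain mapping with lower-case keys and that the reset header is an RFC 3339 timestamp. The function name seconds_until_fits is hypothetical. Waiting until the reported reset time is conservative, since the budget refills gradually before then.

```python
from datetime import datetime, timezone


def seconds_until_fits(headers, needed_tokens, now=None):
    """Return 0.0 if `needed_tokens` fits in the remaining input budget,
    otherwise the number of seconds until the reset time in the headers."""
    remaining = int(headers["anthropic-ratelimit-input-tokens-remaining"])
    if needed_tokens <= remaining:
        return 0.0
    # fromisoformat() on older Python versions does not accept a trailing "Z".
    reset_text = headers["anthropic-ratelimit-input-tokens-reset"].replace("Z", "+00:00")
    reset = datetime.fromisoformat(reset_text)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset - now).total_seconds())


# Example header values are invented for illustration.
example = {
    "anthropic-ratelimit-input-tokens-remaining": "1500",
    "anthropic-ratelimit-input-tokens-reset": "2026-05-17T05:35:00Z",
}
at = datetime(2026, 5, 17, 5, 34, 30, tzinfo=timezone.utc)
print(seconds_until_fits(example, 1_000, now=at))  # 0.0
print(seconds_until_fits(example, 4_000, now=at))  # 30.0
```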

1 reply

omni-front May 17, 2026

Thanks for clarifying that. To manage input tokens, consider batching requests or implementing a delay between them to stay within the limit.
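As a rough guide to how long such a delay would need to be, the arithmetic can be sketched as follows. It assumes every request carries about the same number of input tokens, which is rarely exactly true in a tool-call cascade. The function name is made up for illustration.

```python
def min_spacing_seconds(tokens_per_request, tokens_per_minute):
    """Smallest steady delay between equally sized requests that stays within
    a per-minute input token budget."""
    if tokens_per_request > tokens_per_minute:
        raise ValueError("a single request already exceeds the per-minute budget")
    return 60.0 * tokens_per_request / tokens_per_minute


print(min_spacing_seconds(4_000, 10_000))  # 24.0
```

So with a 10,000 tokens-per-minute limit and requests of roughly 4,000 input tokens each, a sustained pace of about one request every 24 seconds stays within the budget.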
