LiteLLM reliability: hot-reload, zero-downtime restarts, and single-replica fragility #321

@bussyjd


Problem

LiteLLM is a single point of failure in the stack. Every configuration change (`obol model setup`, provider addition) requires a full pod restart, causing complete inference downtime. During `obol stack up`, LiteLLM is restarted 2-3 times.

Current issues

  1. Single replica — 1 pod, no PodDisruptionBudget. Every restart = full downtime (30s-5min depending on image pull)
  2. No hot-reload — LiteLLM does not watch `config.yaml` for changes. The config is patched via ConfigMap, then a `kubectl rollout restart` is required
  3. Non-fatal rollout timeout — `RestartLiteLLM()` returns success even when the 90s rollout times out, silently leaving LiteLLM in a broken state
  4. `drop_params: true` — silently drops request parameters that don't match the downstream provider schema, making debugging difficult
  5. No Reloader annotation — Secret changes (API key rotation) don't trigger a restart automatically
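For reference, the manual flow that issues 1-2 force today looks roughly like this (resource names and the 90s timeout follow the issue text; the exact commands are a sketch, not the stack's actual tooling):

```
# Patch the rendered config into the ConfigMap...
kubectl create configmap litellm-config --from-file=config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -

# ...then bounce the single replica, taking inference down until the
# new pod is Ready (or the rollout times out)
kubectl rollout restart deployment/litellm
kubectl rollout status deployment/litellm --timeout=90s
```

With one replica and no PDB, the window between SIGTERM and the new pod passing readiness is complete downtime, which is what issue 1 describes.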

Impact

  • Agent chat is unavailable during every `obol model setup` or provider configuration change
  • Initial `obol stack up` incurs 270s+ of intermittent LiteLLM downtime
  • Silent parameter loss makes cross-provider routing unreliable

Solution

Implemented in #320

  1. Hot-add via `/model/new` API — model-only changes are applied immediately via LiteLLM's in-memory router API. The ConfigMap is still patched for persistence. A restart is only needed for API key changes (Secret mount).
  2. 2 replicas + RollingUpdate — `maxUnavailable: 0, maxSurge: 1` ensures a new pod is ready before any old pod terminates
  3. PodDisruptionBudget — `minAvailable: 1` prevents both replicas from being down simultaneously
  4. preStop hook — a 10s sleep before SIGTERM gives the EndpointSlice time to deregister the pod
  5. Reloader annotation — `secret.reloader.stakater.com/reload: litellm-secrets` triggers a rolling restart on Secret changes (API key rotation)
  6. `terminationGracePeriodSeconds: 60` — gives long inference requests time to complete
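The hot-add path in item 1 can be exercised directly against the running proxy. A sketch, assuming the proxy is reachable at `localhost:4000` and the request is authorized with the proxy master key; the model names and env var are placeholders, not values from this stack:

```
# Register a model with the in-memory router — no pod restart needed.
# $LITELLM_MASTER_KEY and the model/provider values are assumptions.
curl -sS http://localhost:4000/model/new \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model_name": "gpt-4o",
        "litellm_params": {
          "model": "openai/gpt-4o",
          "api_key": "os.environ/OPENAI_API_KEY"
        }
      }'
```

Since `/model/new` only mutates the in-memory router, the ConfigMap patch is still required so the model survives the next restart, as item 1 notes.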
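Items 2-6 map onto a handful of manifest fields. A condensed sketch of the relevant parts (the `litellm` and `litellm-secrets` names follow the issue; labels and the rest of the manifest are elided):

```
# Deployment (abridged): zero-downtime rollout settings
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
  annotations:
    # Reloader: rolling restart when the mounted Secret changes
    secret.reloader.stakater.com/reload: "litellm-secrets"
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take an old pod down before a new one is Ready
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 60   # let long inference requests drain
      containers:
        - name: litellm
          lifecycle:
            preStop:
              exec:
                # give the EndpointSlice time to deregister before SIGTERM
                command: ["sleep", "10"]
---
# PDB: voluntary disruptions may never empty the deployment
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: litellm
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: litellm
```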

Not yet addressed

  • `drop_params: true` behavior (needs per-model investigation)
  • ConfigMap size validation
  • Horizontal pod autoscaling for high concurrency
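If the per-model investigation concludes that only some providers tolerate unknown parameters, `drop_params` can be narrowed from a global setting to individual model entries in the LiteLLM config. A sketch; the model entries are placeholders, not this stack's actual config:

```
litellm_settings:
  drop_params: false        # surface schema mismatches as errors by default
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      drop_params: true     # opt in only where silent dropping is acceptable
```

This keeps cross-provider routing debuggable by default while preserving the lenient behavior where it is known to be safe.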
