perf: bound token-validation cost on adversarial input #121

Open
rmyndharis wants to merge 1 commit into
google-labs-code:main from
rmyndharis:fix/harden-token-parsing-cost

@rmyndharis

Summary

The CLI parses arbitrary user-supplied markdown/YAML (file or stdin). Three validation paths can be driven into super-linear CPU or unbounded recursion by a crafted DESIGN.md. Each fix is a small guard that leaves results for legitimate input unchanged.

1. Quadratic backtracking in dimension validation

parseDimensionParts (model/spec.ts) uses /^(-?\d*\.?\d+)([a-zA-Z%]+)$/; the same shape is in token-like-ignored's CSS_DIMENSION_RE. On a long all-digit string there is no unit to match, so the engine retries every split point between \d* and \d+ before failing (clean O(n²), measurable seconds at ~80k chars). Real CSS dimensions are a handful of characters, so a leaf value over 64 chars is now rejected before the regex runs.
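A minimal sketch of the length guard, for illustration only. The regex and the 64-char cap come from the description above; the names `MAX_DIMENSION_LENGTH`, `DimensionParts`, and `parseDimensionSketch` are hypothetical and the real `parseDimensionParts` may return a different shape.

```typescript
// Hypothetical names; only the regex and the 64-char cap are taken from the PR text.
const MAX_DIMENSION_LENGTH = 64;
const DIMENSION_RE = /^(-?\d*\.?\d+)([a-zA-Z%]+)$/;

interface DimensionParts {
  value: number;
  unit: string;
}

function parseDimensionSketch(raw: string): DimensionParts | null {
  // Constant-time length check first, so over-long input never reaches
  // the regex and its quadratic backtracking path.
  if (raw.length > MAX_DIMENSION_LENGTH) return null;

  const match = DIMENSION_RE.exec(raw);
  if (match === null) return null;

  const [, num = "", unit = ""] = match;
  return { value: Number(num), unit };
}
```

Under these assumptions, `parseDimensionSketch("16px")` still yields `{ value: 16, unit: "px" }`, while a 100k-digit string returns `null` without the regex ever running.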

2. Unbounded Levenshtein cost for unknown keys

unknown-key builds a full (m+1)×(n+1) DP matrix against every schema key for each unknown top-level key, with no length guard. Since edit distance is always ≥ the length difference, a key whose length differs from a schema key by more than MAX_TYPO_DISTANCE can never be a typo — those comparisons are now skipped. Suggestions are provably unchanged.
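The length-difference argument can be sketched as follows. This is not the rule's actual code: `suggestKeysSketch` and `levenshtein` are hypothetical helpers, the value `2` for `MAX_TYPO_DISTANCE` is an assumption (the PR names the constant but does not state its value), and the DP here keeps two rows rather than the full matrix purely for brevity.

```typescript
// Assumed threshold; the PR does not state the real value of MAX_TYPO_DISTANCE.
const MAX_TYPO_DISTANCE = 2;

// Plain Levenshtein distance, two-row variant of the O(m*n) DP.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charCodeAt(i - 1) === b.charCodeAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical suggestion helper: returns schema keys within the typo threshold.
function suggestKeysSketch(unknownKey: string, schemaKeys: string[]): string[] {
  return schemaKeys.filter((key) => {
    // distance(a, b) >= |len(a) - len(b)|, so a larger length gap can never
    // pass the threshold; skip the DP entirely for those pairs.
    if (Math.abs(key.length - unknownKey.length) > MAX_TYPO_DISTANCE) return false;
    return levenshtein(unknownKey, key) <= MAX_TYPO_DISTANCE;
  });
}
```

Because the skipped pairs would have failed the threshold anyway, the filter returns the same keys with or without the guard; the guard only removes the cost of running the DP on a very long unknown key.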

3. Unbounded color-mix() recursion

parseCssColor recurses into each inner color of a color-mix() with no depth bound; a deeply nested value throws RangeError: Maximum call stack size exceeded, which the model's catch-all turns into a single generic error that discards every other finding for the file. A depth counter (cap 32) makes an over-deep value resolve to a normal "invalid color" finding instead, leaving the rest of the model intact.
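A toy illustration of threading a depth counter through a recursive color check. The cap of 32 comes from the description above; everything else is an assumption: `isValidColorSketch` and `splitTopLevel` are hypothetical, the grammar is reduced to hex colors and `color-mix(<space>, <color>, <color>)` without percentages, and the exact boundary (whether depth 32 itself is accepted) may differ from the real `parseCssColor`.

```typescript
// Cap taken from the PR text; the names and the simplified grammar are hypothetical.
const MAX_COLOR_MIX_DEPTH = 32;

// Split a function's argument list on commas that are not inside parentheses.
function splitTopLevel(args: string): string[] {
  const parts: string[] = [];
  let level = 0;
  let start = 0;
  for (let i = 0; i < args.length; i++) {
    const ch = args[i];
    if (ch === "(") level++;
    else if (ch === ")") level--;
    else if (ch === "," && level === 0) {
      parts.push(args.slice(start, i).trim());
      start = i + 1;
    }
  }
  parts.push(args.slice(start).trim());
  return parts;
}

function isValidColorSketch(input: string, depth = 0): boolean {
  // Over-deep nesting is reported as an ordinary invalid color instead of
  // letting the recursion run until the call stack overflows.
  if (depth > MAX_COLOR_MIX_DEPTH) return false;

  const value = input.trim();
  if (/^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})$/.test(value)) return true;
  if (!value.startsWith("color-mix(") || !value.endsWith(")")) return false;

  // Expect exactly: color space, first color, second color.
  const parts = splitTopLevel(value.slice("color-mix(".length, -1));
  if (parts.length !== 3) return false;
  return parts.slice(1).every((part) => isValidColorSketch(part, depth + 1));
}
```

Since the over-deep case is just a `false` return for that one value, a caller validating several tokens can report it alongside the results for sibling tokens rather than losing them all to a thrown `RangeError`.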

Testing

  • bun test: 285 pass, 1 skip, 0 fail (added 3 tests). The suite completes in ~0.3s, including a 100k-char dimension, a 50k-char unknown key, and a 50-deep color-mix — all previously slow/throwing, now instant/graceful.
  • bun run lint (tsc --noEmit): clean.
  • The new color-mix test asserts the over-deep value is rejected per-token and that a sibling valid color still resolves (i.e. no model-wide collapse).

Three small guards so a hostile DESIGN.md cannot pin CPU or exhaust the
call stack. All inputs are at the documented untrusted boundary (arbitrary
file/stdin), and none of the changes alter results for legitimate input.

- parseDimensionParts (and token-like-ignored's CSS_DIMENSION_RE) backtrack
  quadratically on long all-digit strings. Cap value length to 64 chars
  before matching; real CSS dimensions are far shorter.
- unknown-key runs an O(n*m) Levenshtein DP against every schema key for
  each unknown key. Skip a schema key whose length differs by more than the
  typo threshold — edit distance is at least the length difference, so the
  set of suggestions is unchanged.
- parseCssColor recurses for nested color-mix() with no depth bound. Thread
  a depth counter and stop at 32, so an over-deep value resolves to an
  invalid color (a precise error finding) instead of a RangeError that
  collapses the whole model build.