From cce96b47a8324652e11d86528a1f20914fe533cb Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 22 Apr 2026 22:56:38 +0000 Subject: [PATCH] Add AUDIT.md: live trading bias and execution fidelity audit Agent-Logs-Url: https://github.com/nxd914/kinzie/sessions/22dc0867-2211-48fe-8ff5-df648624a3b2 Co-authored-by: nxd914 <214264706+nxd914@users.noreply.github.com> --- AUDIT.md | 174 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 174 insertions(+) create mode 100644 AUDIT.md diff --git a/AUDIT.md b/AUDIT.md new file mode 100644 index 0000000..7a2df02 --- /dev/null +++ b/AUDIT.md @@ -0,0 +1,174 @@ +# Kinzie — Walk-Forward & Look-Ahead Bias Audit + +**Context:** 57 paper fills, +$64,732 total P&L, 96.4% win rate, XRP/SOL expansion active. +**Question:** Would this performance hold if we converted to live trading? + +--- + +## Structural finding: the model math is causally clean + +The core signal pipeline has no look-ahead bias. Every input to `spot_to_implied_prob()` at decision time is strictly backward-looking: +- Spot price: live WebSocket tick +- Realized vol: Welford rolling window over the past 60s/15min of historical ticks +- Strike/expiry: Kalshi market metadata known at scan time + +Welford's algorithm processes ticks sequentially with no peek-ahead. No look-ahead bias in the math itself. + +The concerns below are all about **live vs. paper execution divergence**, not model math. + +--- + +## Concerns — ordered by severity + +### 1. Paper execution has no slippage model *(MOST SIGNIFICANT)* + +**File:** `agents/execution_agent.py` lines 69–72 + +```python +fill_price = ( + opp.market.yes_ask if opp.side == Side.YES + else opp.market.no_ask +) +``` + +Paper fills are guaranteed at the **quoted ask, instantly, for any size**. In live trading: + +- You're entering on a momentum signal — the price is already moving. The Kalshi ask may already be lifted by the time your REST order hits the book. +- Kalshi crypto markets are thin. A $10k position at a 30¢ YES ask = 33,333 contracts. If there are only 5,000 at 30¢, you walk up the book. +- This single gap is likely responsible for 2–4% of the apparent per-trade edge in paper results. + +**What to fix:** Model slippage in paper mode. Either simulate a fill at `(ask + k * spread)` for some constant `k`, or cap contracts filled at a simulated depth and book the remainder at a worse price. + +--- + +### 2. `BRACKET_CALIBRATION = 0.55` was fit to paper outcomes *(MEDIUM)* + +**File:** `core/pricing.py` line 22 and `core/config.py` lines 149–154 + +> "Lowered from 0.70 after -$31k paper loss analysis (model=0.81 vs market=0.51 on ATM bracket)." + +This constant was tuned post-hoc on one observed paper loss event, then applied going forward on the *same dataset* being evaluated. Any resolution metrics that include bracket trades after the calibration adjustment are partially in-sample. The comment itself acknowledges: "Needs 50+ bracket fills to validate statistically." The system has 57 *total* fills, not 50 bracket fills. + +**What to fix:** Track bracket fills separately. Do not count bracket trades placed after the `0.55` calibration change as validation of that change. The calibration is provisional until 50+ bracket-specific fills. + +--- + +### 3. Resolution heuristic can prematurely book wins *(MEDIUM)* + +**File:** `agents/resolution_agent.py` lines 345–368 + +```python +if yes_bid >= 0.99: # → resolve YES +if yes_ask <= 0.01: # → resolve NO +# Post-close heuristic: +if implied >= 0.95: # → resolve YES +elif implied <= 0.05: # → resolve NO +``` + +These price-level heuristics run *before* the authoritative `status == "settled" and result` path and fire on every polling cycle. A crypto-volatile market can briefly spike to `yes_bid=0.99` without having settled. If such a tick triggers a premature YES resolution on a winning position, the trade is booked as a win — and if it eventually settles NO, the position is already gone from the DB. + +**What to fix:** Require `status == "settled"` before applying any price-level heuristic. The heuristics should only run as a fallback for confirmed-settled markets where the `result` field is missing, not as a primary path. + +--- + +### 4. Signal age gate is effectively disabled in paper mode *(MEDIUM)* + +**File:** `agents/risk_agent.py` lines 159–165 + +```python +signal_age = (now - opp.signal.timestamp).total_seconds() +if signal_age > self._cfg.max_signal_age_seconds: # 2.0s + return None +``` + +In paper mode, asyncio processes this synchronously at near-zero latency — every signal has age ~0ms. In live mode, add: +- KalshiClient REST round-trip to place an order: ~200–500ms +- RSA-PSS signing: ~1ms +- Kalshi order processing: ~100–300ms +- Signal queue backpressure on volatile days + +The 2-second gate may reject a meaningful fraction of live opportunities that paper always counts as filled. This means the effective fill rate and P&L in paper overstates what live would achieve. + +**What to fix:** Simulate realistic execution latency in paper mode (even a synthetic 300ms delay in `_paper_order`) so the 2-second gate applies proportionally. + +--- + +### 5. The `replay_backtest` is not a walk-forward test *(STRUCTURAL)* + +**File:** `research/replay_backtest.py` + +The replay backtest reads the DB and recomputes model probabilities for past trades. This is a *calibration consistency check*, not an out-of-sample test. It can tell you whether the model's math is consistent with what was logged, but it cannot detect overfitting because: + +- Every trade in the DB was generated by the live model +- The model and the backtest use identical parameters +- There is no held-out test set, no time-based train/test split, and no parameter set evaluated on data it did not influence + +**What to fix:** To do a real walk-forward test, split the trade history at some cutoff date. Fit any adjustable parameter (`BRACKET_CALIBRATION`, `MIN_EDGE`, etc.) only on data before the cutoff, then evaluate on data after. Currently impossible because there is only one parameter that changed mid-run (`BRACKET_CALIBRATION`), and the change date is not recorded in the DB. + +--- + +### 6. Position sizing ignores actual book depth *(MEDIUM)* + +**File:** `agents/risk_agent.py` and `core/kelly.py` + +`position_size()` allocates up to 10% of $100k = $10,000 per position. There is no check that Kalshi's order book has $10k of liquidity at the quoted ask. The `liquidity` field is stored on `KalshiMarket` and logged, but it is never used as an upper bound on position size. + +Live fills at scale will execute at worse average prices than the quoted ask, making Kelly fractions calculated against the ask systematically optimistic. + +**What to fix:** Cap `size_usdc` at `min(kelly_size, market.liquidity * LIQUIDITY_FRACTION)` in `RiskAgent._evaluate()`. A conservative fraction (e.g. 0.20) of available liquidity prevents walking the book. + +--- + +### 7. Sharpe annualization assumption is arbitrary *(MINOR)* + +**File:** `core/config.py` line 169 + +```python +assumed_fills_per_day: int = 4 +``` + +The Sharpe annualization in `ResolutionAgent._running_sharpe()` uses `sqrt(4 * 365)`. If actual fill cadence is higher (the system reports fills every ~42 seconds during active trading), this underestimates the annualization factor. The Sharpe number is effectively meaningless at n=57 regardless — the 95% CI on a Sharpe estimate at n=57 with Sharpe≈2 is approximately [0.5, 3.5]. + +**What to fix:** Compute actual fill cadence from `placed_at` timestamps and use that for annualization, or switch to calendar-time Sharpe (daily P&L buckets as in `edge_analysis.py`) which is independent of fill rate assumptions. + +--- + +### 8. XRP/SOL expansion is not out-of-sample *(MINOR)* + +`BRACKET_CALIBRATION`, `MIN_EDGE`, `MIN_CRYPTO_VOL`, and all thresholds were set based on BTC/ETH experience. XRP/SOL now contribute trades under those same parameters. XRP has different volatility dynamics (thinner order book, different TWAP settlement behavior) than BTC/ETH. Parameter transfer from BTC/ETH to XRP/SOL is untested. Positive XRP results so far are consistent with luck at low n. + +**What to fix:** Track per-symbol win rate and P&L in `edge_analysis.py` to detect if XRP/SOL are performing differently from BTC/ETH. Add a `min_symbol_fills_before_trust: int = 20` config param and log a warning when a symbol is below that threshold. + +--- + +## What the 96.4% win rate actually means + +At a minimum edge of 4% over an average Kalshi mid of ~0.70, you'd expect model-implied win rates of ~74%. A 96.4% paper win rate over 57 fills decomposes as: + +| Source | Contribution | +|--------|-------------| +| Genuine model edge | Moderate — latency arb is real | +| Lucky variance (n=57) | High — this is the dominant explanation | +| Paper fill at ask with zero slippage | Captures full stated edge that live won't | +| Resolution heuristic booking early wins | Small but non-zero | + +The latency arbitrage thesis is structurally sound — Kalshi does lag CEX spot — but paper results are optimistic on every dimension that matters for live execution. + +--- + +## Summary: expected live vs. paper divergence + +| Concern | Estimated live degradation | +|---------|---------------------------| +| Slippage on momentum entries | 2–4% per-trade edge reduction | +| Signal age rejections at live latency | 15–30% fewer fills approved | +| Adverse selection (market already moved) | Edge on signal-triggered fills shrinks most | +| Resolution heuristic risk | Inflates paper win rate, magnitude unknown | +| Position sizing vs. book depth | Larger positions execute at worse average prices | +| `BRACKET_CALIBRATION` overfitting | Bracket P&L may regress at more fills | + +**Recommendation:** Do not convert to live until: +1. Slippage simulation is added to paper mode and win rate remains above 70% under realistic fill assumptions +2. 100+ fills are accumulated (system requirement) with calibrated Sharpe ≥ 1.0 +3. Bracket fills are tracked separately and `BRACKET_CALIBRATION` is validated on 50+ bracket-specific fills +4. Position size is capped relative to available book liquidity