hi, i'm charles
always open to talk · discord · twitter · email · site
infra oriented engineer across fintech, web3, and tooling
building to solve. focused on correctness, auditability, automation, efficiency
currently working on low latency infra, compliance systems, evaluation harnesses
polyedge · fintech infrastructure · prediction market platform · hardening paid tiers
tokio-powered data plane axum + tower parquet react · vercel · cloudflare
fortifai · cognitive infrastructure · LLM-proctored harness · hardening falsifiability
discord interfaced sqlite claude prompt-cached streaming agnostic pipeline
___
fortifai · self-audit loop · streak 1d
self-audit: scenario-based time-pressured recall, cross-domain breadth, b3-calibrated
invariant: zero outside assistance. no docs, no ai, no peers. 10m/response, 5m/single refinement
breadth: systems/distributed, backend, sre, ml, ai/llm, frontend, data, security
bar: consistent ≥3 across all 8 swe fields
industry swe updated 2026-05-25
scope cross-domain · grab-bag duration 58m 45s
calibration b3 "practitioner" rotation bias underindexed-weighted
b₂ b3 b₄
q1 sre dns-ttl-failover ₃ 3 ₂
q2 security clickjacking ₄ 3 ₂
q3 ml-engineering train-test ₃ 3 ₂
q4 data-engineering exactly-once-sink ₃ 3 ₂
q5 ai-llm llm-as-judge ₄ 3 ₂
score (dreyfus) 1 (novice) → 3 (competent) → 5 (mastered)
band (swecom) b1 (technician) → b3 (practitioner) → b5 (principal)
q1 · sre · dns-ttl-failover-tradeoff · pre 2 → post 3 · ceiling — · transitional b1–b3
Scenario: A regional SaaS company runs api.example.com behind a primary load balancer in us-east-1. Their disaster recovery plan calls for failing over to us-west-2 by updating the DNS A record. The team has set the record's TTL to 300 seconds. During a recent us-east-1 outage, they updated DNS within 2 minutes of the incident declaration, but customer reports of errors continued for over 30 minutes — far longer than the TTL would suggest. Explain the mechanism that causes traffic to persist on a failed endpoint longer than the configured TTL, and describe the tradeoff involved in setting the TTL much lower (e.g., 30 seconds) to mitigate it. What complementary mechanism does not depend on DNS at all, and why is it generally preferred for fast failover?
Refinement: You said 'direct sockets/control over the piping not reliant on dns propogation'. Clarify: what specific property of this mechanism allows a load balancer or traffic manager to detect a failed endpoint and redirect connections without waiting for any cache to expire?
Assessment: The response identifies the right problem space (caching / propagation / connectivity probing) but names the wrong mechanism for DNS failover persistence: it attributes traffic persistence to hop-by-hop propagation through a network topology rather than to downstream resolver and client-side cache non-compliance with the authoritative TTL. The refinement correctly observes that a TCP handshake fails closed regardless of DNS cache state — a valid property of the complementary mechanism — but never names the architectural pattern that actually delivers fast failover at the named layer. The cost axis of lowering TTL is framed quantitatively but through an incorrect model. The gap is in knowing which layer of the resolution-and-routing stack actually holds traffic past TTL, and what architectural construct removes DNS from the failover path entirely.
Literature
- [remediation] Site Reliability Engineering: How Google Runs Production Systems — Ch. 19 'Load Balancing at the Frontend' — full chapter, with particular attention to §DNS and §Virtual IP Addresses — ~45m
- [remediation] DNS and BIND, 5th Edition — Ch. 10 §TTLs and Negative Caching — the chapter on how resolvers actually treat TTL values, including the cases where TTLs are ignored or clamped — ~30m
q2 · security · clickjacking · pre 2 → post 3 · ceiling b2 · transitional b3
Scenario: A consumer banking web app exposes a 'Transfer Funds' page at /transfer that uses session cookies for authentication and a synchronizer CSRF token on the form. A security reviewer reports that an attacker can host a page on evil.example that loads /transfer in an invisible iframe, overlays decoy UI on top, and tricks a logged-in victim into clicking 'Confirm Transfer' through the overlay. Explain why the existing CSRF token does not defend against this attack, name the vulnerability class, and identify the specific HTTP response header (and its directive) that is the primary structural mitigation. Compare this header against the older X-Frame-Options header — what does the newer mechanism give you that the older one does not?
Refinement: You said 'the request does come from the "resource" that is legitimate aka the transfer page'. Clarify: what property of the CSRF token mechanism makes it blind to whether the legitimate page is rendered inside an attacker-controlled frame versus loaded directly by the user?
Assessment: The answer reached the correct intuition that CSRF tokens are structurally blind to the rendering context, and the refinement articulated that intuition cleanly (the token has no reconciliation with the root render or call-stack provenance — only that the user holds the session). What is missing is the standard vocabulary: the vulnerability class (clickjacking / UI redress), the specific header and directive that the question explicitly asked for, and the comparison against the older header. The answer also conflates CSP with CORS and SameSite, which are distinct mitigations for distinct threat classes — a B3 reviewer needs the primitives named precisely. The reading should land on the OWASP clickjacking taxonomy and the W3C CSP Level 2 definition of frame-ancestors.
Literature
- [remediation] OWASP Clickjacking Defense Cheat Sheet — §Defending with Content Security Policy frame-ancestors directive AND §Defending with X-Frame-Options Response Headers — read these two sections to get the named vulnerability class, the exact header/directive, and the XFO-vs-frame-ancestors comparison the question asked for — ~20m
- [remediation] Content Security Policy Level 3 (W3C Working Draft) — frame-ancestors directive — §6.1 frame-ancestors — the directive grammar (source-list with 'none', 'self', host-source, scheme-source), the rule that frame-ancestors takes precedence over X-Frame-Options when both are present, and why it is the standardized replacement — ~15m
q3 · ml-engineering · train-test-contamination · pre 3 → post 3 · ceiling b1 · transitional b2–b3
Scenario: A medical imaging team is training a binary classifier to detect a condition from chest X-rays. Their dataset contains 50,000 images from 8,000 patients (some patients have multiple images taken weeks apart). They use a standard random 80/20 train/test split at the image level and report 0.94 AUC on the held-out test set. When the model is deployed to a new hospital, AUC drops to 0.71. Explain the specific data-handling mechanism that inflated the offline metric, why the random image-level split was the wrong primitive for this dataset, and what splitting strategy you would commit to instead. Name one scikit-learn primitive that implements the correct strategy, and explain what residual generalization gap remains even after fixing the split.
Refinement: You said 'the sampling of 80-20 could have been reinforcing same patient/multiple image thus bias in the model'. Clarify: what specific statistical relationship between a patient's training images and that same patient's test images causes the AUC to be inflated, rather than merely unrepresentative?
Assessment: The response correctly identified patient-level as the right grouping unit but, when probed for the specific statistical mechanism that inflates AUC, defaulted to a representation/oversampling explanation rather than the dependence-and-leakage explanation the question targeted. The refinement was an opportunity to recover by naming patient-identity confounding (shared anatomy, positioning, scanner artifacts) as the channel through which train-set information bleeds into the test set; instead the answer doubled down on the oversampling framing and invented private terminology ('layer 1/layer 2') in place of canonical vocabulary. The sklearn primitive (GroupKFold/GroupShuffleSplit) was not retrieved, and the residual generalization gap was described abstractly rather than as concrete cross-site/cross-scanner domain shift.
Literature
- [remediation] Data Leakage in Machine Learning: A Survey — §4 'Train-test contamination' and §4.2 'Group leakage' — specifically the medical-imaging worked example where multiple scans per patient appear on both sides of the split, and the formal statement that the unit of statistical independence is the entity (patient), not the record (image). — ~45m
- [remediation] scikit-learn User Guide §3.1 Cross-validation: GroupKFold and StratifiedGroupKFold — The 'Cross-validation iterators for grouped data' subsection — definition of GroupKFold, GroupShuffleSplit, and StratifiedGroupKFold; the
groupsparameter contract; and the worked example where each sample's group label (e.g., patient_id) forces all of that group's samples into the same fold. — ~20m
q4 · data-engineering · exactly-once-sink · pre 2 → post 3 · ceiling — · transitional b1–b3
Scenario: A data engineering team operates a Kafka → Spark Structured Streaming → Postgres pipeline that aggregates event counts per user per hour and upserts them into an hourly_user_events table. They observe that after consumer restarts (e.g., during deploys or node failures), some hourly counts in Postgres are higher than ground truth — duplicates have leaked through despite the streaming job using checkpointing. The team's current code pattern is: read batch → write to Postgres via JDBC → commit Kafka offsets. Explain the specific failure-window mechanism that produces these duplicates, why checkpointing alone does not give exactly-once semantics across a heterogeneous sink, and what concrete change to the write logic would make the pipeline effectively-once at the Postgres sink. What property must the sink write possess for your fix to work?
Refinement: You said 'make the commit not at kafka level, but at postgres outbox level'. Clarify: what property of the Postgres write operation itself prevents a duplicate upsert from landing when the job replays the same batch after a crash between the write and the offset commit?
Assessment: The answer locates the failure window at the Postgres-write / Kafka-offset-commit boundary and the refinement surfaces ON CONFLICT as the relevant Postgres primitive, but the reasoning around why that primitive closes the replay gap is missing: the response frames the fix as a control-flow inversion ('commit pre-approved at PG level, kafka simply confirms') rather than as an idempotency property on a deterministic key with replace-not-add merge semantics. The standard merge-upsert vocabulary and the explanation of why heterogeneous sinks cannot be covered by checkpoint atomicity alone are the gaps. The refinement narrowed the gap from 'outbox label' to 'ON CONFLICT keyword' but did not close it to a stated mechanism-under-pressure argument.
Literature
- [remediation] Designing Data-Intensive Applications — Ch. 11 §End-to-end argument for databases (pp. 516–520) and §Idempotence (pp. 478–479) — ~1h 30m
- [remediation] Spark Structured Streaming Programming Guide — Output Sinks and Fault Tolerance Semantics — §Fault Tolerance Semantics and §Using Foreach and ForeachBatch (idempotent JDBC sink pattern) — ~30m
q5 · ai-llm · llm-as-judge-evaluation · pre 3 → post 3 · ceiling b2 · transitional b3
Scenario: A team building a customer support chatbot wants to compare two candidate LLM-backed answer generators (Model A and Model B) on 500 historical support tickets. Without budget for human labeling, they set up an LLM-as-judge pipeline: for each ticket, both A's and B's answers are sent to GPT-4 with a prompt asking 'Which answer is better, Answer 1 or Answer 2?' GPT-4 returns the winner, and they aggregate win rates. A skeptical PM points out the results may be untrustworthy. Identify at least two specific, named biases that LLM-as-judge introduces in this setup, explain the mechanism by which each distorts the win rate, and describe concrete mitigations for each. What is the principled fallback — what kind of evaluation does an LLM-as-judge pipeline approximate, and where does the approximation break down?
Refinement: You said 'pinning the LLM that is judgeing to only focus on the second position allows increased accuracy and reduces hallucination for some reason'. Clarify: what mechanism causes the second-position anchor to reduce bias rather than simply shifting the same bias to consistently favor the second answer?
Assessment: The response correctly names position bias and rubric-ambiguity as concerns and, under refinement, arrives at the structurally correct position-swap mitigation by reasoning (run both orderings, apply the bias symmetrically, then aggregate) — this matches the published Zheng et al. fix even though the pre-refinement version misstated it as 'always judge the second position'. The substantive gaps are: (1) only two biases surfaced when the canonical taxonomy includes at least position, verbosity, and self-enhancement; (2) the principled-fallback half of the question — what LLM-as-judge approximates and where the approximation breaks — is answered abstractly ('coherence to invariants') rather than concretely (it approximates a human preference panel, and breaks under correlated judge/candidate errors and when the preference signal diverges from the production outcome metric the PM actually cares about); (3) the PM's underlying decision question — can we trust this win rate to pick a model — is never addressed with the construct-validity caveat the situation requires.
Literature
- [remediation] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — §3 'LLM-as-a-Judge' and §4 'Limitations and Biases' — position bias, verbosity bias, self-enhancement bias, and the position-swap consistency aggregation — ~1h 15m
- [remediation] Evaluating Large Language Models: A Comprehensive Survey — §4 'Where to Evaluate' and §5 'How to Evaluate' — pairwise vs reference-based evaluation, construct validity, and when LLM-judge win rates diverge from human preference and from downstream task metrics — ~45m
widget regenerated from fortifai's data/session.json via profile_widget.py



