[experiments] Daily Experiment Report — 2026-05-13 #31896
Closed
Replies: 1 comment
This discussion has been marked as outdated by daily-experiment-report. A newer discussion is available at Discussion #32085.
🧪 Daily Experiment Report — 2026-05-13
8 experiments analysed across 8 workflows, covering ~42 runs collected from git branch state. 0 reached statistical significance (p < 0.05). Recommendations at a glance: 7 × EXTEND (insufficient data), 1 × ABANDON (guardrail failure in `daily-astrostylelite-markdown-spellcheck`).
`prompt_style` · `daily-astrostylelite-markdown-spellcheck` · 🔴 ABANDON
The `detailed` (control) variant fails the declared `run_success_rate ≥ 0.90` guardrail at 85.7% over 7 runs. The `concise` variant has only 1 run (insufficient to evaluate). Investigate the root cause of the failed run before continuing this experiment.
`prompt_style` · `daily-community-attribution` · 🟡 EXTEND
Both variants show 100% success rate over 4 runs each. `verbose` runs ~126s faster on average (399s vs 525s). Cannot compute success-rate p-value when both proportions equal 1.0. Need more runs.
`output_format` · `daily-issues-report` · 🟡 EXTEND
`prompt_style` · `daily-news` · 🟡 EXTEND
Only the `detailed` variant has runs (2 failures). `concise` has 0 runs. Very early stage.
`reasoning_depth` · `daily-security-red-team` · 🟡 EXTEND
Only 1 run total, assigned to `iterative`. Experiment just started.
`output_format` · `deep-report` · 🟡 EXTEND
All 5 runs succeeded. Means nearly identical (762s vs 770s). Wide CIs due to small n. `annotated_brief` (3rd variant) has 0 runs so far.
`prompt_style` · `issue-arborist` · 🟡 EXTEND
Only 1 run (failure) for `concise`. Experiment just started.
`caveman` · `smoke-copilot` · 🟡 EXTEND
Most balanced experiment: 5 correlated runs per variant (8 per variant in state; 6 total uncorrelated). `yes` leads at 80% vs `no` at 40% success, but p=0.197, which is not significant with the current sample size. Wide duration CIs reflect high variance (both variants occasionally have very long runs >30 min). Needs 15 more runs per variant before conclusions can be drawn.
📊 Summary Table
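As a cross-check on the success-rate figures reported for the individual experiments, the sketch below applies a pooled two-proportion z-test. The report does not say which test it uses, so the choice of test is an assumption; it is picked here because it is consistent with the p=0.197 quoted for `smoke-copilot`. The success counts (4 of 5 and 2 of 5) are inferred from the stated 80% and 40% rates over 5 correlated runs per variant, and the function name `two_proportion_p_value` is hypothetical.

```python
import math


def two_proportion_p_value(successes_a, runs_a, successes_b, runs_b):
    """Two-sided p-value from a pooled two-proportion z-test.

    Returns None when the pooled standard error is zero (for example,
    when both variants succeed on every run), because z is undefined.
    """
    rate_a = successes_a / runs_a
    rate_b = successes_b / runs_b
    pooled = (successes_a + successes_b) / (runs_a + runs_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / runs_a + 1 / runs_b))
    if std_err == 0:
        return None
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# smoke-copilot: counts inferred from 80% vs 40% over 5 runs each (assumption).
smoke_copilot_p = two_proportion_p_value(4, 5, 2, 5)

# daily-community-attribution: both variants at 100% over 4 runs each.
community_attribution_p = two_proportion_p_value(4, 4, 4, 4)

print(round(smoke_copilot_p, 3))
print(community_attribution_p)
```

Under these assumptions the first call gives a value that rounds to 0.197, and the second returns `None`, which mirrors the note that no success-rate p-value can be computed when both proportions equal 1.0. With samples this small an exact test would give a different, more conservative value, so this should be read as a plausibility check and not as the report's actual method.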
Warning
Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:
proxy.golang.org
See Network Configuration for more information.