[experiments] Daily Experiment Report — 2026-05-13 #31896

2026-05-13T09:07:39Z

github-actions[bot]
Bot May 13, 2026

🧪 Daily Experiment Report — 2026-05-13

8 experiments analysed across 8 workflows, covering ~42 runs collected from git branch state. 0 reached statistical significance (p < 0.05). Recommendations at a glance: 7 × EXTEND (insufficient data), 1 × ABANDON (guardrail failure in daily-astrostylelite-markdown-spellcheck).

⚠️ Note on chart generation: The runner environment uses PyPy 3.10 without pre-installed numpy/matplotlib, so bar charts could not be generated this run.
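
Where numpy/matplotlib are unavailable, a dependency-free fallback could render success rates as text bars. The sketch below is illustrative only: `text_bar_chart` and its `width` parameter are hypothetical names, not part of the report generator, and the sample values are the success rates from the smoke-copilot table in this report.

```python
def text_bar_chart(values, width=20):
    """Render {label: fraction in [0, 1]} as fixed-width text bars.

    Hypothetical helper: uses only the standard library, so it also runs
    on interpreters without numpy/matplotlib installed.
    """
    label_width = max(len(label) for label in values)
    lines = []
    for label, fraction in values.items():
        filled = round(fraction * width)
        bar = "#" * filled + "." * (width - filled)
        lines.append(f"{label.ljust(label_width)} |{bar}| {fraction:.1%}")
    return "\n".join(lines)


# Success rates taken from the smoke-copilot table in this report.
chart = text_bar_chart({"yes (ctrl)": 0.80, "no": 0.40})
print(chart)
```

Each bar is scaled to `width` characters, so rows stay aligned regardless of label length.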

`prompt_style` · `daily-astrostylelite-markdown-spellcheck` · 🔴 ABANDON

View ASCII comparison table

Experiment : prompt_style
Workflow   : daily-astrostylelite-markdown-spellcheck
Hypothesis : (not specified)
min_samples: 30 per variant

+--------------------+-------+--------+---------+-------------------+------------+----------------+
| Variant            |  n    | Succ%  | Mean(s) | 95% CI (s)        | p-success  | min_samples    |
+--------------------+-------+--------+---------+-------------------+------------+----------------+
| detailed  (ctrl)   |  7    | 85.7%  |  395    | [281, 508]        | (ref)      | 7/30 (23%)     |
| concise            |  1    | 100.0% |  478    | N/A               | 0.686      | 1/30 (3%)      |
+--------------------+-------+--------+---------+-------------------+------------+----------------+
p-value: two-tailed vs control. * p<0.05  ** p<0.01  *** p<0.001

Guardrails:
  run_success_rate >=0.90 : detailed=FAIL(0.86), concise=PASS(1.00)

Recommendation: ABANDON
Rationale     : Control variant "detailed" fails the run_success_rate >=0.90 guardrail (6/7 = 85.7%).

The detailed (control) variant fails the declared run_success_rate ≥ 0.90 guardrail at 85.7% (6 of 7 runs succeeded). The concise variant has only 1 run, which is insufficient to evaluate. Investigate the root cause of the failed run before continuing this experiment.
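
The guardrail evaluation above can be sketched as a simple threshold check. This is a minimal illustration, assuming a guardrail is a per-variant lower bound on success rate; `check_guardrail` and the data layout are hypothetical and not the report generator's actual code.

```python
def check_guardrail(successes, runs, threshold=0.90):
    """Return (status, rate) for a run_success_rate >= threshold guardrail.

    Hypothetical helper; a variant with no runs cannot be evaluated.
    """
    if runs == 0:
        return "N/A", None
    rate = successes / runs
    return ("PASS" if rate >= threshold else "FAIL"), rate


# Counts from the table above: detailed = 6/7 successes, concise = 1/1.
variants = {"detailed": (6, 7), "concise": (1, 1)}
results = {name: check_guardrail(s, n) for name, (s, n) in variants.items()}

for name, (status, rate) in results.items():
    print(f"{name}={status}({rate:.2f})")
```

With the counts from the table, this reproduces the `detailed=FAIL(0.86), concise=PASS(1.00)` line shown under Guardrails.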

`prompt_style` · `daily-community-attribution` · 🟡 EXTEND

View ASCII comparison table

Experiment : prompt_style
Workflow   : daily-community-attribution
Hypothesis : (not specified)
min_samples: 20 per variant

+--------------------+-------+--------+---------+-------------------+------------+----------------+
| Variant            |  n    | Succ%  | Mean(s) | 95% CI (s)        | p-success  | min_samples    |
+--------------------+-------+--------+---------+-------------------+------------+----------------+
| concise   (ctrl)   |  4    | 100.0% |  525    | [131, 919]        | (ref)      | 4/20 (20%)     |
| verbose            |  4    | 100.0% |  399    | [169, 630]        | N/A        | 4/20 (20%)     |
+--------------------+-------+--------+---------+-------------------+------------+----------------+

Recommendation: EXTEND
Rationale     : Insufficient data — both variants have fewer than 5 runs (n=4 each).

Both variants show a 100% success rate over 4 runs each. verbose runs ~126s faster on average (399s vs 525s). A success-rate p-value cannot be computed when both proportions equal 1.0, since there is no variation in outcomes to test. More runs are needed.
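
The p-values reported elsewhere in this report (0.686 and 0.197) are consistent with a pooled two-proportion z-test, although the report does not name the test it uses, so treat that as an inference. The sketch below, using the hypothetical name `two_proportion_p_value`, shows why the statistic is undefined in this case: when both variants succeed on every run, the pooled standard error is zero.

```python
import math


def two_proportion_p_value(s1, n1, s2, n2):
    """Two-tailed p-value from a pooled two-proportion z-test.

    Returns None when the test is undefined: an empty variant, or a pooled
    success rate of exactly 0 or 1, which makes the standard error zero.
    Assumption: the report may compute its p-values differently.
    """
    if n1 == 0 or n2 == 0:
        return None
    pooled = (s1 + s2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if std_err == 0:
        return None
    z = (s1 / n1 - s2 / n2) / std_err
    # Two-tailed tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


print(two_proportion_p_value(4, 4, 4, 4))  # both variants at 100%
print(two_proportion_p_value(6, 7, 1, 1))  # markdown-spellcheck counts
print(two_proportion_p_value(4, 5, 2, 5))  # smoke-copilot counts
```

The first call returns `None`; the other two round to the 0.686 and 0.197 shown in the corresponding tables.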

`output_format` · `daily-issues-report` · 🟡 EXTEND

View ASCII comparison table

Experiment : output_format
Workflow   : daily-issues-report
Hypothesis : H0: no change in engagement. H1: inline ≥20% higher reactions+replies
min_samples: 20 per variant

+--------------------+-------+--------+---------+-------------------+------------+----------------+
| Variant            |  n    | Succ%  | Mean(s) | 95% CI (s)        | p-success  | min_samples    |
+--------------------+-------+--------+---------+-------------------+------------+----------------+
| collapsible (ctrl) |  3    | 0.0%   |  310    | [261, 360]        | (ref)      | 3/20 (15%)     |
| inline             |  4    | 0.0%   |  405    | [125, 685]        | N/A        | 4/20 (20%)     |
+--------------------+-------+--------+---------+-------------------+------------+----------------+

Recommendation: EXTEND
Rationale     : Insufficient data — at least one variant has fewer than 5 runs.

⚠️ All 7 recorded runs are failures (0% success across both variants). This suggests a persistent workflow bug unrelated to the A/B treatment. Recommend investigating the root cause before collecting more experiment data.

`prompt_style` · `daily-news` · 🟡 EXTEND

View ASCII comparison table

Experiment : prompt_style
Workflow   : daily-news
min_samples: 20 per variant

+--------------------+-------+--------+---------+-------------------+------------+----------------+
| Variant            |  n    | Succ%  | Mean(s) | 95% CI (s)        | p-success  | min_samples    |
+--------------------+-------+--------+---------+-------------------+------------+----------------+
| detailed  (ctrl)   |  2    | 0.0%   |  309    | [182, 436]        | (ref)      | 2/20 (10%)     |
| concise            |  0    | N/A    |  N/A    | N/A               | N/A        | 0/20 (0%)      |
+--------------------+-------+--------+---------+-------------------+------------+----------------+

Recommendation: EXTEND
Rationale     : Insufficient data — at least one variant has fewer than 5 runs.

Only the detailed variant has runs (2 failures). concise has 0 runs. Very early stage.

`reasoning_depth` · `daily-security-red-team` · 🟡 EXTEND

View ASCII comparison table

Experiment : reasoning_depth
Workflow   : daily-security-red-team
min_samples: 20 per variant

+--------------------+-------+--------+---------+-------------------+------------+----------------+
| Variant            |  n    | Succ%  | Mean(s) | 95% CI (s)        | p-success  | min_samples    |
+--------------------+-------+--------+---------+-------------------+------------+----------------+
| single_pass (ctrl) |  0    | N/A    |  N/A    | N/A               | (ref)      | 0/20 (0%)      |
| iterative          |  1    | 100.0% |  333    | N/A               | N/A        | 1/20 (5%)      |
+--------------------+-------+--------+---------+-------------------+------------+----------------+

Recommendation: EXTEND
Rationale     : Insufficient data — at least one variant has fewer than 5 runs.

Only 1 run total, assigned to iterative. Experiment just started.

`output_format` · `deep-report` · 🟡 EXTEND

View ASCII comparison table

Experiment : output_format
Workflow   : deep-report
min_samples: 20 per variant

+--------------------+-------+--------+---------+-------------------+------------+----------------+
| Variant            |  n    | Succ%  | Mean(s) | 95% CI (s)        | p-success  | min_samples    |
+--------------------+-------+--------+---------+-------------------+------------+----------------+
| exec_brief  (ctrl) |  2    | 100.0% |  762    | [-45, 1568]       | (ref)      | 2/20 (10%)     |
| full_briefing      |  3    | 100.0% |  770    | [545, 996]        | N/A        | 3/20 (15%)     |
+--------------------+-------+--------+---------+-------------------+------------+----------------+
Note: annotated_brief variant exists in config but has 0 runs.

Recommendation: EXTEND
Rationale     : Insufficient data — both variants have fewer than 5 runs.

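All 5 runs succeeded. Means are nearly identical (762s vs 770s). CIs are wide due to small n; the negative lower bound for exec_brief (-45s) is an artifact of a symmetric interval at n=2, not a possible duration. annotated_brief (3rd variant) has 0 runs so far.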
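
To illustrate why small n produces such wide intervals, the sketch below computes a Student-t 95% confidence interval for a mean. It assumes the report uses a t-interval, which it does not state; `mean_ci95` is a hypothetical helper, and the sample durations are invented for illustration rather than taken from the recorded runs. The critical values are the standard two-tailed 95% t quantiles.

```python
import math
import statistics

# Two-tailed 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447}


def mean_ci95(samples):
    """Return (mean, low, high) for a t-based 95% CI, or None if n < 2.

    Hypothetical helper; supports only 2 <= n <= 7 via the lookup table above.
    """
    n = len(samples)
    if n < 2:
        return None  # a single run has no sample variance to estimate
    mean = statistics.mean(samples)
    half_width = T_CRIT_95[n - 1] * statistics.stdev(samples) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Invented durations in seconds: two runs 130 s apart.
mean, low, high = mean_ci95([700, 830])
print(f"mean={mean:.0f}s, 95% CI=[{low:.0f}, {high:.0f}]")
```

With only two runs the critical value is 12.706, so even a modest spread between runs yields an interval whose lower bound falls below zero, as in the exec_brief row.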

`prompt_style` · `issue-arborist` · 🟡 EXTEND

View ASCII comparison table

Experiment : prompt_style
Workflow   : issue-arborist
min_samples: 30 per variant

+--------------------+-------+--------+---------+-------------------+------------+----------------+
| Variant            |  n    | Succ%  | Mean(s) | 95% CI (s)        | p-success  | min_samples    |
+--------------------+-------+--------+---------+-------------------+------------+----------------+
| concise   (ctrl)   |  1    | 0.0%   |  202    | N/A               | (ref)      | 1/30 (3%)      |
| detailed           |  0    | N/A    |  N/A    | N/A               | N/A        | 0/30 (0%)      |
+--------------------+-------+--------+---------+-------------------+------------+----------------+

Recommendation: EXTEND
Rationale     : Insufficient data — at least one variant has fewer than 5 runs.

Only 1 run (failure) for concise. Experiment just started.

`caveman` · `smoke-copilot` · 🟡 EXTEND

View ASCII comparison table

Experiment : caveman
Workflow   : smoke-copilot
min_samples: 20 per variant

+--------------------+-------+--------+---------+-------------------+------------+----------------+
| Variant            |  n    | Succ%  | Mean(s) | 95% CI (s)        | p-success  | min_samples    |
+--------------------+-------+--------+---------+-------------------+------------+----------------+
| yes       (ctrl)   |  5    | 80.0%  | 1264    | [261, 2266]       | (ref)      | 5/20 (25%)     |
| no                 |  5    | 40.0%  | 1177    | [641, 1714]       | 0.197      | 5/20 (25%)     |
+--------------------+-------+--------+---------+-------------------+------------+----------------+
Note: State shows 8 runs/variant total; 5 per variant correlated from recent_runs.
      3 runs/variant could not be correlated (not in recent_runs window).

Recommendation: EXTEND
Rationale     : Below min_samples (20): yes: 5/20, no: 5/20.

Most balanced experiment: 5 correlated runs per variant (8 per variant in state; 6 total uncorrelated). yes leads at 80% vs no at 40% success, but p=0.197 — not significant with current sample size. Wide duration CIs reflect high variance (both variants occasionally have very long runs >30 min). Needs 15 more runs per variant before conclusions can be drawn.

📊 Summary Table

| Workflow | Experiment | n (ctrl/var) | Succ% (ctrl/var) | p-value | Recommendation |
|---|---|---|---|---|---|
| daily-astrostylelite-markdown-spellcheck | prompt_style | 7/1 | 85.7%/100% | 0.686 | 🔴 ABANDON (guardrail) |
| daily-community-attribution | prompt_style | 4/4 | 100%/100% | N/A | 🟡 EXTEND |
| daily-issues-report | output_format | 3/4 | 0%/0% | N/A | 🟡 EXTEND ⚠️ all failing |
| daily-news | prompt_style | 2/0 | 0%/N/A | N/A | 🟡 EXTEND |
| daily-security-red-team | reasoning_depth | 0/1 | N/A/100% | N/A | 🟡 EXTEND |
| deep-report | output_format | 2/3 | 100%/100% | N/A | 🟡 EXTEND |
| issue-arborist | prompt_style | 1/0 | 0%/N/A | N/A | 🟡 EXTEND |
| smoke-copilot | caveman | 5/5 | 80%/40% | 0.197 | 🟡 EXTEND |

References:

§25788606097

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

proxy.golang.org

To allow this domain, add it to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "proxy.golang.org"

See Network Configuration for more information.

Generated by daily-experiment-report · ● 53.6M

expires on May 16, 2026, 9:07 AM UTC

2026-05-14T09:04:44Z

github-actions[bot]
Bot May 14, 2026
Author

This discussion has been marked as outdated by daily-experiment-report.

A newer discussion is available at Discussion #32085.
