feat: eval results by Ki-Seki · Pull Request #23 · SculptAI/GIMBench

Ki-Seki · 2026-01-07T15:38:27Z

No description provided.


Ki-Seki · 2026-01-11T09:14:45Z

Note: Eval Bench Sizes

#(Sculpt-AI/GIM-SFT/*/train) = 3157829
#(Idavidrein/gpqa/gpqa_diamond/train) = 198
#(openlifescienceai/medmcqa/validation) = 4183
#(TIGER-Lab/MMLU-Pro/test) = 12102
#(allenai/qasc/validation) = 920
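
The counts above can be tallied with a few lines of Python. This is only an illustrative sketch: the `#(path) = N` lines are copied from the note, while the helper `parse_sizes` and the choice to treat `Sculpt-AI/GIM-SFT/*/train` as the training set rather than an eval benchmark are assumptions, not something the PR states.

```python
import re

# Lines copied verbatim from the note above.
NOTE = """\
#(Sculpt-AI/GIM-SFT/*/train) = 3157829
#(Idavidrein/gpqa/gpqa_diamond/train) = 198
#(openlifescienceai/medmcqa/validation) = 4183
#(TIGER-Lab/MMLU-Pro/test) = 12102
#(allenai/qasc/validation) = 920
"""

# Matches "#(<dataset path>) = <count>".
LINE = re.compile(r"^#\((?P<path>[^)]+)\)\s*=\s*(?P<count>\d+)$")


def parse_sizes(text):
    """Hypothetical helper: map each dataset path to its example count."""
    sizes = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            sizes[match.group("path")] = int(match.group("count"))
    return sizes


sizes = parse_sizes(NOTE)

# Assumption: the GIM-SFT train split is training data, so it is left out
# of the evaluation total.
TRAIN_PATH = "Sculpt-AI/GIM-SFT/*/train"
eval_total = sum(n for path, n in sizes.items() if path != TRAIN_PATH)

print(f"eval examples: {eval_total}")
print(f"train examples: {sizes[TRAIN_PATH]}")
```

Under that assumption, the four remaining splits add up to 17403 examples.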


* chore: backup results
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* chore: remove gim prompt about gim models
* chore: update api model eval scripts
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* chore: fix api model eval scripts
* chore: fix api model eval scripts
* chore: ignore eval.log.* files (#76)
* feat: add upper bound to auto reason_budget (#78)
* chore: backup
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Shichao Song <60967965+Ki-Seki@users.noreply.github.com>

…odel configurations (#90)

* feat: add humaneval infilling benchmark support

  Agent-Logs-Url: https://github.com/SculptAI/GIMBench/sessions/3aa3039a-4ca0-489f-be58-5c1f5197d1ad
  Co-authored-by: Ki-Seki <60967965+Ki-Seki@users.noreply.github.com>
* feat: add code infilling eval task for humaneval_infilling (generate + unit test execution)

  Agent-Logs-Url: https://github.com/SculptAI/GIMBench/sessions/7bc171d3-b30c-4480-b88b-a18cb8dbd051
  Co-authored-by: Ki-Seki <60967965+Ki-Seki@users.noreply.github.com>
* fix: update dataset path and simplify loading logic for humaneval infilling
* feat: enhance query formatting for code infilling with markdown code fences
* feat: add eval script for humaneval infilling with multiple model configurations
* upload eval results
* feat: add code extraction method for fenced code blocks in CommonCodeInfillingEvaluator
* update eval results

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>


* feat: add throughput (tokens/s) and TTFT metrics for ppl and match evaluators

  Agent-Logs-Url: https://github.com/SculptAI/GIMBench/sessions/755d574f-33d3-4352-acfb-cf63ce27d892
  Co-authored-by: Ki-Seki <60967965+Ki-Seki@users.noreply.github.com>
* feat: move counter_tokenizer argument
* feat: refactor TTFT measurement for OpenAI and vllm models
* feat: add evaluation script for vllm models with various output types
* feat: add timing metrics recording option to evaluators
* feat: add evaluation script for vllm models with timing metrics
* upload results
* update eval script
* add results
* add results

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>


feat: add some old eval results

9c54e0d

Copilot AI review requested due to automatic review settings January 7, 2026 15:38

Ki-Seki added the do not merge label Jan 7, 2026

This comment was marked as outdated.


pre-commit-ci Bot and others added 4 commits January 7, 2026 15:41

[pre-commit.ci] auto fixes from pre-commit.com hooks

f5b2e57

for more information, see https://pre-commit.ci

Merge branch 'main' into feat/eval-results

a5a769a

feat: add Match @ Qwen3-1.7B

0f879cb

[pre-commit.ci] auto fixes from pre-commit.com hooks

bc7f050

for more information, see https://pre-commit.ci

Ki-Seki and others added 9 commits January 11, 2026 17:15

Merge branch 'main' into feat/eval-results

7d8f4cc

feat: add evaluation script

ec310b1

[pre-commit.ci] auto fixes from pre-commit.com hooks

11cd71e

for more information, see https://pre-commit.ci

chore: backup

7c8f021

chore: backup

b27f382

feat: add evaluation and README scripts for KDD experiments

76dedf3

Merge branch 'main' into feat/eval-results

c05a661

fix: update model names in evaluation scripts for consistency

29c4c1f

chore: backup

c9eef17

Ki-Seki force-pushed the feat/eval-results branch from 3dd512e to c9eef17 January 21, 2026 14:27

pre-commit-ci Bot and others added 11 commits January 21, 2026 14:30

[pre-commit.ci] auto fixes from pre-commit.com hooks

2950741

for more information, see https://pre-commit.ci

Merge branch 'main' into feat/eval-results

4167eb9

feat: update eval script to include model downloads and API configuration

d894217

chore: backup only

dc3dc95

Merge branch 'main' into feat/eval-results

266e570

feat: add evaluation scripts and README for KDD language modeling experiments

7ac1037

chore: backup results

43008fe

Merge branch 'main' into feat/eval-results

9c6a617

refactor: update evaluation script to use new module structure

abc0c28

feat: add new models to evaluation script

030ecdf

chore: backup results

6187512

Ki-Seki added 2 commits February 4, 2026 15:42

Merge branch 'main' into feat/eval-results

ff9192d

chore: backup

d7ec2ea

Ki-Seki marked this pull request as draft February 6, 2026 03:56

Ki-Seki and others added 18 commits February 8, 2026 19:36

chore: backup

b5cc868

Merge branch 'main' into feat/eval-results

1a03fca

Delete results/260128-kdd-expt-3 directory

59ba658

Merge branch 'main' into feat/eval-results

1378edc

Merge branch 'main' into feat/eval-results

476f37f

Merge branch 'main' into feat/eval-results

184d7be

Merge branch 'main' into feat/eval-results

c4f8802

feat: upload rebuttal match models related results (#91)

4d2b025

Co-authored-by: Duguce <zhgyqc@163.com>

For rebuttal, long context eval (#92)

6e301f9

For rebuttal, AR baseline (#95)

0dac739

* upload eval results * upload

Merge branch 'main' into feat/eval-results

9a466de

Add SCIERC eval results

6501822

refactor

570b89d

Merge branch 'main' into feat/eval-results

0ecd27c

Add evaluation script and aggregated results CSV for KDD rebuttal ablation

18c921b

This comment was marked as outdated.


Ki-Seki and others added 8 commits April 10, 2026 16:49

Update eval script to mask API key and remove unnecessary comments

63f0676

Merge branch 'main' into feat/eval-results

e03c332

Add evaluation script for gemma-4 with multiple model evaluations

a8468d9

Merge branch 'main' into feat/eval-results

43a7184

Add API key and base URL variables to eval.sh

b570781

Co-authored-by: Copilot <copilot@github.com>

update eval results

89613ff

add rlvr related

9bb49d1

[pre-commit.ci] auto fixes from pre-commit.com hooks

a552dc4

for more information, see https://pre-commit.ci


Ki-Seki wants to merge 69 commits into main from feat/eval-results


3 participants
