feat: eval results #23

Draft
Ki-Seki wants to merge 69 commits into main from feat/eval-results

Ki-Seki commented Jan 7, 2026

No description provided.

Copilot AI review requested due to automatic review settings January 7, 2026 15:38

Ki-Seki commented Jan 11, 2026

Note: Eval Bench Sizes

#(Sculpt-AI/GIM-SFT/*/train) = 3157829
#(Idavidrein/gpqa/gpqa_diamond/train) = 198
#(openlifescienceai/medmcqa/validation) = 4183
#(TIGER-Lab/MMLU-Pro/test) = 12102
#(allenai/qasc/validation) = 920

Ki-Seki marked this pull request as draft February 6, 2026 03:56
Ki-Seki and others added 18 commits February 8, 2026 19:36
* chore: backup results

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* chore: remove gim prompt about gim models

* chore: update api model eval scripts

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* chore: fix api model eval scripts

* chore: fix api model eval scripts

* chore: ignore eval.log.* files (#76)

* feat: add upper bound to auto reason_budget (#78)

* chore: backup

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Shichao Song <60967965+Ki-Seki@users.noreply.github.com>
…odel configurations (#90)

* feat: add humaneval infilling benchmark support

Agent-Logs-Url: https://github.com/SculptAI/GIMBench/sessions/3aa3039a-4ca0-489f-be58-5c1f5197d1ad

Co-authored-by: Ki-Seki <60967965+Ki-Seki@users.noreply.github.com>

* feat: add code infilling eval task for humaneval_infilling (generate + unit test execution)

Agent-Logs-Url: https://github.com/SculptAI/GIMBench/sessions/7bc171d3-b30c-4480-b88b-a18cb8dbd051

Co-authored-by: Ki-Seki <60967965+Ki-Seki@users.noreply.github.com>

* fix: update dataset path and simplify loading logic for humaneval infilling

* feat: enhance query formatting for code infilling with markdown code fences

* feat: add eval script for humaneval infilling with multiple model configurations

* upload eval results

* feat: add code extraction method for fenced code blocks in CommonCodeInfillingEvaluator

* update eval results

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Duguce <zhgyqc@163.com>
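
For context on the fenced-code extraction step mentioned in the commits above, here is a minimal sketch of the general technique. It is not the code from this PR: the function name `extract_fenced_code` and the fallback of returning the text unchanged are assumptions for illustration, and the actual method on `CommonCodeInfillingEvaluator` may behave differently.

```python
import re

# Build the fence marker programmatically so this example can itself live inside a fenced block.
FENCE = "`" * 3

# Opening fence with an optional language tag, then the body (non-greedy), then the closing fence.
_FENCE_RE = re.compile(
    re.escape(FENCE) + r"[^\n`]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_fenced_code(text: str) -> str:
    """Return the body of the first fenced code block in `text`.

    Hypothetical helper: if no fenced block is found, the text is returned
    unchanged so that a bare code completion is still usable.
    """
    match = _FENCE_RE.search(text)
    if match is None:
        return text
    # Drop only the trailing newline before the closing fence; keep indentation intact.
    return match.group(1).rstrip("\n")


if __name__ == "__main__":
    reply = "Here is the infill:\n" + FENCE + "python\n    return x + 1\n" + FENCE + "\nDone."
    print(repr(extract_fenced_code(reply)))
```

Leading indentation inside the fence is preserved on purpose, since an infilled span usually has to line up with the surrounding code before the unit tests run.
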
* feat: add throughput (tokens/s) and TTFT metrics for ppl and match evaluators

Agent-Logs-Url: https://github.com/SculptAI/GIMBench/sessions/755d574f-33d3-4352-acfb-cf63ce27d892

Co-authored-by: Ki-Seki <60967965+Ki-Seki@users.noreply.github.com>

* feat: move counter_tokenizer argument

* feat: refactor TTFT measurement for OpenAI and vllm models

* feat: add evaluation script for vllm models with various output types

* feat: add timing metrics recording option to evaluators

* feat: add evaluation script for vllm models with timing metrics

* upload results

* update eval script

* add results

* add results

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
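
For readers skimming the timing commits above, the sketch below shows one common way to derive TTFT (time to first token) and throughput (tokens/s) from a streamed response. It is an illustration under assumptions, not the evaluator code in this PR: `TimingMetrics`, `measure_stream`, and the injectable `clock` parameter are hypothetical names, and each streamed item is counted as one token, whereas the commits suggest the PR counts tokens with a separate tokenizer (`counter_tokenizer`).

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class TimingMetrics:
    ttft_s: Optional[float]  # seconds from start to first streamed item; None if the stream was empty
    total_s: float           # seconds from start to end of the stream
    num_tokens: int          # number of streamed items (assumed here to be one token each)
    tokens_per_s: float      # num_tokens / total_s, or 0.0 when total_s is 0


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> TimingMetrics:
    """Consume `stream` and record TTFT and throughput (hypothetical helper)."""
    start = clock()
    first_at: Optional[float] = None
    count = 0
    for _ in stream:
        if first_at is None:
            first_at = clock()  # timestamp of the first item only
        count += 1
    total = clock() - start
    return TimingMetrics(
        ttft_s=None if first_at is None else first_at - start,
        total_s=total,
        num_tokens=count,
        tokens_per_s=count / total if total > 0 else 0.0,
    )


if __name__ == "__main__":
    print(measure_stream(iter(["Hello", ",", " world"])))
```

Passing the clock in as a parameter keeps the measurement deterministic under test; in real use the default `time.perf_counter` is a monotonic, high-resolution timer suited to interval measurement.
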
* upload eval results

* upload

3 participants