Tune NN server threads and max batch size in genconfig by zsqdx · Pull Request #1201 · lightvector/KataGo

zsqdx · 2026-05-26T22:20:05Z

Summary

- Let genconfig tune numNNServerThreadsPerModel separately from GPU count, capped to a small candidate set (1, 2, 4 on a single GPU).
- Add genconfig tuning for nnMaxBatchSize, choosing among conservative batch/profile candidates by best nnEvals/s after search-thread and NN-server-thread tuning.
- Preserve existing numNNServerThreadsPerModel and nnMaxBatchSize values when overwriting a config and skipping performance tuning.
- Recommend the TensorRT plan cache for repeated benchmark/genconfig runs via benchmark/genconfig output and generated/example config comments.
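
The staged selection described in the summary can be sketched roughly as follows. This is a minimal illustration in Python, not the genconfig code from this PR: `pick_best`, `tune_staged`, and the `bench` callback are hypothetical names, and the batch-size candidates in the usage note are placeholders because the PR text does not list the actual values. The sketch assumes numSearchThreads has already been tuned and that `bench` returns a measured nnEvals/s for a given settings dict.

```python
from typing import Callable, Dict, Sequence

# Hypothetical benchmark hook: takes a settings dict, returns measured nnEvals/s.
Benchmark = Callable[[Dict[str, int]], float]


def pick_best(key: str, candidates: Sequence[int],
              settings: Dict[str, int], bench: Benchmark) -> int:
    """Return the candidate value for `key` with the highest nnEvals/s.

    All other settings are held fixed. On a tie, the earliest candidate wins.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    best_value, best_rate = candidates[0], float("-inf")
    for value in candidates:
        trial = dict(settings, **{key: value})  # copy with one key changed
        rate = bench(trial)
        if rate > best_rate:
            best_value, best_rate = value, rate
    return best_value


def tune_staged(settings: Dict[str, int],
                thread_candidates: Sequence[int],
                batch_candidates: Sequence[int],
                bench: Benchmark) -> Dict[str, int]:
    """Tune NN server threads first, then max batch size, one key at a time."""
    tuned = dict(settings)  # leave the caller's dict untouched
    tuned["numNNServerThreadsPerModel"] = pick_best(
        "numNNServerThreadsPerModel", thread_candidates, tuned, bench)
    tuned["nnMaxBatchSize"] = pick_best(
        "nnMaxBatchSize", batch_candidates, tuned, bench)
    return tuned
```

A call might look like `tune_staged(settings, [1, 2, 4], [16, 32, 64], bench)`, where `[1, 2, 4]` is the single-GPU thread set named above and `[16, 32, 64]` is an invented batch set. Tuning one key at a time keeps the number of benchmark runs at the sum of the candidate counts rather than their product, at the cost of possibly missing a better combined setting.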

Motivation

On recent fast TensorRT systems, inference throughput can be sensitive to both the number of backend contexts and the TensorRT max-batch/profile shape. In Blackwell + TensorRT fdx6d testing, nnMaxBatchSize materially changed nnEvals/s, and values above the default were not always best, so genconfig should measure this instead of relying only on a static heuristic.

TensorRT plan caching also greatly reduces repeated startup/tuning time for fixed model/GPU/batch settings, so the workflow now points users toward -DUSE_CACHE_TENSORRT_PLAN=1 where appropriate.

Tests

- cmake --build .\cpp\build-opencl-codex --target katago -j 4
- From cpp/: .\build-opencl-codex\katago.exe runtests
- From cpp/: .\build-opencl-codex\katago.exe runoutputtests

zsqdx · 2026-05-26T22:21:28Z

Additional validation on the Blackwell TensorRT test box:

GPU: RTX PRO 6000 Blackwell Workstation Edition
CUDA: 13.2
TensorRT: 10.16.1
Branch/commit: tune-nnserver-threads-genconfig / 8f6f705b
Commands:
- cmake .. -DUSE_BACKEND=TENSORRT -DCMAKE_BUILD_TYPE=Release
- cmake --build .

Build completed successfully. The TensorRT deprecation warnings are pre-existing for this backend/version combination.

zsqdx · 2026-05-26T23:06:23Z

Update after additional inference-side tuning work:

- Added genconfig tuning for nnMaxBatchSize, selecting by nnEvals/s after numSearchThreads and numNNServerThreadsPerModel tuning.
- Kept the NN server thread candidate set small (1, 2, 4 for a single GPU), matching the Blackwell TensorRT observations that higher counts fragmented batches.
- Added TensorRT plan-cache recommendations to benchmark/genconfig output and generated/example config comments.
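
The batch-fragmentation effect mentioned above can be illustrated with rough arithmetic. This is an idealized model, not a measurement from this PR: it assumes each search thread has at most one NN request outstanding at a time and that requests spread evenly across server threads. The function name and the thread counts in the usage note are hypothetical.

```python
import math


def max_batch_per_server_thread(num_search_threads: int,
                                num_nn_server_threads: int,
                                nn_max_batch_size: int) -> int:
    """Upper bound on the batch one NN server thread can assemble.

    Idealized assumption: at most one outstanding request per search thread,
    spread evenly over the server threads, then capped by nnMaxBatchSize.
    """
    if num_search_threads < 1 or num_nn_server_threads < 1:
        raise ValueError("thread counts must be positive")
    even_share = math.ceil(num_search_threads / num_nn_server_threads)
    return min(nn_max_batch_size, even_share)
```

With 32 search threads and nnMaxBatchSize 64, this bound is 32 requests for one server thread, 16 for two, 8 for four, and 4 for eight, so under these assumptions each added server thread shrinks the batches the GPU sees.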

Validation:

- Windows OpenCL build: cmake --build .\cpp\build-opencl-codex --target katago -j 4
- Windows OpenCL tests: runtests, runoutputtests
- Remote RTX PRO 6000 Blackwell TensorRT build: cmake --build build-trt --target katago -j 16

Commits:
- 8f6f705: Tune NN server threads in genconfig
- ba7f5ba: Tune NN max batch size in genconfig

zsqdx changed the title ~~Tune numNNServerThreadsPerModel in genconfig~~ Tune NN server threads and max batch size in genconfig May 26, 2026

zsqdx wants to merge 2 commits into lightvector:master from zsqdx:tune-nnserver-threads-genconfig
