Tune NN server threads and max batch size in genconfig #1201

Open
zsqdx wants to merge 2 commits into lightvector:master from zsqdx:tune-nnserver-threads-genconfig

zsqdx commented May 26, 2026

Summary

  • Let genconfig tune numNNServerThreadsPerModel independently of GPU count, restricted to a small candidate set (1, 2, or 4 on a single GPU).
  • Add genconfig tuning for nnMaxBatchSize, choosing among conservative batch/profile candidates by best nnEvals/s after search-thread and NN-server-thread tuning.
  • Preserve existing numNNServerThreadsPerModel and nnMaxBatchSize values when overwriting a config and skipping performance tuning.
  • Recommend TensorRT plan cache for repeated benchmark/genconfig runs via benchmark/genconfig output and generated/example config comments.
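
The selection logic described in this list can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `NNSettings`, `MeasureFn`, `tuneNNSettings`, and `chooseNNSettings` are hypothetical names, and the `measure` callback stands in for whatever benchmark run genconfig uses to obtain nnEvals/s. The sketch assumes search-thread tuning has already happened, then tunes NN server threads, then max batch size, keeping a candidate only when it measures strictly faster. When tuning is skipped, the existing values are returned unchanged.

```cpp
#include <functional>
#include <vector>

// Hypothetical settings bundle; field names mirror the config keys named above.
struct NNSettings {
  int numNNServerThreadsPerModel;
  int nnMaxBatchSize;
};

// Hypothetical benchmark hook: returns the measured nnEvals/s for the settings.
using MeasureFn = std::function<double(const NNSettings&)>;

// Staged tuning (assumed order): NN server threads first, then max batch size.
// A candidate replaces the current best only if it is strictly faster.
NNSettings tuneNNSettings(
  const NNSettings& start,
  const std::vector<int>& threadCandidates,
  const std::vector<int>& batchCandidates,
  const MeasureFn& measure
) {
  NNSettings best = start;
  double bestRate = measure(best);

  for(int threads : threadCandidates) {
    NNSettings trial = best;
    trial.numNNServerThreadsPerModel = threads;
    double rate = measure(trial);
    if(rate > bestRate) {
      bestRate = rate;
      best = trial;
    }
  }

  for(int batch : batchCandidates) {
    NNSettings trial = best;
    trial.nnMaxBatchSize = batch;
    double rate = measure(trial);
    if(rate > bestRate) {
      bestRate = rate;
      best = trial;
    }
  }
  return best;
}

// When performance tuning is skipped, keep whatever the existing config had.
NNSettings chooseNNSettings(
  const NNSettings& existing,
  bool skipTuning,
  const std::vector<int>& threadCandidates,
  const std::vector<int>& batchCandidates,
  const MeasureFn& measure
) {
  if(skipTuning)
    return existing;
  return tuneNNSettings(existing, threadCandidates, batchCandidates, measure);
}
```

Because each stage only accepts a strictly better measurement, a noisy benchmark that merely ties the current setting leaves the config unchanged, which keeps the tuner conservative.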

Motivation

On recent fast TensorRT systems, inference throughput can be sensitive to both the number of backend contexts and the TensorRT max-batch/profile shape. In Blackwell + TensorRT fdx6d testing, nnMaxBatchSize materially changed nnEvals/s, and values above the default were not always best, so genconfig should measure this instead of relying only on a static heuristic.

TensorRT plan caching also greatly reduces repeated startup/tuning time for fixed model/GPU/batch settings, so the workflow now points users toward -DUSE_CACHE_TENSORRT_PLAN=1 where appropriate.

Tests

  • cmake --build .\cpp\build-opencl-codex --target katago -j 4
  • From cpp/: .\build-opencl-codex\katago.exe runtests
  • From cpp/: .\build-opencl-codex\katago.exe runoutputtests


zsqdx commented May 26, 2026

Additional validation on the Blackwell TensorRT test box:

  • GPU: RTX PRO 6000 Blackwell Workstation Edition
  • CUDA: 13.2
  • TensorRT: 10.16.1
  • Branch/commit: tune-nnserver-threads-genconfig / 8f6f705b
  • Commands:
    • cmake .. -DUSE_BACKEND=TENSORRT -DCMAKE_BUILD_TYPE=Release
    • cmake --build .

Build completed successfully. The TensorRT deprecation warnings are pre-existing for this backend/version combination.

zsqdx changed the title from "Tune numNNServerThreadsPerModel in genconfig" to "Tune NN server threads and max batch size in genconfig" May 26, 2026

zsqdx commented May 26, 2026

Update after additional inference-side tuning work:

  • Added genconfig tuning for nnMaxBatchSize, selecting by nnEvals/s after numSearchThreads and numNNServerThreadsPerModel tuning.
  • Kept the NN server thread candidate set small (1, 2, or 4 for a single GPU), consistent with the Blackwell TensorRT observation that higher counts fragmented batches.
  • Added TensorRT plan-cache recommendations to benchmark/genconfig output and generated/example config comments.

Validation:

  • Windows OpenCL build: cmake --build .\cpp\build-opencl-codex --target katago -j 4
  • Windows OpenCL tests: runtests, runoutputtests
  • Remote RTX PRO 6000 Blackwell TensorRT build: cmake --build build-trt --target katago -j 16
