NVIDIA-NeMo · eric-tramel · Jun 5, 2026
@@ -130,7 +130,7 @@ DatasetBuilder.build()
   → collect TaskTraces, emit telemetry
 ```
 
-Row-group admission is fixed by default in the dataset-builder path: the configured row-group concurrency is the hard in-flight cap. The scheduler also has an internal adaptive row-group mode for direct use that only raises a soft target up to that cap; it is additive ramp-up, not AIMD shrink/recovery behavior.
+Row-group admission is fixed by default in the dataset-builder path: the default row-group concurrency is the hard in-flight cap. Public async runs can override this admission horizon with `RunConfig.row_group_admission`. In fixed mode, `max_concurrent_row_groups` is the hard in-flight cap. The historical default fixed horizon is limited by row-group count only, while widened fixed horizons derive an active-row guard when `max_admitted_rows` is omitted. Adaptive mode starts from `adaptive_initial_target` and only raises a soft target up to that cap; this is additive ramp-up, not AIMD shrink/recovery behavior. Adaptive mode also derives an active-row guard when `max_admitted_rows` is omitted and rejects row groups above that guard, so wide row groups cannot silently admit unbounded active state.
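To illustrate the "additive ramp-up, not AIMD" distinction in the paragraph above, here is a minimal sketch. `SoftRowGroupTarget`, `raise_target`, and `can_admit` are hypothetical names invented for this example and are not the scheduler's API. The condition that triggers a raise is not described in this diff, so the sketch simply raises unconditionally. The only properties it aims to show are that the soft target moves upward in fixed steps, never exceeds the hard cap, and never shrinks.

```python
from dataclasses import dataclass


@dataclass
class SoftRowGroupTarget:
    """Hypothetical model of an adaptive row-group target (illustration only)."""

    hard_cap: int    # plays the role of max_concurrent_row_groups
    target: int = 1  # plays the role of adaptive_initial_target

    def __post_init__(self) -> None:
        if not 1 <= self.target <= self.hard_cap:
            raise ValueError("initial target must be between 1 and hard_cap")

    def raise_target(self, step: int = 1) -> int:
        """Additively raise the soft target, clamped at the hard cap."""
        self.target = min(self.target + step, self.hard_cap)
        return self.target

    def can_admit(self, active_row_groups: int) -> bool:
        """Admit a new row group only while below the current soft target."""
        return active_row_groups < self.target


ramp = SoftRowGroupTarget(hard_cap=8, target=2)
# Record the initial target, then eight unconditional raises.
history = [ramp.target] + [ramp.raise_target() for _ in range(8)]
print(history)
```

There is deliberately no method that lowers `target`: under this reading, shrink and recovery belong to request admission, not to row-group admission.
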
 
 When request admission is available, async scheduling may use request-pressure snapshots as a read-only advisory during fair-queue selection. A request-pressured task can be skipped for an eligible peer without mutating request-admission state; provider/model/domain request limits remain owned by request admission.
 

@@ -99,7 +99,8 @@ Within each column, cells are processed **in parallel** up to the configured lim
 
 ### Concurrency Formula
 
-At any moment, the number of concurrent LLM requests is:
+On the sync engine, each batch is processed one column at a time. At any moment,
+the number of concurrent LLM requests is:
 
 ```python
 concurrent_requests = min(
@@ -109,6 +110,21 @@ concurrent_requests = min(
 )
 ```
 
+On the async engine, ready cells can come from multiple active row groups:
+
+```python
+concurrent_requests = min(
+    active_ready_model_cells,   # Ready cells across admitted row groups
+    current_request_limit,      # AIMD-managed limit (≤ max_parallel_requests)
+    max_in_flight_tasks         # Scheduler task-lease ceiling
+)
+```
+
+`active_ready_model_cells` is bounded by row-group admission
+(`max_concurrent_row_groups` and the effective `max_admitted_rows` guard), by
+which DAG dependencies have become ready, and by any rows already dropped by
+processors or failures.
+
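The async formula above can be checked with a few concrete numbers. The helper below is a sketch written for this example, not a function from the library. The input values are made up to show which term becomes the binding limit in each case.

```python
def async_concurrent_requests(
    active_ready_model_cells: int,
    current_request_limit: int,
    max_in_flight_tasks: int,
) -> int:
    """Evaluate the async-engine concurrency formula for one point in time."""
    return min(active_ready_model_cells, current_request_limit, max_in_flight_tasks)


# Plenty of ready work: the request limit is the binding term.
request_bound = async_concurrent_requests(8000, 64, 4096)

# Few ready cells (for example, a narrow admission horizon): ready work binds.
work_bound = async_concurrent_requests(12, 64, 4096)

# Very high request limit: the scheduler task-lease ceiling binds.
lease_bound = async_concurrent_requests(8000, 10_000, 4096)

print(request_bound, work_bound, lease_bound)
```

In the second case, raising `max_parallel_requests` would not change throughput, because the scheduler has too little ready work to fill the limit; widening row-group admission is the lever that affects that term.
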
 `max_parallel_requests` sets the **ceiling**. The actual limit (`current_request_limit`) is managed at runtime by adaptive request admission that reacts to rate-limit signals from the inference server:
 
 - **During optional startup ramp**: when `startup_ramp_seconds` is greater than 0, a new request resource starts at one concurrent request and increases linearly toward `max_parallel_requests` over that duration.
@@ -152,6 +168,53 @@ designer.set_run_config(run_config)
 
 ---
 
+### Row-Group Admission (RunConfig)
+
+Controls how many async row groups can be active at once. A row group contains
+`buffer_size` records, so this setting is the scheduler horizon above the batch
+size: a wider horizon can expose more ready model work to fast endpoints, while
+a smaller horizon tends to checkpoint completed records earlier and hold less
+active state.
+
+```python
+import data_designer.config as dd
+from data_designer.interface import DataDesigner
+
+run_config = dd.RunConfig(
+    buffer_size=1000,
+    max_in_flight_tasks=4096,
+    row_group_admission=dd.RowGroupAdmissionConfig(
+        mode="adaptive",
+        max_concurrent_row_groups=8,
+        adaptive_initial_target=2,
+        max_admitted_rows=16_000,
+    ),
+)
+
+designer = DataDesigner()
+designer.set_run_config(run_config)
+```
+
+| Parameter | Default | Effect |
+|-----------|---------|--------|
+| `mode` | `fixed` | `fixed` admits up to the hard cap immediately; `adaptive` starts lower and raises the target when scheduler pressure shows that more ready work can be useful. |
+| `max_concurrent_row_groups` | 3 | Hard cap on active row groups. Maximum is 64. |
+| `adaptive_initial_target` | 1 in adaptive mode | Initial soft target before adaptive additive ramp-up. |
+| `max_admitted_rows` | Engine-derived for adaptive mode and widened fixed horizons; unset for the default fixed horizon | Optional guardrail on total records held across active row groups. When omitted in adaptive mode, or in fixed mode with `max_concurrent_row_groups > 3`, the engine derives `max(max_concurrent_row_groups * buffer_size, 8192)`. That value is capped by the requested target record count when available (or by scheduled rows for direct scheduler plans) and by a 1,000,000-row ceiling. Derived guards require `buffer_size` at or below that ceiling. Explicit values must be at least `buffer_size` and at most 1,000,000. |
+
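The derivation described for `max_admitted_rows` in the table above can be sketched as follows. This is one reading of the table text, not the engine's code: `derive_max_admitted_rows` and its `target_records` parameter are hypothetical names, and `target_records` stands in for whichever count is available (the requested target record count, or scheduled rows for direct scheduler plans).

```python
from typing import Optional

DEFAULT_FIXED_ROW_GROUPS = 3  # default max_concurrent_row_groups from the table
MIN_DERIVED_GUARD = 8192      # floor named in the table
ROW_CEILING = 1_000_000       # ceiling named in the table


def derive_max_admitted_rows(
    mode: str,
    max_concurrent_row_groups: int,
    buffer_size: int,
    target_records: Optional[int] = None,
) -> Optional[int]:
    """Illustrative reading of the derived guard when max_admitted_rows is omitted."""
    if mode == "fixed" and max_concurrent_row_groups <= DEFAULT_FIXED_ROW_GROUPS:
        return None  # default fixed horizon: row-group count only, no row guard
    if buffer_size > ROW_CEILING:
        raise ValueError("derived guards require buffer_size <= 1,000,000")
    guard = max(max_concurrent_row_groups * buffer_size, MIN_DERIVED_GUARD)
    if target_records is not None:
        guard = min(guard, target_records)  # never exceed the rows that exist
    return min(guard, ROW_CEILING)


floor_case = derive_max_admitted_rows("adaptive", 8, 1000)         # floor wins
small_run = derive_max_admitted_rows("adaptive", 8, 1000, 5000)    # target caps it
default_fixed = derive_max_admitted_rows("fixed", 3, 1000)         # no guard
wide_fixed = derive_max_admitted_rows("fixed", 16, 100_000)        # ceiling caps it
print(floor_case, small_run, default_fixed, wide_fixed)
```

With `buffer_size=1000` and eight row groups, the product (8,000) falls below the 8,192 floor, so under this reading the floor sets the derived guard unless the run itself is smaller.
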
+**When to use fixed mode**: You want predictable checkpoint cadence, lower
+active memory, or easier debugging.
+
+**When to use adaptive mode**: Large async DAGs, fan-out/fan-in flows, mixed
+latency columns, or high-capacity endpoints where the default horizon leaves
+capacity idle.
+
+Async scheduler telemetry includes the effective mode, active row-group target,
+observed maximum target, active row-group count, max admitted rows, and blocked
+reasons when scheduler event instrumentation is enabled.
+
+---
+
 ### `max_parallel_requests` (InferenceParams)
 
 Sets the **maximum** concurrent LLM API calls **per model**. This is the ceiling that adaptive request admission can ramp up to — the actual concurrency at runtime may be lower if the server signals rate limits.
@@ -319,7 +382,7 @@ DATA_DESIGNER_ASYNC_ENGINE=0 python my_pipeline.py
 | Problem | Symptom | Solution |
 |---------|---------|----------|
 | **Low throughput** | Low GPU utilization | Increase `max_parallel_requests` and/or `buffer_size`. If request admission has self-reduced due to earlier 429s (check logs for "concurrency reduced" messages), the server may need more capacity or you can wait for AIMD recovery. |
-| **Frequent 429 → recovery cycles** | Logs show repeated concurrency drops and ramp-ups | The `max_parallel_requests` ceiling is above the server's sustained capacity. This is handled automatically, but you can lower the ceiling to reduce the sawtooth. |
+| **Frequent 429 → recovery cycles** | Logs show repeated concurrency drops and ramp-ups | The `max_parallel_requests` ceiling is above the server's sustained capacity. This is handled automatically, but you can lower the ceiling to reduce the sawtooth or tune `request_admission` with `RequestAdmissionTuningConfig`. |
 | **Long tail of slow generations** | Most records fast, few very slow | Reduce `max_conversation_restarts`, simplify schemas, improve prompts |
 | **Multi-model idle periods** | One model busy, others idle | Reduce `buffer_size` for faster cycling, or consolidate models |
 | **Memory errors** | OOM crashes | Reduce `buffer_size` and `max_parallel_requests` |