| Goal | Recommended tool |
|---|---|
| Diagnose performance issues (load imbalance, stragglers, scheduling overhead, idle time) | tf::DigestObserver (see below) |
| Visualize task timeline (which task ran when, on which worker) | tf::ChromeObserver → chrome://tracing |
| Drive a custom analysis (your own metrics) | Subclass tf::ObserverInterface |
A custom observer that produces a ~2 KB structured digest in O(1) memory (vs ~450 MB for
TFProfObserver at 5M tasks). All key diagnostic thresholds are calibrated per run from
the observed gaps and timings — no hardcoded magic numbers.
```cpp
#include "digest_observer.hpp"  // single-header drop-in

auto obs = executor.make_observer<tf::DigestObserver>();
executor.run(taskflow).wait();
obs->digest(std::cerr);  // ~2 KB structured diagnostic to stderr
```

The output covers task duration histogram, parallel efficiency (with calibrated scheduling
cost), parallelism shape (run-window-clipped bucket widths), per-worker load balance,
macro-idle vs micro-gap breakdown (with adaptive long-gap threshold), top-K outliers
(vs trimmed mean), and an auto-diagnosis classifying one of seven patterns
(SCHEDULING_OVERHEAD, SERIAL_BOTTLENECK, LOAD_IMBALANCE, STRAGGLER_TASKS,
RECURSIVE_OVERLAP, UNDER_DECOMPOSITION, HEALTHY).
For multi-phase apps (where you want to profile only one phase, not parsing/setup),
attach lazily and clear() between phases:
```cpp
static std::shared_ptr<tf::DigestObserver> obs;
if (!obs) obs = executor.make_observer<tf::DigestObserver>();
obs->clear();
executor.run(taskflow).wait();
obs->digest(std::cerr);
```

See ../INTEGRATION.md for the integration guide and
../field-notes/opentimer-granularity-experiments.md
for a worked case study showing the profile→hypothesize→experiment→re-profile loop.
```bash
TF_ENABLE_PROFILER=trace.json ./my_program
```

Generates a JSON file viewable with TFProf or chrome://tracing. Useful when you need
the full timeline rather than a summary; expect 300+ MB of output at million-task scale.
```cpp
auto observer = executor.make_observer<tf::ChromeObserver>();
executor.run(taskflow).wait();
std::ofstream ofs("trace.json");
observer->dump(ofs);
```

Open in chrome://tracing or edge://tracing for a visual timeline.
```cpp
struct MyObserver : public tf::ObserverInterface {
  void set_up(size_t num_workers) override final { }
  void on_entry(tf::WorkerView wv, tf::TaskView tv) override final { }
  void on_exit(tf::WorkerView wv, tf::TaskView tv) override final { }
};

auto obs = executor.make_observer<MyObserver>();
```

```cpp
taskflow.dump(std::cout);  // prints DOT to stdout
// Pipe to: ./my_program | dot -Tpng -o graph.png
```

tf::Executor construction spawns N OS threads, initializes per-worker work-stealing
queues, and sets up internal scheduling state. Destruction joins all threads. This is a
meaningful cost — on the order of 100µs+ per thread — that can dominate short computations
when repeated.
```cpp
// If the function is called repeatedly, reuse the executor:
void compute(tf::Executor& executor, const Data& data) {
  tf::Taskflow taskflow;
  // ... build graph using data ...
  executor.run(taskflow).wait();
}

std::chrono::microseconds measure(tf::Executor& executor, const Data& data) {
  auto t0 = std::chrono::high_resolution_clock::now();
  compute(executor, data);
  auto t1 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0);
}
```

Choose the worker count at the call site and construct one executor per configuration you want to measure or reuse.
A fresh tf::Taskflow is lightweight to construct — it's just an empty graph. The executor
is the expensive part. When your workload naturally runs once (a batch job, a one-shot
computation), a local executor is perfectly fine. A static local executor is only appropriate
when the worker count is fixed for the lifetime of the process.
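Where the worker count truly is fixed for the process, one way to amortize thread spawning across repeated calls is a function-local static executor. A minimal sketch, assuming that condition holds (shared_executor, compute_once, Data, and the count of 8 are illustrative, not part of the Taskflow API):

```cpp
// Sketch: one process-wide executor, constructed on first use.
tf::Executor& shared_executor() {
  static tf::Executor executor(8);  // worker count fixed for the process lifetime
  return executor;
}

void compute_once(const Data& data) {
  tf::Taskflow taskflow;            // cheap: just an empty graph
  // ... build graph using data ...
  shared_executor().run(taskflow).wait();
}
```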
Control how parallel algorithms divide work among workers. Pass as the last argument to
for_each_index, for_each_by_index, reduce_by_index, and other parallel algorithms.
| Partitioner | Strategy | Best For |
|---|---|---|
| tf::GuidedPartitioner | Adaptive chunk sizes (large→small) | Default, reasonable for most cases |
| tf::StaticPartitioner | Equal chunks, assigned at start | Uniform workloads (same cost per element) |
| tf::DynamicPartitioner | Fixed chunks, assigned on demand | Irregular/unpredictable work per element |
| tf::RandomPartitioner | Random chunk sizes | Load-balancing experiments |
Default is GuidedPartitioner if none specified.
All partitioners accept an optional chunk size in their constructor:
```cpp
tf::DynamicPartitioner(1)   // chunk size 1 — finest granularity, best for highly irregular work
tf::DynamicPartitioner(64)  // chunk size 64 — less scheduling overhead, good default
tf::StaticPartitioner()     // equal division, zero scheduling overhead at runtime
```

Choosing the right partitioner (see the usage sketch after this list):
- When every element takes roughly the same time (e.g., Black-Scholes pricing, uniform matrix operations), StaticPartitioner avoids dynamic scheduling overhead entirely.
- When cost varies significantly per element (e.g., the Mandelbrot set — boundary pixels iterate far more than interior ones — or primality checking, where large primes are more expensive), DynamicPartitioner with a small chunk size ensures no worker is stuck with a disproportionately expensive chunk while others are idle.
- When unsure, the default GuidedPartitioner adapts reasonably. If profiling shows load imbalance (some workers finishing much earlier than others), try switching to DynamicPartitioner.
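For illustration, a minimal sketch of passing the partitioner as the last argument of the algorithm; N, out, price_option, and iterate_mandelbrot are placeholders for your own data and per-element work:

```cpp
std::vector<double> out(N);

// Uniform cost per element: static partitioning, no runtime scheduling.
taskflow.for_each_index(size_t{0}, N, size_t{1}, [&](size_t i) {
  out[i] = price_option(i);          // roughly constant cost per index
}, tf::StaticPartitioner());

// Highly irregular cost per element: small chunks handed out on demand.
taskflow.for_each_index(size_t{0}, N, size_t{1}, [&](size_t i) {
  out[i] = iterate_mandelbrot(i);    // cost varies wildly per index
}, tf::DynamicPartitioner(1));
```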
Each task should represent at least 10-100 microseconds of work. Millions of trivially
small tasks produce scheduling overhead that exceeds the work. Use for_each_index or
for_each_by_index to let Taskflow partition automatically, or use reduce_by_index for
reductions — these APIs handle chunking so you don't need to manually batch elements.
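For illustration, a sketch contrasting one-task-per-element with a single partitioned loop over the same range; N and process_element are placeholders:

```cpp
// Anti-pattern: one task per element. Millions of sub-microsecond tasks,
// each paying more in emplace() and scheduling cost than in useful work.
for (size_t i = 0; i < N; ++i) {
  taskflow.emplace([&, i] { process_element(i); });
}

// Better: a single parallel-loop task; Taskflow chunks the index range per worker.
taskflow.for_each_index(size_t{0}, N, size_t{1}, [&](size_t i) {
  process_element(i);
});
```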
When multiple workers update a shared variable (counters, accumulators), contention on the underlying cache line can dominate runtime at high core counts. Common patterns and their alternatives:
| Pattern | Overhead | Better Alternative |
|---|---|---|
| atomic::fetch_add per element in for_each_index | Cache-line bouncing, serializes at scale | reduce_by_index with thread-local accumulation |
| Mutex-protected counter | Lock contention | reduce_by_index or atomic with coarser granularity |
| atomic with low update frequency | Minimal | Often acceptable as-is |
The key insight: if you're updating a shared variable once per element in a parallel loop,
you're likely paying more for contention than you're gaining from parallelism — especially
at 32+ cores. reduce_by_index keeps accumulation thread-local and only merges at the end.
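As a sketch of that pattern, assuming the reduce_by_index overload that takes an index range, a local accumulation callable (receiving a subrange and an optional running partial), a global merge operator, and an optional partitioner; check the exact signature for your Taskflow version. data, N, and total are placeholders:

```cpp
double total = 0.0;

taskflow.reduce_by_index(
  // index range to reduce over
  tf::IndexRange<size_t>(0, N, 1),
  // destination of the final merged result
  total,
  // local reducer: accumulates into a worker-local partial, chunk by chunk
  [&](tf::IndexRange<size_t> subrange, std::optional<double> running) {
    double partial = running ? *running : 0.0;
    for (size_t i = subrange.begin(); i < subrange.end(); i += subrange.step_size()) {
      partial += data[i];   // no shared-variable update per element
    }
    return partial;
  },
  // global reducer: merges the per-worker partials once at the end
  std::plus<double>(),
  tf::StaticPartitioner()
);
```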
```cpp
tf::Executor executor;     // default: std::thread::hardware_concurrency()
tf::Executor executor(4);  // explicit: 4 workers
```

For CPU-bound work, the default (hardware concurrency) is usually optimal. For mixed CPU/IO, you may benefit from more threads.
Multi-socket NUMA systems — pin first, scale later. Default Linux scheduling lets Taskflow worker threads roam across both sockets, which causes:
- Cross-socket work-stealing — workers steal from queues on the remote socket, then fault remote-memory cache lines on every access (typically 2–4× the latency of local).
- First-touch allocations land arbitrarily — depending on which socket each thread first happens to touch a buffer, the data ends up local for some workers and remote for others.
- Apparent "scaling wall" at the single-socket core count — worker counts above the per-socket core count look like they regress, when in fact the issue is cross-socket memory traffic, not the scheduler.
Fix: pin the executor and its allocations to one NUMA node before benchmarking. On Linux this is one wrapper command, no code change:
```bash
# Pin to NUMA node 0 only (CPUs and memory)
numactl --cpunodebind=0 --membind=0 ./my_program

# Or, if numactl is unavailable, use taskset (CPU pinning + first-touch allocation
# achieves equivalent placement when threads stay on these CPUs):
taskset -c 0-19,40-59 ./my_program
```

In a measured OpenTimer benchmark on a 2-socket, 20-core/socket machine, NUMA pinning gave a 2–4× wall-time reduction at 32 workers, with no other change. This often dwarfs anything you'll get from task-graph or partitioner tuning, so do it first.
Cap workers at physical cores within the pinned node when possible. Above the physical-core count of one socket, you're forcing SMT siblings to share L1/L2/execution units. In the same OpenTimer measurement, c=16 (16 physical cores, no SMT) consistently beat c=32 (16 physical + 12 SMT, all on one socket) by ~10% wall — adding the SMT threads was a regression even though aggregate "CPU count" doubled.
Quick checklist before drawing any conclusion from a Taskflow benchmark:
- Pinned to one NUMA node?
- Worker count ≤ physical cores in that node?
- Multiple runs averaged? Single-run wall time is noisy at the few-percent level.
DAG construction is single-threaded regardless of worker count — only execution benefits from additional workers.
When building explicit task graphs, construction cost can dominate with fine-grained tasks:
| Technique | Impact |
|---|---|
| Flat vector for dependency lookup | O(1) vs O(log n) map; avoids pointer-chasing cache misses |
| Buffer-based dependency APIs | Eliminates per-query heap allocation |
| Fixed-size lambda captures | Eliminates per-task heap allocation |
| Minimize sync barriers | Monolithic DAG = 1 barrier vs batched = N barriers |
| Single dependency query per task | Avoid redundant work |
Measurement: Run with an empty kernel (tasks do no work). The elapsed time is your pure scheduling + DAG construction overhead — the floor below which no task-level optimization helps. Run at your target worker count, not just single-threaded.
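A minimal sketch of that measurement, with num_tasks as a placeholder and no dependencies between the empty tasks (add precede() calls if you want to mirror your real graph's shape):

```cpp
// Empty-kernel graph: elapsed time is pure construction + scheduling cost.
tf::Executor executor(16);   // measure at your target worker count
tf::Taskflow taskflow;

auto c0 = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < num_tasks; ++i) {
  taskflow.emplace([] { /* empty kernel: no work */ });
}
auto c1 = std::chrono::high_resolution_clock::now();  // c1 - c0: DAG construction floor

auto r0 = std::chrono::high_resolution_clock::now();
executor.run(taskflow).wait();
auto r1 = std::chrono::high_resolution_clock::now();  // r1 - r0: pure scheduling floor
```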
Understanding what happens inside emplace() helps you make informed optimization decisions.
What taskflow.emplace(lambda) does internally:
- Heap allocation — new Node(...) allocates a graph node (~160 bytes).
- Type erasure — the lambda is stored via std::function<void()>. The SBO (small-buffer optimization) capacity is only 16 bytes on libstdc++. Most realistic captures exceed this, triggering a second heap allocation. Even when SBO is missed, capture size still matters — copying 240 bytes per task is measurably slower than 48 bytes due to memcpy cost and cache pollution during single-threaded construction.
- Edge storage — each precede()/succeed() pushes a Node* into a SmallVector<Node*, 4>. For ≤4 edges, storage is inline (zero allocation); beyond 4, it spills to the heap.
Result: ~150-250ns per emplace() call, dominated by the heap allocation.
Decision table:
| Design decision | Per-task cost | Impact |
|---|---|---|
| std::map vs flat vector for slot lookup | ~50-80ns (O(log n) pointer chases) | Significant — prefer flat vectors |
| Copying struct with vectors per task | ~30-100ns (deep copy = heap allocs) | Significant — use pointers/references |
| shared_ptr capture per task | ~20-40ns (atomic ops + extra alloc) | Moderate — prefer raw ptrs when feasible |
| Lambda capture 240B vs 48B | ~10-30ns (memcpy + cache pressure) | Moderate — target ≤48 bytes per capture |
| Lazy init/resize inside lambda | ~5-20ns (branch + possible alloc) | Moderate — do all init outside the lambda |
| Flat buffer vs vector-of-vectors | ~1ns | Negligible — choose clearer code |
| Rolling window vs full task handle array | ~0ns (handles used only at construction) | Negligible for speed; helps memory |
| 3D handles[g][t][x] vs flat [g][t*W+x] | ~1ns | Negligible — choose clearer code |
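For illustration, a sketch of the two "Significant" rows from the table: capture a pointer to long-lived data instead of copying a container-holding struct into every task. WorkItem, num_items, and process are placeholders:

```cpp
struct WorkItem {
  std::vector<double> samples;  // heavy payload
  double result = 0.0;
};
std::vector<WorkItem> items(num_items);

// Costly: deep-copies the struct (and its vector) into every capture,
// and the capture size spills past std::function's small-buffer optimization.
for (auto& item : items) {
  taskflow.emplace([item] { /* works on a private copy */ });
}

// Cheaper: capture an 8-byte pointer to data that outlives executor.run().
for (auto& item : items) {
  taskflow.emplace([p = &item] { p->result = process(p->samples); });
}
```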
Note on this_worker_id(): This method does an unordered_map::find() on
std::this_thread::get_id() — a hash lookup, not a thread_local read. It's fast enough
for per-task use, but the real performance difference between scratch strategies comes from
whether initialization happens inside or outside the lambda, not the lookup mechanism itself.
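A sketch of that pattern: allocate per-worker scratch before run(), then only index it by this_worker_id() inside the task. scratch_size, N, and compute_into are placeholders:

```cpp
tf::Executor executor;
tf::Taskflow taskflow;

// Allocate one scratch buffer per worker up front, outside any task body.
std::vector<std::vector<double>> scratch(
    executor.num_workers(), std::vector<double>(scratch_size));

taskflow.for_each_index(size_t{0}, N, size_t{1}, [&](size_t i) {
  // Inside the task: only an index lookup; no allocation, no lazy resize.
  auto& local = scratch[executor.this_worker_id()];
  compute_into(local, i);
});

executor.run(taskflow).wait();
```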
The emplace() + precede() loop is inherently serial — each task's dependencies reference
previously-created tasks. Only the executor.run() phase is parallel. On multi-socket systems:
- Construction time is constant regardless of worker count
- Execution time decreases with more workers (with NUMA-crossing penalty at socket boundary)
- At very fine-grained tasks with many workers, construction can dominate total runtime
tf::Task is just a Node* pointer (8 bytes). At 5M tasks, a full task handle array is
~40MB — fits in L3 cache on server CPUs with sequential access patterns handled by hardware
prefetchers. Rolling windows save memory but don't improve construction speed.
Pass via -D<MACRO>:
| Macro | Default | Effect |
|---|---|---|
| TF_ENABLE_ATOMIC_NOTIFIER | not defined | Use AtomicNotifier instead of default |
| TF_DISABLE_EXCEPTION_HANDLING | not defined | Remove try/catch from scheduler loop |
| TF_ENABLE_TASK_POOL | not defined | Enable poolable node allocation |
| TF_DEFAULT_BOUNDED_TASK_QUEUE_LOG_SIZE | 8 (256 slots) | Log2 per-worker queue capacity |
| TF_DEFAULT_UNBOUNDED_TASK_QUEUE_LOG_SIZE | 10 (1024 slots) | Log2 initial overflow queue capacity |
| TF_CACHELINE_SIZE | 64 (128 on ARM64) | Cache-line size for alignment |