| Goal | Recommended tool |
|---|---|
| Diagnose performance issues (load imbalance, stragglers, scheduling overhead, idle time) | tf::DigestObserver (see below) |
| Visualize task timeline (which task ran when, on which worker) | tf::ChromeObserver → chrome://tracing |
| Drive a custom analysis (your own metrics) | Subclass tf::ObserverInterface |
A custom observer that produces a ~2 KB structured digest in O(1) memory (vs ~450 MB for
TFProfObserver at 5M tasks). All key diagnostic thresholds are calibrated per run from
the observed gaps and timings — no hardcoded magic numbers.
```cpp
#include "digest_observer.hpp"  // single-header drop-in

auto obs = executor.make_observer<tf::DigestObserver>();
executor.run(taskflow).wait();
obs->digest(std::cerr);  // ~2 KB structured diagnostic to stderr
```

The output covers task duration histogram, parallel efficiency (with calibrated scheduling
cost), parallelism shape (run-window-clipped bucket widths), per-worker load balance,
macro-idle vs micro-gap breakdown (with adaptive long-gap threshold), top-K outliers
(vs trimmed mean), and an auto-diagnosis classifying one of seven patterns
(SCHEDULING_OVERHEAD, SERIAL_BOTTLENECK, LOAD_IMBALANCE, STRAGGLER_TASKS,
RECURSIVE_OVERLAP, UNDER_DECOMPOSITION, HEALTHY).
For multi-phase apps (where you want to profile only one phase, not parsing/setup),
attach lazily and clear() between phases:
```cpp
static std::shared_ptr<tf::DigestObserver> obs;
if (!obs) obs = executor.make_observer<tf::DigestObserver>();
obs->clear();
executor.run(taskflow).wait();
obs->digest(std::cerr);
```

See ../INTEGRATION.md for the integration guide and
../field-notes/opentimer-granularity-experiments.md
for a worked case study showing the profile→hypothesize→experiment→re-profile loop.
```bash
TF_ENABLE_PROFILER=trace.json ./my_program
```

Generates a JSON file viewable with TFProf or chrome://tracing. Useful when you need
the full timeline rather than a summary; expect 300+ MB of output at million-task scale.
```cpp
auto observer = executor.make_observer<tf::ChromeObserver>();
executor.run(taskflow).wait();
std::ofstream ofs("trace.json");
observer->dump(ofs);
```

Open in chrome://tracing or edge://tracing for a visual timeline.
```cpp
struct MyObserver : public tf::ObserverInterface {
  void set_up(size_t num_workers) override final { }
  void on_entry(tf::WorkerView wv, tf::TaskView tv) override final { }
  void on_exit(tf::WorkerView wv, tf::TaskView tv) override final { }
};

auto obs = executor.make_observer<MyObserver>();
```

```cpp
taskflow.dump(std::cout);  // prints DOT to stdout
// Pipe to: ./my_program | dot -Tpng -o graph.png
```

tf::Executor construction spawns N OS threads, initializes per-worker work-stealing
queues, and sets up internal scheduling state. Destruction joins all threads. This is a
meaningful cost — on the order of 100µs+ per thread — that can dominate short computations
when repeated.
```cpp
// If the function is called repeatedly, reuse the executor:
void compute(tf::Executor& executor, const Data& data) {
  tf::Taskflow taskflow;
  // ... build graph using data ...
  executor.run(taskflow).wait();
}

std::chrono::microseconds measure(tf::Executor& executor, const Data& data) {
  auto t0 = std::chrono::high_resolution_clock::now();
  compute(executor, data);
  auto t1 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0);
}
```

Choose the worker count at the call site and construct one executor per configuration you want to measure or reuse.
A fresh tf::Taskflow is lightweight to construct — it's just an empty graph. The executor
is the expensive part. When your workload naturally runs once (a batch job, a one-shot
computation), a local executor is perfectly fine. A static local executor is only appropriate
when the worker count is fixed for the lifetime of the process.
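Where the worker count truly is fixed for the process, one way to amortize thread spawning across repeated calls is a function-local static executor. A minimal sketch, assuming that condition holds (shared_executor, compute_once, Data, and the count of 8 are illustrative, not part of the Taskflow API):

```cpp
// Sketch: one process-wide executor, constructed on first use.
tf::Executor& shared_executor() {
  static tf::Executor executor(8);  // worker count fixed for the process lifetime
  return executor;
}

void compute_once(const Data& data) {
  tf::Taskflow taskflow;            // cheap: just an empty graph
  // ... build graph using data ...
  shared_executor().run(taskflow).wait();
}
```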
Control how parallel algorithms divide work among workers. Pass as the last argument to
for_each_index, for_each_by_index, reduce_by_index, and other parallel algorithms.
| Partitioner | Strategy | Best For |
|---|---|---|
| tf::GuidedPartitioner | Adaptive chunk sizes (large→small) | Default, reasonable for most cases |
| tf::StaticPartitioner | Equal chunks, assigned at start | Uniform workloads (same cost per element) |
| tf::DynamicPartitioner | Fixed chunks, assigned on demand | Irregular/unpredictable work per element |
| tf::RandomPartitioner | Random chunk sizes | Load-balancing experiments |
Default is GuidedPartitioner if none specified.
All partitioners accept an optional chunk size in their constructor:
```cpp
tf::DynamicPartitioner(1)   // chunk size 1 — finest granularity, best for highly irregular work
tf::DynamicPartitioner(64)  // chunk size 64 — less scheduling overhead, good default
tf::StaticPartitioner()     // equal division, zero scheduling overhead at runtime
```

Choosing the right partitioner (see the usage sketch after this list):
- When every element takes roughly the same time (e.g., Black-Scholes pricing, uniform matrix operations), StaticPartitioner avoids dynamic scheduling overhead entirely.
- When cost varies significantly per element (e.g., the Mandelbrot set — boundary pixels iterate far more than interior ones — or primality checking, where large primes are more expensive), DynamicPartitioner with a small chunk size ensures no worker is stuck with a disproportionately expensive chunk while others are idle.
- When unsure, the default GuidedPartitioner adapts reasonably. If profiling shows load imbalance (some workers finishing much earlier than others), try switching to DynamicPartitioner.
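For illustration, a minimal sketch of passing the partitioner as the last argument of the algorithm; N, out, price_option, and iterate_mandelbrot are placeholders for your own data and per-element work:

```cpp
std::vector<double> out(N);

// Uniform cost per element: static partitioning, no runtime scheduling.
taskflow.for_each_index(size_t{0}, N, size_t{1}, [&](size_t i) {
  out[i] = price_option(i);          // roughly constant cost per index
}, tf::StaticPartitioner());

// Highly irregular cost per element: small chunks handed out on demand.
taskflow.for_each_index(size_t{0}, N, size_t{1}, [&](size_t i) {
  out[i] = iterate_mandelbrot(i);    // cost varies wildly per index
}, tf::DynamicPartitioner(1));
```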
Each task should represent at least 10-100 microseconds of work. Millions of trivially
small tasks produce scheduling overhead that exceeds the work. Use for_each_index or
for_each_by_index to let Taskflow partition automatically, or use reduce_by_index for
reductions — these APIs handle chunking so you don't need to manually batch elements.
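For illustration, a sketch contrasting one-task-per-element with a single partitioned loop over the same range; N and process_element are placeholders:

```cpp
// Anti-pattern: one task per element. Millions of sub-microsecond tasks,
// each paying more in emplace() and scheduling cost than in useful work.
for (size_t i = 0; i < N; ++i) {
  taskflow.emplace([&, i] { process_element(i); });
}

// Better: a single parallel-loop task; Taskflow chunks the index range per worker.
taskflow.for_each_index(size_t{0}, N, size_t{1}, [&](size_t i) {
  process_element(i);
});
```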
When multiple workers update a shared variable (counters, accumulators), contention on the underlying cache line can dominate runtime at high core counts. Common patterns and their alternatives:
| Pattern | Overhead | Better Alternative |
|---|---|---|
| atomic::fetch_add per element in for_each_index | Cache-line bouncing, serializes at scale | reduce_by_index with thread-local accumulation |
| Mutex-protected counter | Lock contention | reduce_by_index or atomic with coarser granularity |
| atomic with low update frequency | Minimal | Often acceptable as-is |
The key insight: if you're updating a shared variable once per element in a parallel loop,
you're likely paying more for contention than you're gaining from parallelism — especially
at 32+ cores. reduce_by_index keeps accumulation thread-local and only merges at the end.
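As a sketch of that pattern, assuming the reduce_by_index overload that takes an index range, a local accumulation callable (receiving a subrange and an optional running partial), a global merge operator, and an optional partitioner; check the exact signature for your Taskflow version. data, N, and total are placeholders:

```cpp
double total = 0.0;

taskflow.reduce_by_index(
  // index range to reduce over
  tf::IndexRange<size_t>(0, N, 1),
  // destination of the final merged result
  total,
  // local reducer: accumulates into a worker-local partial, chunk by chunk
  [&](tf::IndexRange<size_t> subrange, std::optional<double> running) {
    double partial = running ? *running : 0.0;
    for (size_t i = subrange.begin(); i < subrange.end(); i += subrange.step_size()) {
      partial += data[i];   // no shared-variable update per element
    }
    return partial;
  },
  // global reducer: merges the per-worker partials once at the end
  std::plus<double>(),
  tf::StaticPartitioner()
);
```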
```cpp
tf::Executor executor;     // default: std::thread::hardware_concurrency()
tf::Executor executor(4);  // explicit: 4 workers
```

For CPU-bound work, the default (hardware concurrency) is usually optimal. For mixed CPU/IO, you may benefit from more threads.
Multi-socket NUMA systems — pin first, scale later. Default Linux scheduling lets Taskflow worker threads roam across both sockets, which causes:
- Cross-socket work-stealing — workers steal from queues on the remote socket, then fault remote-memory cache lines on every access (typically 2–4× the latency of local).
- First-touch allocations land arbitrarily — depending on which socket each thread first happens to touch a buffer, the data ends up local for some workers and remote for others.
- Apparent "scaling wall" at the single-socket core count — worker counts above the per-socket core count look like they regress, when in fact the issue is cross-socket memory traffic, not the scheduler.
Fix: pin the executor and its allocations to one NUMA node before benchmarking. On Linux this is one wrapper command, no code change:
```bash
# Pin to NUMA node 0 only (CPUs and memory)
numactl --cpunodebind=0 --membind=0 ./my_program

# Or, if numactl is unavailable, use taskset (CPU pinning + first-touch allocation
# achieves equivalent placement when threads stay on these CPUs):
taskset -c 0-19,40-59 ./my_program
```

In a measured OpenTimer benchmark on a 2-socket, 20-core/socket machine, NUMA pinning gave a 2–4× wall-time reduction at 32 workers, with no other change. This often dwarfs anything you'll get from task-graph or partitioner tuning, so do it first.
Cap workers at physical cores within the pinned node when possible. Above the physical-core count of one socket, you're forcing SMT siblings to share L1/L2/execution units. In the same OpenTimer measurement, c=16 (16 physical cores, no SMT) consistently beat c=32 (16 physical + 12 SMT, all on one socket) by ~10% wall — adding the SMT threads was a regression even though aggregate "CPU count" doubled.
Quick checklist before drawing any conclusion from a Taskflow benchmark:
- Pinned to one NUMA node?
- Worker count ≤ physical cores in that node?
- Multiple runs averaged? Single-run wall time is noisy at the few-percent level.
DAG construction is single-threaded regardless of worker count — only execution benefits from additional workers.
When building explicit task graphs, construction cost can dominate with fine-grained tasks:
| Technique | Impact |
|---|---|
| Flat vector for dependency lookup | O(1) vs O(log n) map; avoids pointer-chasing cache misses |
| Buffer-based dependency APIs | Eliminates per-query heap allocation |
| Fixed-size lambda captures | Eliminates per-task heap allocation |
| Minimize sync barriers | Monolithic DAG = 1 barrier vs batched = N barriers |
| Single dependency query per task | Avoid redundant work |
Measurement: Run with an empty kernel (tasks do no work). The elapsed time is your pure scheduling + DAG construction overhead — the floor below which no task-level optimization helps. Run at your target worker count, not just single-threaded.
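A minimal sketch of that measurement, with num_tasks as a placeholder and no dependencies between the empty tasks (add precede() calls if you want to mirror your real graph's shape):

```cpp
// Empty-kernel graph: elapsed time is pure construction + scheduling cost.
tf::Executor executor(16);   // measure at your target worker count
tf::Taskflow taskflow;

auto c0 = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < num_tasks; ++i) {
  taskflow.emplace([] { /* empty kernel: no work */ });
}
auto c1 = std::chrono::high_resolution_clock::now();  // c1 - c0: DAG construction floor

auto r0 = std::chrono::high_resolution_clock::now();
executor.run(taskflow).wait();
auto r1 = std::chrono::high_resolution_clock::now();  // r1 - r0: pure scheduling floor
```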
Understanding what happens inside emplace() helps you make informed optimization decisions.
What taskflow.emplace(lambda) does internally:
- Heap allocation — new Node(...) allocates a graph node (~160 bytes).
- Type erasure — the lambda is stored via std::function<void()>. The SBO (small-buffer optimization) capacity is only 16 bytes on libstdc++. Most realistic captures exceed this, triggering a second heap allocation. Even when SBO is missed, capture size still matters — copying 240 bytes per task is measurably slower than 48 bytes due to memcpy cost and cache pollution during single-threaded construction.
- Edge storage — each precede()/succeed() pushes a Node* into a SmallVector<Node*, 4>. For ≤4 edges, storage is inline (zero allocation); beyond 4, it spills to the heap.
Result: ~150-250ns per emplace() call, dominated by the heap allocation.
Decision table:
| Design decision | Per-task cost | Impact |
|---|---|---|
| std::map vs flat vector for slot lookup | ~50-80ns (O(log n) pointer chases) | Significant — prefer flat vectors |
| Copying struct with vectors per task | ~30-100ns (deep copy = heap allocs) | Significant — use pointers/references |
| shared_ptr capture per task | ~20-40ns (atomic ops + extra alloc) | Moderate — prefer raw ptrs when feasible |
| Lambda capture 240B vs 48B | ~10-30ns (memcpy + cache pressure) | Moderate — target ≤48 bytes per capture |
| Lazy init/resize inside lambda | ~5-20ns (branch + possible alloc) | Moderate — do all init outside the lambda |
| Flat buffer vs vector-of-vectors | ~1ns | Negligible — choose clearer code |
| Rolling window vs full task handle array | ~0ns (handles used only at construction) | Negligible for speed; helps memory |
| 3D handles[g][t][x] vs flat [g][t*W+x] | ~1ns | Negligible — choose clearer code |
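For illustration, a sketch of the two "Significant" rows from the table: capture a pointer to long-lived data instead of copying a container-holding struct into every task. WorkItem, num_items, and process are placeholders:

```cpp
struct WorkItem {
  std::vector<double> samples;  // heavy payload
  double result = 0.0;
};
std::vector<WorkItem> items(num_items);

// Costly: deep-copies the struct (and its vector) into every capture,
// and the capture size spills past std::function's small-buffer optimization.
for (auto& item : items) {
  taskflow.emplace([item] { /* works on a private copy */ });
}

// Cheaper: capture an 8-byte pointer to data that outlives executor.run().
for (auto& item : items) {
  taskflow.emplace([p = &item] { p->result = process(p->samples); });
}
```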
Note on this_worker_id(): This method does an unordered_map::find() on
std::this_thread::get_id() — a hash lookup, not a thread_local read. It's fast enough
for per-task use, but the real performance difference between scratch strategies comes from
whether initialization happens inside or outside the lambda, not the lookup mechanism itself.
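A sketch of that pattern: allocate per-worker scratch before run(), then only index it by this_worker_id() inside the task. scratch_size, N, and compute_into are placeholders:

```cpp
tf::Executor executor;
tf::Taskflow taskflow;

// Allocate one scratch buffer per worker up front, outside any task body.
std::vector<std::vector<double>> scratch(
    executor.num_workers(), std::vector<double>(scratch_size));

taskflow.for_each_index(size_t{0}, N, size_t{1}, [&](size_t i) {
  // Inside the task: only an index lookup; no allocation, no lazy resize.
  auto& local = scratch[executor.this_worker_id()];
  compute_into(local, i);
});

executor.run(taskflow).wait();
```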
The emplace() + precede() loop is inherently serial — each task's dependencies reference
previously-created tasks. Only the executor.run() phase is parallel. On multi-socket systems:
- Construction time is constant regardless of worker count
- Execution time decreases with more workers (with NUMA-crossing penalty at socket boundary)
- At very fine-grained tasks with many workers, construction can dominate total runtime
tf::Task is just a Node* pointer (8 bytes). At 5M tasks, a full task handle array is
~40MB — fits in L3 cache on server CPUs with sequential access patterns handled by hardware
prefetchers. Rolling windows save memory but don't improve construction speed.
Pass via -D<MACRO>:
| Macro | Default | Effect |
|---|---|---|
| TF_ENABLE_ATOMIC_NOTIFIER | not defined | Use AtomicNotifier instead of default |
| TF_DISABLE_EXCEPTION_HANDLING | not defined | Remove try/catch from scheduler loop |
| TF_ENABLE_TASK_POOL | not defined | Enable poolable node allocation |
| TF_DEFAULT_BOUNDED_TASK_QUEUE_LOG_SIZE | 8 (256 slots) | Log2 per-worker queue capacity |
| TF_DEFAULT_UNBOUNDED_TASK_QUEUE_LOG_SIZE | 10 (1024 slots) | Log2 initial overflow queue capacity |
| TF_CACHELINE_SIZE | 64 (128 on ARM64) | Cache-line size for alignment |