Hands-on GPU programming exercises from two serious learning resources:
- Programming Massively Parallel Processors (Kirk & Hwu) - the standard academic textbook on CUDA
- Oak Ridge National Laboratory CUDA Training Series - national lab curriculum covering GPU architecture and parallel programming patterns
cuda-parallel-programming/
├── 01_vector_addition/
│ ├── vectorAdd.cu # 1D parallelism, basic thread indexing
│ └── notes.md # concepts and observations
├── 02_matrix_multiplication/
│ ├── naive_matmul.cu # 2D thread indexing, global memory access
│ └── notes.md
├── 03_tiled_matmul/
│ ├── tiled_matmul.cu # shared memory optimization, __syncthreads()
│ └── notes.md
├── 04_parallel_reduction/
│ ├── reduction.cu # reduction pattern, warp-level thinking
│ └── notes.md
└── README.md
Concepts: Basic CUDA workflow, 1D thread indexing, host-device memory management
The simplest parallel operation - adds two arrays element-wise. Each thread handles exactly one element. Demonstrates the fundamental CUDA pattern:
- Allocate device memory (
cudaMalloc) - Copy data to device (
cudaMemcpy H2D) - Launch kernel
- Copy results back (
cudaMemcpy D2H) - Verify and free
idx = threadIdx.x + blockIdx.x * blockDim.x
C[idx] = A[idx] + B[idx]
With 4096 elements and block size 256 → 16 blocks × 256 threads = 4096 threads, each doing one addition in parallel.
Concepts: 2D thread indexing, global memory access patterns, performance baseline
Each thread computes one element of the output matrix - the dot product of a row of A and a column of B. Uses 2D thread indexing to map threads to matrix positions:
row = blockIdx.y * blockDim.y + threadIdx.y
col = blockIdx.x * blockDim.x + threadIdx.x
C[row][col] = sum(A[row][i] * B[i][col]) for i in range(N)
With 4096×4096 matrices: 16,777,216 threads each computing one dot product simultaneously.
Why this is called naive: Every thread reads directly from slow global memory (~400-600 cycles per access) on every iteration of the inner loop. Each thread makes 8192 global memory reads. This is the bottleneck that tiling addresses.
Concepts: Shared memory, __syncthreads(), memory hierarchy optimization
The most important GPU optimization technique. Threads cooperate to load tiles of A and B into fast shared memory (~4 cycles) before computing - instead of each thread reading independently from slow global memory.
__shared__ float tile_A[TILE_SIZE][TILE_SIZE];
__shared__ float tile_B[TILE_SIZE][TILE_SIZE];
// All threads load one element each into shared memory
tile_A[ty][tx] = A[row * ds + t * TILE_SIZE + tx];
tile_B[ty][tx] = B[(t * TILE_SIZE + ty) * ds + col];
__syncthreads(); // wait for all threads to finish loading
// Compute using fast shared memory
for (int i = 0; i < TILE_SIZE; i++)
temp += tile_A[ty][i] * tile_B[i][tx];
__syncthreads(); // wait before overwriting tiles next iteration
Performance improvement:
- Naive: N³ global memory reads
- Tiled: N³ / TILE_SIZE global memory reads
- With TILE_SIZE=16: 16x reduction in global memory traffic
Why __syncthreads() is critical:
Called twice per tile iteration - once after loading (all threads must finish writing before any reads) and once after computing (all threads must finish reading before tiles are overwritten). Without synchronization: data races produce wrong results.
Concepts: Reduction pattern, warp-level thinking, tree-based parallelism
Sums 4M floats using a tree-based reduction - the fundamental parallel algorithm pattern used everywhere in GPU computing.
Sequential sum: O(N) steps - 4,194,304 steps for 4M elements
Parallel reduction: O(log N) steps - 22 steps for 4M elements
Each iteration halves the number of active threads:
Stride 128: threads 0-127 each add one element at offset +128
Stride 64: threads 0-63 each add one element at offset +64
Stride 32: threads 0-31 each add one element at offset +32
...
Stride 1: thread 0 adds one element at offset +1
→ thread 0 holds sum of entire block
Two-level reduction: Kernel produces one partial sum per block. Final sum combines partial sums (CPU here, second GPU kernel in production).
__syncthreads() between every stride:
Threads must finish their addition before the next iteration reads those values. Faster threads would otherwise read stale data producing wrong results.
Register - per thread - ~1 cycle - fastest, very limited
Shared memory - per block - ~4 cycles - fast, 48KB typical
L2 cache - per device - ~200 cycles - automatic
Global memory - entire device - ~400-600 cycles - large, slow
Key insight:
Naive matmul reads from global memory on every access - slow.
Tiled matmul loads data into shared memory first - 100x faster reads.
This single optimization is responsible for most GPU performance gains
in real applications.
- CUDA Toolkit 11.0+
- NVIDIA GPU with compute capability 3.5+
- nvcc compiler
# Vector addition
cd 01_vector_addition
nvcc -o vectorAdd vectorAdd.cu
./vectorAdd
# Naive matrix multiplication
cd 02_matrix_multiplication
nvcc -o naive_matmul naive_matmul.cu
./naive_matmul
# Tiled matrix multiplication
cd 03_tiled_matmul
nvcc -o tiled_matmul tiled_matmul.cu
./tiled_matmul
# Parallel reduction
cd 04_parallel_reduction
nvcc -o reduction reduction.cu
./reduction# Run both and compare compute times
./naive_matmul # look for "Compute took X seconds"
./tiled_matmul # should be significantly faster-
Programming Massively Parallel Processors - Kirk & Hwu The standard academic textbook. Covers thread hierarchy, memory model, performance optimization, and parallel patterns from first principles.
-
Oak Ridge National Laboratory CUDA Training Series Practical exercises from a national lab curriculum. Focuses on hands-on implementation of real GPU programming patterns. Available at: https://github.com/olcf-tutorials/CUDA_training_series