CUDA Parallel Programming - Study Projects

Hands-on GPU programming exercises from two serious learning resources:

Programming Massively Parallel Processors (Kirk & Hwu) - the standard academic textbook on CUDA
Oak Ridge National Laboratory CUDA Training Series - national lab curriculum covering GPU architecture and parallel programming patterns

Repository Structure

cuda-parallel-programming/
├── 01_vector_addition/
│   ├── vectorAdd.cu       # 1D parallelism, basic thread indexing
│   └── notes.md           # concepts and observations
├── 02_matrix_multiplication/
│   ├── naive_matmul.cu    # 2D thread indexing, global memory access
│   └── notes.md
├── 03_tiled_matmul/
│   ├── tiled_matmul.cu    # shared memory optimization, __syncthreads()
│   └── notes.md
├── 04_parallel_reduction/
│   ├── reduction.cu       # reduction pattern, warp-level thinking
│   └── notes.md
└── README.md

Exercises

01 - Vector Addition

Concepts: Basic CUDA workflow, 1D thread indexing, host-device memory management

The simplest parallel operation - adds two arrays element-wise. Each thread handles exactly one element. Demonstrates the fundamental CUDA pattern:

Allocate device memory (cudaMalloc)
Copy data to device (cudaMemcpy H2D)
Launch kernel
Copy results back (cudaMemcpy D2H)
Verify and free

idx = threadIdx.x + blockIdx.x * blockDim.x
C[idx] = A[idx] + B[idx]

With 4096 elements and block size 256 → 16 blocks × 256 threads = 4096 threads, each doing one addition in parallel.

02 - Naive Matrix Multiplication

Concepts: 2D thread indexing, global memory access patterns, performance baseline

Each thread computes one element of the output matrix - the dot product of a row of A and a column of B. Uses 2D thread indexing to map threads to matrix positions:

row = blockIdx.y * blockDim.y + threadIdx.y
col = blockIdx.x * blockDim.x + threadIdx.x
C[row][col] = sum(A[row][i] * B[i][col]) for i in range(N)

With 4096×4096 matrices: 16,777,216 threads each computing one dot product simultaneously.

Why this is called naive: Every thread reads directly from slow global memory (~400-600 cycles per access) on every iteration of the inner loop. Each thread makes 8192 global memory reads. This is the bottleneck that tiling addresses.

03 - Tiled Matrix Multiplication

Concepts: Shared memory, __syncthreads(), memory hierarchy optimization

The most important GPU optimization technique. Threads cooperate to load tiles of A and B into fast shared memory (~4 cycles) before computing - instead of each thread reading independently from slow global memory.

__shared__ float tile_A[TILE_SIZE][TILE_SIZE];
__shared__ float tile_B[TILE_SIZE][TILE_SIZE];

// All threads load one element each into shared memory
tile_A[ty][tx] = A[row * ds + t * TILE_SIZE + tx];
tile_B[ty][tx] = B[(t * TILE_SIZE + ty) * ds + col];

__syncthreads();  // wait for all threads to finish loading

// Compute using fast shared memory
for (int i = 0; i < TILE_SIZE; i++)
    temp += tile_A[ty][i] * tile_B[i][tx];

__syncthreads();  // wait before overwriting tiles next iteration

Performance improvement:

Naive: N³ global memory reads
Tiled: N³ / TILE_SIZE global memory reads
With TILE_SIZE=16: 16x reduction in global memory traffic

Why __syncthreads() is critical: Called twice per tile iteration - once after loading (all threads must finish writing before any reads) and once after computing (all threads must finish reading before tiles are overwritten). Without synchronization: data races produce wrong results.

04 - Parallel Reduction

Concepts: Reduction pattern, warp-level thinking, tree-based parallelism

Sums 4M floats using a tree-based reduction - the fundamental parallel algorithm pattern used everywhere in GPU computing.

Sequential sum: O(N) steps        - 4,194,304 steps for 4M elements
Parallel reduction: O(log N) steps - 22 steps for 4M elements

Each iteration halves the number of active threads:

Stride 128: threads 0-127 each add one element at offset +128
Stride 64:  threads 0-63  each add one element at offset +64
Stride 32:  threads 0-31  each add one element at offset +32
...
Stride 1:   thread 0      adds  one element at offset +1
→ thread 0 holds sum of entire block

Two-level reduction: Kernel produces one partial sum per block. Final sum combines partial sums (CPU here, second GPU kernel in production).

__syncthreads() between every stride: Threads must finish their addition before the next iteration reads those values. Faster threads would otherwise read stale data producing wrong results.

Memory Hierarchy — Why It Matters

Register      - per thread    - ~1 cycle     - fastest, very limited
Shared memory - per block     - ~4 cycles    - fast, 48KB typical
L2 cache      - per device    - ~200 cycles  - automatic
Global memory - entire device - ~400-600 cycles - large, slow

Key insight:
Naive matmul reads from global memory on every access - slow.
Tiled matmul loads data into shared memory first - 100x faster reads.
This single optimization is responsible for most GPU performance gains
in real applications.

Build and Run

Requirements

CUDA Toolkit 11.0+
NVIDIA GPU with compute capability 3.5+
nvcc compiler

Compile and Run

# Vector addition
cd 01_vector_addition
nvcc -o vectorAdd vectorAdd.cu
./vectorAdd

# Naive matrix multiplication
cd 02_matrix_multiplication
nvcc -o naive_matmul naive_matmul.cu
./naive_matmul

# Tiled matrix multiplication
cd 03_tiled_matmul
nvcc -o tiled_matmul tiled_matmul.cu
./tiled_matmul

# Parallel reduction
cd 04_parallel_reduction
nvcc -o reduction reduction.cu
./reduction

Compare Naive vs Tiled Performance

# Run both and compare compute times
./naive_matmul   # look for "Compute took X seconds"
./tiled_matmul   # should be significantly faster

Learning Resources

Programming Massively Parallel Processors - Kirk & Hwu The standard academic textbook. Covers thread hierarchy, memory model, performance optimization, and parallel patterns from first principles.
Oak Ridge National Laboratory CUDA Training Series Practical exercises from a national lab curriculum. Focuses on hands-on implementation of real GPU programming patterns. Available at: https://github.com/olcf-tutorials/CUDA_training_series

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CUDA Parallel Programming - Study Projects

Repository Structure

Exercises

01 - Vector Addition

02 - Naive Matrix Multiplication

03 - Tiled Matrix Multiplication

04 - Parallel Reduction

Memory Hierarchy — Why It Matters

Build and Run

Requirements

Compile and Run

Compare Naive vs Tiled Performance

Learning Resources

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.idea		.idea
01_vector_addition		01_vector_addition
02_matrix_multiplication		02_matrix_multiplication
03_tiled_matmul		03_tiled_matmul
04_parallel_reduction		04_parallel_reduction
.DS_Store		.DS_Store
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

CUDA Parallel Programming - Study Projects

Repository Structure

Exercises

01 - Vector Addition

02 - Naive Matrix Multiplication

03 - Tiled Matrix Multiplication

04 - Parallel Reduction

Memory Hierarchy — Why It Matters

Build and Run

Requirements

Compile and Run

Compare Naive vs Tiled Performance

Learning Resources

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages