Session 8 Summary

Overview
Code Organization and Structure
- Source Directory Layout
Debugging and Error Handling
- Assertion System
- Error Logging Limitations
- Memory Debugging
Matrix Representation and Formats
- Internal Representation
Mask Handling
- GB_get_mask Function
User-Callable Method Structure
- Standard Workflow for All Methods
- Descriptor Handling
Type Compatibility and Validation
- Type Checking
Operator Management
- Operator Simplification
- Operator Flipping
Wait and Pending Operations
- Matrix State Management
Apply Operation Deep Dive
- Three Main Execution Paths
Operator Type Handling
- Positional Operators
- ISO Value Handling
Parallel Execution
- Task Slicing Strategy
- Column Index Lookup
Factory Kernel System
- Factory Organization
- Three-tier Execution Strategy
Bind First/Second Operations
- Binary Operator with Scalar
Index Unary Operators
- User-Defined Index Unary Operators
JIT Kernel Selection
- Algorithm Variants
Accum-Mask Phase
- Final Phase of All Operations
Key Insights and Design Principles
Technical Details
Session Notes

Overview

This session provides a detailed walkthrough of algorithm implementation in SuiteSparse GraphBLAS, focusing primarily on the GrB_apply operation. Dr. Davis demonstrates the internal structure of the library, explaining how algorithms are organized, optimized, and executed across different execution paths (CPU, JIT, CUDA).

Code Organization and Structure [00:00:18]

Source Directory Layout

"I just released GraphBLAS 9.3.1. Now all the source files have been restructured so it's easier to follow."

Factory folders: Template files used only to construct factory kernels, not needed for JIT
Include folders: Various definitions for algorithm sets
Template folders: Templatized kernels used in JIT and factory kernels, containing data-type-dependent algorithms
Source files (first level): Compiled directly into libgraphblas.so, not templatized, only work on integers/structure

Key principle: Algorithms that only work on structure (not data-dependent) are not JIT-compiled and reside in primary source files.

Debugging and Error Handling [00:19:30]

Assertion System

"If I want to turn on the assertions, I can go to GB_debug.h and recompile the code, then all these assertions become active."

GB_assert macros: Development-time checks that validate matrix integrity
GB_Matrix_check: Exhaustive validation that walks across entire matrix (very slow but comprehensive)
Used via assert(GB_Matrix_check(A, GB_0)) where GB_0 means don't print anything
Can change GB_0 to GB_5 to enable printing during debugging

Error Logging Limitations

"The spec committee decided to put the error string into the output object itself. But what if the error is so bad the output object is null or broken? There's no other place to put the error."

Error strings stored in output matrices (C matrix)
Some methods have no place to log errors, can only return error codes
Error strings are helpful for Python wrappers that raise exceptions with error messages

Memory Debugging [00:27:00]

"I have my own valgrind. I have a table of all the mallocs I've ever done, and I delete the ones I free. So it's the list of all live malloc spaces with their sizes."

Features:

Custom memory tracking table (order N search time, just for debugging)
Validates malloc sizes on free (catches memory corruption)
Tracks memory usage across the library
Easier to use than valgrind for quick memory leak detection

Matrix Representation and Formats [00:05:00]

Internal Representation

"All the kernels are agnostic to storage format. When you're looking at code internally, the code is written as if all matrices are stored by column. If I have a CSR matrix and I call this code, the i's and j's are reversed."

Key concepts:

Internal view: All matrices treated as column-stored (CSC)
CSR matrices: Treated as CSC with i and j reversed
Matrix structure: Collection of vectors with a flag indicating row or column orientation
Transposition handling: CSR transposed is identical to operating on CSC

Mask Handling [00:12:00]

GB_get_mask Function

"If the mask is present and iso-valued but not structural, I can make it structural. If the iso value is nonzero, it's always a structural mask. If the value is 0, you have no mask at all."

Optimizations:

Iso-valued masks with non-zero value → converted to structural masks
Iso-valued masks with zero value → mask discarded entirely, complement flag negated
Quick mask check: Empty complemented mask means nothing happens (early return)

User-Callable Method Structure [00:08:00]

Standard Workflow for All Methods

"Every GrB method has this workflow: I log where I am, I start the burble, I check my input matrices, I grab the descriptor, I grab the mask, and I call the workhorse internal method."

Typical sequence:

GB_WHERE - Establishes error logging location
Input validation (check for null/faulty matrices)
GB_GET_DESCRIPTOR - Extract descriptor components
GB_get_mask - Process mask matrix
Call internal workhorse method (e.g., GB_apply)

Descriptor Handling

"Once I extract its contents, I no longer use the descriptor. I don't pass it to my internal methods. I just pass in the pieces like is A transpose or not, is the mask complement, and so forth."

Type Compatibility and Validation [00:33:00]

Type Checking

"I have to check type compatibility, dimension compatibility, and the whole mask-accum process with the types."

Key validation points:

Operator input types vs matrix types (typecasting compatibility)
Accumulator compatibility (can C pass through accumulator?)
Dimension matching (considering transpose descriptors)
Temporary matrix T type (Z type of operator output)

Operator Management [00:31:00]

Operator Simplification

"I simplify some of the operators down. There's a lot of really strange Boolean operators in the spec, like Boolean divide. I prune them down to a smaller set of actual useful operators."

Binary operator renaming (GB_binop_rename):

Boolean operators simplified to minimal useful set
SECOND with bind_first becomes IDENTITY
Index unary operators mapped to equivalent binary operators where possible
Reduces code branches by eliminating duplicate functionality

Operator Flipping [00:50:00]

"Unfortunately I use the word 'flip' for 2 different senses."

Two meanings of "flip":

Flip_ij: Reversal of i and j indices (for CSR/CSC agnosticism)
- Used when matrix storage format differs from internal representation
- Applies to user-defined positional operators
Flip_xy: Reversal of operator arguments (X and Y)
- Used when rewriting operations (e.g., A*B → B^T * A^T)
- Non-commutative operators need explicit reversal (divide → reverse_divide, less_than → greater_than)

Third meaning - Zombie flip [00:56:30]:

GB_flip(i) = -(i+2)  // Maps 0→-2, 1→-3, 3→-5, etc.

Self-inverse function for marking entries as dead (zombies)
Should be renamed to "zombify" to avoid confusion

UPDATE: I have renamed GB_FLIP to GB_ZOMBIE and changed its definition. It now computes GB_ZOMBIE(i)=(~(i)), the one's complement of i. Like the original GB_FLIP, the function is its own inverse.

Wait and Pending Operations [00:42:00]

Matrix State Management

"This matrix wait will occur if the input matrix has any pending entries or if it has zombies. If it's just jumbled, I'll let it go because the algorithm's agnostic."

Wait conditions:

Pending tuples need assembly
Zombies need removal
Jumbled matrices (out-of-order indices) may be tolerated by some algorithms

Algorithm-specific tolerance:

Apply operations don't care if matrix is jumbled
Some algorithms require sorted indices, trigger full wait

Apply Operation Deep Dive [00:45:00]

Three Main Execution Paths

1. Transpose + Apply [01:03:00] "You may be doing C = F(A^T). That's a transpose and the apply of an operator at the same time. That's one kernel, one pass."

Fused operation handled by GB_transpose
Avoids creating temporary matrix
Single-pass algorithm

2. In-place Apply [01:06:00] "C = F(C): C and A are aliased to each other, no mask, no accumulator, no typecasting. I call the apply method directly and work on the data in place."

Special case: GB_apply_op called directly on matrix values
No memory allocation needed
Fastest execution path

3. Out-of-place Apply with Shallow Matrix [01:10:00] "I construct a new matrix C, but it's going to be a shallow matrix. The P array is shallow, the H array is shallow, the I array is shallow. Only the numerical values are different."

Creates shallow copy of structure via GB_shallow_op
Shares integer arrays (P, H, I) with input matrix
Only allocates new space for numerical values (X array)

Pure shallow case [01:11:00]: "If you're applying identity, this whole code takes constant time because the values are shallow too."

Operator Type Handling [01:19:00]

Positional Operators

"The positional operators are either 32-bit, 64-bit, or Boolean. You can apply TRIL as a Boolean operator."

Built-in positional operators hardcoded (not in factory):

POSITION_I / ROW_INDEX: Returns row index
POSITION_J / COL_INDEX: Returns column index
Boolean variants (TRIL, TRIU)
Direct implementation in switch-case (always available, cannot be disabled)

ISO Value Handling [01:13:00]

"What if you have a matrix that's all ones and you apply a unary operator that replaces all entries with their row index? It's not ISO now. I have to expand from convert an ISO matrix to a non-ISO matrix."

ISO value complications:

Input ISO → output non-ISO requires expansion
Must allocate full value array before applying operator
Detected by examining operator semantics

Parallel Execution [01:20:00]

Task Slicing Strategy

"I want to create 32 tasks per thread. If I have 10 threads, I'll create 320 tasks. I cut the matrix up into 320 pieces."

GB_SLICE_MATRIX design [01:23:00]:

Each task described by 3 integers:
- k_first: First vector to work on
- k_last: Last vector to work on
- p_start: Starting position in first vector

Advantages of this approach:

Handles single large vector: All threads work on pieces of one vector
Handles power-law matrices: Multiple threads on celebrity column, single threads on small columns
Uniform interface for all sparsity patterns

OpenMP pragmas [01:36:00]:

#pragma omp parallel for num_threads(nthreads) schedule(dynamic,1)

Always specify num_threads (for GraphBLAS control)
Dynamic scheduling with chunk size 1
Tasks are large, so dynamic scheduling important for load balancing

Column Index Lookup [01:35:00]

"If it's hypersparse, you have to look it up: j = Ah[k]. If it's not hypersparse, j = k."

Handles different sparsity formats:

Sparse/bitmap/full: j = k (direct mapping)
Hypersparse: j = Ah[k] (lookup required)
Macro abstracts this detail for algorithms

Factory Kernel System [01:45:00]

Factory Organization

"The factory kernels can be disabled entirely. They're a little bit limiting but rather than creating other methods, I just put them in a big switch case."

Factory structure:

Switch factory: Giant switch-case on operator opcode
Worker function: Actual implementation for specific (op, type) combination
Macro-based naming: GB_WORKER(z_name, a_name, op_name)
Disable macros: Can exclude operators to reduce binary size

Example factory kernel: GB_uop_apply__sqrt_fp32_fp32.c

Implements sqrt for float → float
Same source file provides both apply and transpose variants
Includes template algorithm code

Three-tier Execution Strategy [01:50:00]

Tier 1: Factory kernels (fastest)

Pre-compiled for common (operator, type) combinations
Direct C code, fully optimized
Can be disabled at compile-time with GB_COMPACT

Tier 2: JIT compilation (medium speed)

Compiles kernel at runtime for specific case
Requires operator definition string
Returns GrB_NO_VALUE if can't compile
Fast as Factory kernels once the JIT is compiled

Tier 3: Generic execution (slowest) [01:54:00] "I have a function pointer. I typecast a billion times on each scalar and apply the operator a billion times. It's crazy slow."

Function pointer calls for each element
Separate typecast function per element
Only fallback when factory and JIT both unavailable

Bind First/Second Operations [01:55:00]

Binary Operator with Scalar

Bind_first: C = op(scalar, A[i,j])

Scalar is first argument to binary operator
Requires scalar typecast to operator's X type
Uses GB_bind1st template

Bind_second: C = op(A[i,j], scalar)

Scalar is second argument
Typecast scalar to operator's Y type
Uses GB_bind2nd template

Both have separate factory kernels and JIT paths.

Index Unary Operators [01:59:00]

User-Defined Index Unary Operators

"This is calling a function pointer that has to be given the row and column indices, and I don't know if you're going to use them. Maybe you just want the theta. I'll give them to you anyway, sorry."

Challenges:

Must provide i, j, value, and theta to all user-defined index unary operators
Getting column index j is expensive (requires slicing and iteration)
No way for user to indicate which parameters are actually needed

Future optimization opportunity [02:01:00]: "If you could somehow tell me this operator only depends on the row index and theta, I could use faster algorithms. Version 3 of GraphBLAS - you could do a GrB_set to tell the operator, 'here's a hint: don't give me the input values to the matrix.'"

JIT Kernel Selection [02:02:00]

Algorithm Variants

"If I know at compile time it depends on J, I use the more complex algorithm. Otherwise I can use the simpler algorithm."

For built-in operators:

Compile-time knowledge of dependencies
Can choose optimal algorithm template
GB_apply_unop_ip vs GB_apply_unop_ijp

For user-defined operators:

Must assume worst case (depends on all parameters)
Always uses most general (slowest) algorithm
Could be optimized with user hints

Accum-Mask Phase [02:04:00]

Final Phase of All Operations

"At the end of all these primary methods there's an accum-mask phase."

Used by: apply, reduce, mxm, kronecker, transpose, eWiseAdd, eWiseMult, assign, extract, select

Special case [02:04:00]: "This may take O(1) time if all the stars align: no accumulator, no mask, T is T, C is empty. Just slap the data in and I'm done."

Optimal path:

No mask present
No accumulator present
C is empty
T can be moved into C directly (shallow copy)

Otherwise, calls GB_accum_mask which may invoke GB_eWiseAdd and other algorithms.

Key Insights and Design Principles

Algorithm Coupling

"These algorithms are coupled to each other. If you look at who calls what, it's a rat's nest, because they're all gluing together like Legos. That's what linear algebra looks like."

Optimization Philosophy

"I want to save time. If you want to take A = absolute value of A, why make a memory copy? I want to do this in place."

Multiple execution paths prioritized by speed:

Special cases (in-place, shallow copy, etc.)
CUDA acceleration when available
Factory kernels (pre-compiled)
JIT compilation (runtime compilation)
Generic fallback (function pointers)

Code Complexity Management

"I probably should have broken this up into 5 different functions. They're totally non-overlapping branches."

GB_apply_op.c has 5 completely independent algorithm branches
Could be separate functions but kept together
Each branch handles different operator type combinations

Memory Efficiency

"The output matrix is going to be shallow. Most of it's going to be shallow. The P array is shallow, the H array is shallow, the I array is shallow."

Aggressive use of shallow copies:

Reuse structure when only values change
Avoid memory allocation where possible
Constant-time operations when structure unchanged

Technical Details

Work Space Management [00:10:00]

"I have this little strangely misspelled thing called werkspace. It's a statically declared array of bytes on the stack used for small things that would otherwise require malloc."

Purpose:

Avoid alloca() (not portable, especially on Windows)
Small arrays (e.g., 400 integers for task descriptors)
Falls back to malloc if space insufficient

Typecasting in Factory Kernels [01:52:00]

"My operators don't do typecasting unless it's identity operator. Then I typecast - those are typecasting operators."

Only identity operator factories include typecast logic:

User explicitly wants typecast (e.g., float → double)
Built-in typecast kernels for common conversions
Other operators require exact type match or use JIT/generic path

Session Notes

GraphBLAS Version: 9.3.1 (recently released) Recording Time: Approximately 2 hours Next Session Topics: GB_accum_mask, transpose algorithm, eWiseAdd

FilesExpand file tree

README.md

Latest commit

History