Session 6 Summary

Interview with Dr. Tim Davis about SuiteSparse GraphBLAS JIT compilation system internals. Session Date: August 1, 2024

Overview
JIT Architecture and Global State Management
JIT Control Modes
- Five JIT Control States
- Cross-Compilation Pattern for Embedded Systems
JIT Load Process
Kernel Loading and Compilation
The Macrofy System
Error Handling and Robustness
- JIT Failure Philosophy
Implementation Statistics and Insights
- Code Organization
- CUDA Minimalism
Session Conclusion

Overview

This session continues from Session 5, diving deep into how the GraphBLAS JIT (Just-In-Time) compilation system works internally. The discussion covers kernel compilation, loading, caching, synchronization, and the macrofy process that generates source code.

JIT Architecture and Global State Management

Global State and Critical Sections [00:06:00 - 00:12:00]

Discussion: The JIT system requires global state to maintain persistent information across GraphBLAS sessions, including hash tables, cache folder paths, compiler settings, and C flags.

Key Quote: "I don't like globals. Globals are evil. But I have to have them here because you call GrB_mxm, and I jit a kernel, and I stick it a hash table, and you call it again. And I say, 'Oh, I've seen this one before.' I've got to go find the function pointer."

Technical Details:

Core JIT implementation in GB_jitifyer.c and GB_jitifyer.h (2,700 lines of code)
Hash table stores compiled kernel function pointers
OpenMP critical sections protect hash table access from multiple user threads
File locking protects against multiple GraphBLAS processes accessing same cache (see NOTE in Session 5)

Timestamp: [00:06:00]

Multi-Threading and Multi-Process Synchronization [00:08:00 - 00:09:30]

Discussion: How the JIT handles concurrent access from multiple threads within a process and multiple separate GraphBLAS processes.

Key Quote: "If one user thread compiles a jit kernel, goes into jit cache, the other thread wants to use it. I have to share it. So the hash table is protected with a critical section."

Technical Details:

User threads within same process share hash table via OpenMP critical sections
Separate processes share cache folder via file locking
Multiple layers of synchronization to prevent conflicts
Hash function depends on matrix properties (sparse/hypersparse, zombies, iso-valued, etc.)

Timestamp: [00:08:00]

Version Management and Cache Folders [00:11:00 - 00:13:00]

Discussion: How different GraphBLAS versions maintain separate cache folders and how version conflicts are handled.

Key Quote: "By default, the cache folder is like on Linux, it's ~/.SuiteSparse/GrB9.3.0, so if you have a different version like 9.3.1 or 9.2.1, it's a different folder."

Technical Details:

Default cache: ~/.SuiteSparse/GrB9.3.0 on Linux
Query function checks compiled kernel version on load
Mismatched versions cause kernel deletion and recompilation
Kernel query validates: version number, hash function, operator strings, monoid identity values

Timestamp: [00:11:00]

JIT Control Modes

Five JIT Control States [00:23:30 - 00:27:00]

Discussion: The JIT system has five operational modes that control compilation and kernel usage behavior.

States:

OFF - Hash table is empty, no JIT kernels used (not even PreJIT)
PAUSE - Don't use JIT kernels, but keep them in hash table
RUN - Use kernels already in hash table, but don't load from disk or compile new ones
LOAD - Can use hash table and load from disk, but cannot compile
ON - Full JIT enabled: compile, load, and use everything

Key Quote: "Off means the hash table is empty. Pause means turn off the jit, don't use any jit kernels, but keep them in the hash table. Run says if it's in the hash table you can call it and use it, but you can't load it from disk, you can't compile anything."

Timestamp: [00:23:30]

Cross-Compilation Pattern for Embedded Systems [00:26:00 - 00:28:30]

Discussion: Using RUN mode to deploy GraphBLAS on platforms without compilers (iPad, iPhone).

Key Quote: "You exercise GraphBLAS, create it in development, let it compile all its jit kernels at once, run through all its paces. You get a thousand jit kernels. You're done. Now you want to distribute this to the world. So you take these jit kernels, you put them back into GraphBLAS as a PreJIT kernel, you recompile GraphBLAS."

Workflow:

Development: Run with JIT fully ON, exercise all code paths
Collect compiled kernels
Bake kernels into GraphBLAS as PreJIT kernels
Deploy to target platform with JIT set to RUN mode
No compiler needed on target platform (iOS, etc.)

Timestamp: [00:26:00]

JIT Load Process

Primary Entry Point: jitifyer_load [00:21:00 - 00:23:00]

Discussion: The main function that handles JIT kernel lookup, loading, and compilation.

Key Quote: "jitifyer_load is the primary entry point from the rest of GraphBLAS into the jitifyer. The goal of this method is to produce the function pointer, and it may need to compile a kernel, or maybe to load it."

Process Flow:

Check if JIT is disabled (OFF/PAUSE) → return no value
Try fast lookup in hash table (RUN mode, no critical section)
If not found, enter critical section for full lookup/load/compile
Check PreJIT kernels
Load from disk if available
Compile if necessary

Timestamp: [00:21:00]

Fast Path Lookup (RUN Mode) [00:29:00 - 00:32:00]

Discussion: Optimized lookup path when JIT is in RUN mode - no critical section needed.

Key Quote: "This is greased lightning path for the run case. I don't even have to enter a critical section, because I know nobody within this GraphBLAS context is going to be touching the hash table."

Technical Details:

No locks/critical sections in RUN mode
Hash table is read-only
99.99% of calls use this fast path in tight loops
Returns function pointer immediately if kernel found

Timestamp: [00:29:00]

Hash Table Entry Structure [00:30:00 - 00:31:00]

Discussion: Structure stored in hash table for each compiled kernel.

Components:

16-byte encoding
64-bit hash function
Kernel name/suffix string (for user-defined operators/types)
dlopen handle (library pointer)
Function pointer to actual kernel
PreJIT index (for PreJIT kernels)
Checked flag (for validation status)

Timestamp: [00:30:00]

PreJIT Kernel Handling [00:36:00 - 00:38:30]

Discussion: Special handling for kernels baked into GraphBLAS at compile time.

Key Quote: "A PreJIT kernel has been baked into GraphBLAS already. I don't have to compile or load, it's just sitting right there. But I check it with the query function to make sure it's valid."

Process:

Look up in PreJIT kernel table (configured by CMake)
Call query function to validate
Check version, hash, operator strings, identity values
If valid: use kernel
If invalid (stale): remove from hash table, recompile if needed

Timestamp: [00:36:00]

User Operator Special Cases [00:39:00 - 00:44:00]

Discussion: How user-defined operators and types are handled differently than built-ins.

Key Quote: "If you give me a user-defined operator and you just give me the string, you don't give me the C function pointer, I can still work with that. I'll compile it myself, and I only need to use the function pointer in cases where I don't care about performance anyway."

Use Cases:

Query function kernels (return function pointer to user operator)
Size query kernels (return sizeof user-defined type)
ISO-valued matrix reductions (logarithmic time with function pointer calls)

Example: Reducing billion ISO-valued entries: only 30 function pointer calls (log₂ of billion), performance doesn't matter

Timestamp: [00:39:00]

Kernel Loading and Compilation

File Locking for Multi-Process Safety [00:51:00 - 00:53:00]

Discussion: Using file locks to prevent race conditions when multiple GraphBLAS processes compile the same kernel.

Technical Details:

Separate lock file in lock/ subfolder
Lock file named with hash code
Protects compilation/loading of same kernel by different processes
Lock failures cause JIT to be disabled (punt to generic kernel)

Error Handling Philosophy: On any JIT failure (full filesystem, permissions, compiler errors), disable JIT entirely to avoid repeated failures in tight loops.

Timestamp: [00:51:00]

Loading Compiled Kernels [00:55:00 - 00:58:00]

Discussion: Process of loading previously compiled shared libraries.

Process:

Construct filename: kernel_name + hash + .so/.dylib/.dll
Call dlopen (POSIX) or equivalent (Windows)
Get pointer to query function via dlsym
Call query function to validate kernel
If validation fails: close library, delete file, recompile
If validation succeeds: store handle and function pointer in hash

Key Quote: "I try to dlopen it. If I deal opened it, I've loaded it into memory. But this is the first time I've seen this kernel this run. So now I'm going to say, 'Okay, library I just loaded, give me a function to your query.'"

Timestamp: [00:55:00]

Compilation Trigger and Control Flow [00:57:00 - 01:00:00]

Discussion: When compilation is triggered and how LOAD vs ON modes differ.

Logic:

If JIT mode is LOAD and kernel not found on disk → fail (no compilation)
If JIT mode is ON and kernel not found → compile
Compilation happens inside critical section (one kernel at a time)
Only one compilation at a time to avoid complexity of concurrent builds

Timestamp: [00:57:00]

The Macrofy System

Source Code Generation Overview [01:00:00 - 01:02:00]

Discussion: Introduction to the macrofy process - converting encoded kernel specifications into C/CUDA source code.

Key Quote: "This is every CUDA kernel, every OpenMP kernel source code is built right here. One place. All of them both CUDA and OpenMP one place."

Components:

Preface: includes and headers
Family-specific code generation
Query function generation
All handled in GB_jitifyer.c

Timestamp: [01:00:00]

Macrofy Preface [01:01:00]

Discussion: Initial setup of generated source file.

Contents:

Standard includes
CUDA flag (#define for CUDA kernels)
User-provided custom includes (e.g., #include "my_wonky_math.h")
Comments identifying what's being compiled

Timestamp: [01:01:00]

Macrofy Family Structure [01:02:00 - 01:03:00]

Discussion: Top-level dispatcher for different kernel families.

Families:

Reduce
Apply
Assign
MxM (matrix multiply)
Build
Select
Sort
User operations
User types

Single switch statement routes to appropriate family-specific macrofy function

Timestamp: [01:02:00]

Macrofy Reduce Deep Dive [01:03:00 - 01:08:00]

Discussion: Detailed look at how reduce kernel source code is generated.

Process:

Unpack encoding bits (operation, sparse format, zombies, etc.)
Macrofy type definitions
Macrofy monoid (identity value, terminal value, operator)
Macrofy input matrix properties
Set algorithm parameters (panel size for chunked reduction)
Emit definitions and kernel code

Key Quote: "The job of macrofy reduce is to produce this whole file. I have to produce lots of things. I first have to get the code to know what I'm doing. I've encoded this already - which reduction operator, which id - as bits. I'm unpacking the bits that I packed for the code."

Timestamp: [01:03:00]

Macrofy Type Definitions [01:06:00 - 01:08:00]

Discussion: How type definitions are generated for kernels.

Approach:

Use #define instead of typedef for simplicity
Example: #define GBZ_TYPE FC32_t
Up to 6 different types possible in complex kernels
Skip duplicate types to avoid redefinition

Timestamp: [01:06:00]

Macrofy Monoid [01:08:00 - 01:12:00]

Discussion: Comprehensive look at generating monoid code.

Components Generated:

Binary operator function (static inline)
Identity value definition
Terminal value definition (if applicable)
Atomic operation macros
CUDA-specific atomic operations
Special case flags

Process Flow:

Macrofy binary operator
Macrofy identity value
Macrofy terminal value
Generate atomic macros
Emit CUDA variants (even for CPU kernels - unused but simpler)

Timestamp: [01:08:00]

Macrofy Binary Operator [01:11:00 - 01:12:00]

Discussion: Converting binary operators from codes back to source strings.

Cases:

ISO-valued matrix: No operator needed
User-defined: Emit function call
Built-in: Emit inline operation (e.g., z = x + y)

Key Quote: "If it's built-in, then this macro like for plus real becomes this: z = x + y. And the update becomes z += y. Some of them don't have update form, so they just have a call to the static inline function."

Every built-in operator defined as string template - "writing my own compiler here"

Timestamp: [01:11:00]

Macrofy Identity and Terminal Values [01:12:00 - 01:14:00]

Discussion: Generating code for monoid identity and terminal values.

Built-in Types:

Simple assignment: GB_IDENTITY = 0, GB_IDENTITY = 1.0, etc.
Handles: 0, 1, true, false, INT32_MAX, infinity, -infinity, etc.

User-Defined Types:

Identity value stored as raw bytes (e.g., 16 zeros)
Use memset for all-zero patterns
Otherwise emit byte array initialization

Key Quote: "I detected that when you defined the monoid, you gave me an identity value of all 16 zero bytes. So I can use memset to set them all to 0. I can't just use a simple assignment of 0, it's a struct."

Timestamp: [01:12:00]

Backend Selection (CPU vs CUDA) [00:59:00 - 01:00:00]

Discussion: How the system handles different compilation backends.

Technical Details:

File extension: .c for CPU, .cu for CUDA
Specialized compilation paths for each backend
Different compilers invoked (gcc/clang vs nvcc)
Same macrofy code generates both types

Key Quote: "There's only a few places where I have to specialize. The name of the file: is it a .c file or .cu file for CUDA."

Timestamp: [00:59:00]

Error Handling and Robustness

JIT Failure Philosophy [00:53:00 - 00:54:00]

Discussion: How the system handles JIT compilation failures.

Key Quote: "Whenever I have trouble in the JIT, I turn the JIT off, because I know I don't have to use the JIT. And to avoid repeated error, like if you do MxM in a kernel in a loop 1,000 times, and every time I'm going to try to compile 1,000 times - let's not do that. Let's only fail compiled once, and then punt to generic."

Philosophy:

On any error: disable JIT, fall back to generic kernel
Prevents repeated compilation attempts in tight loops
User sees verbose message about what happened
Potential improvement: add flag for "strict mode" to make JIT failures fatal

Timestamp: [00:53:00]

Implementation Statistics and Insights

Code Organization [00:05:00 - 00:06:30]

Key Files:

GB_jitifyer.c - 2,700 lines, core JIT logic
GB_jitifyer.h - header file
GB_file.c - low-level file operations (portable dlopen, file locking)
Macrofy functions - separate files for encodify, enumify, macrofy

Design Principle: Keep static variables in single file to avoid globals

Timestamp: [00:05:00]

CUDA Minimalism [01:17:00]

Discussion: How little CUDA-specific code exists in the JIT.

Key Quote: "If you look at GB_jitifyer.c to see where it talks about CUDA, that's it. Every word as they mention the word CUDA in the JIT. It's so very terse."

CUDA references are minimal:

Backend selection for .cu files
CUDA-specific #define in preface
Device function qualifiers in macros
Otherwise same code for CPU and GPU

Timestamp: [01:17:00]

Session Conclusion [01:17:00]

Coverage Summary:

JIT architecture and global state management
Thread and process synchronization
Five JIT control modes (OFF, PAUSE, RUN, LOAD, ON)
Hash table lookups and fast paths
PreJIT kernel handling
File locking and multi-process safety
Kernel loading from disk
Macrofy system for source code generation
Type, operator, and monoid code generation
Backend selection and compilation triggers

Not Covered (for Session 7):

How compilation is invoked (cmake, direct compiler calls)
Query function generation
Actual compilation commands and flags
Testing and coverage

Key Achievement: Complete understanding of the "meta-compiler" that generates GraphBLAS JIT kernels from encoded specifications.

Timestamp: [01:17:00]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Session 6 Summary

Table of Contents

Overview

JIT Architecture and Global State Management

Global State and Critical Sections [00:06:00 - 00:12:00]

Multi-Threading and Multi-Process Synchronization [00:08:00 - 00:09:30]

Version Management and Cache Folders [00:11:00 - 00:13:00]

JIT Control Modes

Five JIT Control States [00:23:30 - 00:27:00]

Cross-Compilation Pattern for Embedded Systems [00:26:00 - 00:28:30]

JIT Load Process

Primary Entry Point: jitifyer_load [00:21:00 - 00:23:00]

Fast Path Lookup (RUN Mode) [00:29:00 - 00:32:00]

Hash Table Entry Structure [00:30:00 - 00:31:00]

PreJIT Kernel Handling [00:36:00 - 00:38:30]

User Operator Special Cases [00:39:00 - 00:44:00]

Kernel Loading and Compilation

File Locking for Multi-Process Safety [00:51:00 - 00:53:00]

Loading Compiled Kernels [00:55:00 - 00:58:00]

Compilation Trigger and Control Flow [00:57:00 - 01:00:00]

The Macrofy System

Source Code Generation Overview [01:00:00 - 01:02:00]

Macrofy Preface [01:01:00]

Macrofy Family Structure [01:02:00 - 01:03:00]

Macrofy Reduce Deep Dive [01:03:00 - 01:08:00]

Macrofy Type Definitions [01:06:00 - 01:08:00]

Macrofy Monoid [01:08:00 - 01:12:00]

Macrofy Binary Operator [01:11:00 - 01:12:00]

Macrofy Identity and Terminal Values [01:12:00 - 01:14:00]

Backend Selection (CPU vs CUDA) [00:59:00 - 01:00:00]

Error Handling and Robustness

JIT Failure Philosophy [00:53:00 - 00:54:00]

Implementation Statistics and Insights

Code Organization [00:05:00 - 00:06:30]

CUDA Minimalism [01:17:00]

Session Conclusion [01:17:00]

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Session 6 Summary

Table of Contents

Overview

JIT Architecture and Global State Management

Global State and Critical Sections [00:06:00 - 00:12:00]

Multi-Threading and Multi-Process Synchronization [00:08:00 - 00:09:30]

Version Management and Cache Folders [00:11:00 - 00:13:00]

JIT Control Modes

Five JIT Control States [00:23:30 - 00:27:00]

Cross-Compilation Pattern for Embedded Systems [00:26:00 - 00:28:30]

JIT Load Process

Primary Entry Point: jitifyer_load [00:21:00 - 00:23:00]

Fast Path Lookup (RUN Mode) [00:29:00 - 00:32:00]

Hash Table Entry Structure [00:30:00 - 00:31:00]

PreJIT Kernel Handling [00:36:00 - 00:38:30]

User Operator Special Cases [00:39:00 - 00:44:00]

Kernel Loading and Compilation

File Locking for Multi-Process Safety [00:51:00 - 00:53:00]

Loading Compiled Kernels [00:55:00 - 00:58:00]

Compilation Trigger and Control Flow [00:57:00 - 01:00:00]

The Macrofy System

Source Code Generation Overview [01:00:00 - 01:02:00]

Macrofy Preface [01:01:00]

Macrofy Family Structure [01:02:00 - 01:03:00]

Macrofy Reduce Deep Dive [01:03:00 - 01:08:00]

Macrofy Type Definitions [01:06:00 - 01:08:00]

Macrofy Monoid [01:08:00 - 01:12:00]

Macrofy Binary Operator [01:11:00 - 01:12:00]

Macrofy Identity and Terminal Values [01:12:00 - 01:14:00]

Backend Selection (CPU vs CUDA) [00:59:00 - 01:00:00]

Error Handling and Robustness

JIT Failure Philosophy [00:53:00 - 00:54:00]

Implementation Statistics and Insights

Code Organization [00:05:00 - 00:06:30]

CUDA Minimalism [01:17:00]

Session Conclusion [01:17:00]