[cpu][x86] GEMM vectorization example#74

Open
adam-smnk wants to merge 9 commits into llvm:main from adam-smnk:cpu-gemm-schedule

Conversation

@adam-smnk
Member

Adds x86-specific vectorization example for matrix multiplication.
Comes with a collection of opinionated but reusable transforms and schedules.

The lowering schedule currently supports F32 (general) and BF16 (avx512, flat layout) matmuls.

@adam-smnk adam-smnk marked this pull request as draft March 13, 2026 15:48
@adam-smnk
Member Author

Needs changes from #65 related to using multiple schedules in a workload.

The added transform module aims to provide small reusable transform "bundles" to simplify writing schedules.
The schedules ended up mostly wrapping the transforms to hide the schedule-creation boilerplate.
Finally, the matmul example creates a vectorization lowering from these building blocks plus a few problem-specific bits that didn't seem generic enough for reuse.

All these helpers are opinionated by design, modeled mostly on what the example needs. The APIs could probably be refined. Also, the schedules ended up being mostly simple wrappers around the transform bundles, so perhaps it's not worth having both modules.
Open to suggestions.

@adam-smnk
Member Author

Reworked transform module to provide simple APIs over transform ops.
Schedule module now takes care of op matching to provide simple reusable rewrites.

@adam-smnk adam-smnk marked this pull request as ready for review March 16, 2026 15:29
Member

@rengolin rengolin left a comment


The reason I wanted to add a Python file as a schedule was to be able to reuse all of those new schedules you created and added to the lighthouse scope. We can discuss that later.

Some comments inline.

) -> bool:
A, B, C = self._input_arrays
out_ref = np.matmul(A, B, dtype=np.float32)
return np.allclose(C, out_ref)
Member


How is this comparing with the kernel execution output?
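A minimal sketch of the verification pattern the snippet above suggests, assuming the compiled kernel writes its result into the `C` buffer in place before the check runs. The helper name and signature here are illustrative, not taken from the PR:

```python
import numpy as np

def check_matmul_result(A, B, C, rtol=1e-5, atol=1e-8):
    """Compare a kernel's output buffer C against a NumPy reference matmul.

    Hypothetical helper: assumes the kernel has already filled C in place.
    """
    out_ref = np.matmul(A, B, dtype=np.float32)
    return np.allclose(C, out_ref, rtol=rtol, atol=atol)

# Example: stand in for the kernel by computing the product directly.
A = np.random.rand(4, 8).astype(np.float32)
B = np.random.rand(8, 4).astype(np.float32)
C = A @ B  # the buffer the compiled kernel would have written
assert check_matmul_result(A, B, C)
```

Under this reading, `C` in `self._input_arrays` is both a kernel argument and its output buffer, which is what makes the `np.allclose` comparison against the NumPy reference meaningful.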

if dtype == ml_dtypes.bfloat16:
# For BF16, enforce fixed tile size due to current rewriter pattern matching limitation.
# TODO: Relax when x86 BF16 pass supports dynamic indexing.
tile_size = 32
Member


perhaps a warning message (stderr?) saying you did this, to avoid surprises.
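One way to implement the suggested warning, as a hedged sketch: wrap the tile-size override in a small helper that reports to stderr when it silently replaces the requested value. The helper name and parameters are hypothetical; the PR's actual code just sets `tile_size` inline:

```python
import sys

def select_tile_size(is_bf16: bool, requested: int) -> int:
    """Pick the matmul tile size, forcing 32 for BF16.

    Hypothetical helper illustrating the reviewer's suggestion: warn on
    stderr whenever the BF16 constraint overrides the requested size.
    """
    if is_bf16 and requested != 32:
        print(
            f"warning: overriding tile size {requested} -> 32 for BF16 "
            "(current x86 BF16 pass requires a fixed tile size)",
            file=sys.stderr,
        )
        return 32
    return requested
```

Writing to stderr keeps the notice out of any dumped-kernel output on stdout, which matters when `--dump-kernel` is piped to a file.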

dump_payload=args.dump_kernel,
dump_schedule=args.dump_schedule,
)
else:
Member


no need for else here


3 participants