Skip to content

Add Cayley-Hamilton triangular inverse example#357

Draft
ChristosMatzoros wants to merge 1 commit into
hw-native-sys:mainfrom
ChristosMatzoros:triangular_inverse
Draft

Add Cayley-Hamilton triangular inverse example#357
ChristosMatzoros wants to merge 1 commit into
hw-native-sys:mainfrom
ChristosMatzoros:triangular_inverse

Conversation

@ChristosMatzoros
Copy link
Copy Markdown

Add Cayley-Hamilton triangular inverse example

Summary

Adds examples/intermediate/tri_inverse.py — a new intermediate example that
computes inv(I - A) for strict-lower-triangular A via the Cayley-Hamilton
doubling algorithm. Targets the chunked-GDN triangular-inverse used in
Qwen3-Next gated delta-rule attention.

Algorithm

A is strict-lower-triangular, therefore nilpotent (A^n = 0), so the
doubling iteration terminates exactly in ceil(log2(n)) steps:

X_0 = I,  Y_0 = A
for k in range(ceil(log2(n))):
    X_{k+1} = X_k + X_k @ Y_k     # equivalently X_k (I + Y_k)
    Y_{k+1} = Y_k @ Y_k
# Result: X = inv(I - A)

For n = 128 this is 7 doubling steps → 14 matmuls + 7 adds, all on
[128, 128] FP32 tiles.

Sign convention solves inv(I - A), matching pypto2's Qwen inverse_pto in
models/qwen3_next/gated_delta_rule_impl.py.

Kernel structure

Each iteration is laid out as four pl.at(CORE_GROUP, chunked_loop_optimizer)
scopes, all M-parallel (pl.parallel(0, n, m_tile, chunk=m_chunk)) with
K-blocked pl.matmul + pl.matmul_acc inside each scope:

scope reads writes
cayley_x_matmul X_state, Y_state X_acc
cayley_x_add X_state, X_acc X_state
cayley_y_snapshot Y_state Y_temp
cayley_y_square Y_temp Y_state

State (X_state, Y_state, Y_temp, X_acc) lives in pl.create_tensor GM
buffers and flows across scopes.

Why X-update is split into matmul + add. Keeping x_row (read),
acc (matmul output), and x_new (add output) all alive in one pl.at puts
three [m_tile, n] tiles in the live set per scope. With the cube/Vec
pipeline doublers this just reaches the 192 KB AIV Vec budget; the Vec
allocator now spills the last M-tile and corrupts its output. Splitting
matmul into one scope (writes a fresh GM tensor X_acc) and add into a
second (reads x_row + X_acc, writes X_state) halves the live set per
scope and keeps the kernel valid with the same m_tile=32, k_tile=64 config.

Y snapshot. Y_temp = Y_state before Y = Y @ Y so the matmul reads
from a stable tensor while another parallel iter is writing its row slab to
Y_state.

Validation

Validates against torch.linalg.inv(I - A) on a real Ascend 910B2 NPU with
rtol = atol = 2e-2. Random A seeded for reproducibility with a local
torch.Generator(seed=0); init scale 1 / (4 * sqrt(n)) targets
||A||_op ≈ 0.5 so the intermediate-tile magnitudes through 7 doubling
iterations stay bounded.

Tolerance rationale: 14 chained matmuls on the 910B2 cube (FP16 multiply +
FP32 accumulate) compound to ~1e-2 relative error, and matmul reduction
ordering is not bit-deterministic across runs. 2e-2 keeps the test green
across 5+ consecutive invocations.

n result
64 PASS
128 PASS
256 PASS
512 PASS

Test plan

  • Compile + run + validate on Ascend 910B2 at n=64, 128, 256, 512
  • Run with seed=0 5+ times — stable PASS within rtol=atol=2e-2
  • Reviewer runs python examples/intermediate/tri_inverse.py -p a2a3 -d <device>
  • Reviewer confirms script follows the canonical gemm.py style
    (M-parallel pl.parallel inside pl.at(chunked_loop_optimizer),
    K-blocked matmul/matmul_acc, GM state across pl.at scopes)

How to run

# On Ascend 910B2 device
python examples/intermediate/tri_inverse.py -p a2a3 -d <device_id>

# On simulator
python examples/intermediate/tri_inverse.py -p a2a3sim

# Other sizes (must be even and admit n // k_tile >= 1)
python examples/intermediate/tri_inverse.py -p a2a3 -d <device_id> -n 256

New intermediate example computing inv(I - A) for strict-lower-triangular
A via the Cayley-Hamilton doubling algorithm (X = X + X@Y, Y = Y@Y for
ceil(log2(n)) iterations).  Targets the chunked-GDN tri-inverse used in
Qwen3-Next.

Validates against torch.linalg.inv(I - A) on Ascend 910B2 (n=128, FP32).
Structure follows the canonical gemm.py pattern: M-parallel pl.parallel
inside pl.at(chunked_loop_optimizer) with K-blocked pl.matmul +
pl.matmul_acc to keep each InCore function inside the 192 KB AIV Vec
budget.  Y is snapshotted to Y_temp before the Y = Y@Y step so the
matmul reads from a stable tensor while writing Y_state.

Test setup:
- Seeded random init (local torch.Generator(seed=0)) for reproducibility.
- Init scale 1/(4*sqrt(n)) targets ||A||_op ~= 0.5, bounding the
  intermediate-tile magnitudes through the 7 doubling iterations.
- Tolerance rtol = atol = 2e-2: 14 chained matmuls on the Ascend 910B2
  cube (FP16 multiply + FP32 accumulate) compound to ~1e-2 relative
  error, and matmul reduction ordering is not bit-deterministic across
  runs.  2e-2 keeps the test green across 5+ consecutive invocations.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a7c1971f-c3c5-4b67-bd5b-a68e20ada545

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new example script, examples/intermediate/tri_inverse.py, which implements the Cayley-Hamilton doubling algorithm to compute the inverse of a strict-lower-triangular matrix. The implementation utilizes tiled matrix multiplications and additions within a specialized program structure. Feedback highlights that the current logic assumes the matrix dimension is a multiple of the tile sizes, which should be explicitly validated to prevent out-of-bounds access or incorrect results. Additionally, the reviewer suggests using double buffering for the state tensors to avoid inefficient global memory copies during each doubling iteration.

Comment on lines +54 to +55
n_steps = max(1, (n - 1).bit_length()) # 7 for n=128
k_blocks = n // k_tile # 2 for n=128, k_tile=64
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation assumes that n is a multiple of both m_tile and k_tile. If n is not a multiple, the pl.slice operations will either access out-of-bounds memory or the k_blocks loop will miss the remainder of the K dimension, leading to incorrect results or runtime errors. It is recommended to add an explicit check or handle the remainder.

Suggested change
n_steps = max(1, (n - 1).bit_length()) # 7 for n=128
k_blocks = n // k_tile # 2 for n=128, k_tile=64
n_steps = max(1, (n - 1).bit_length()) # 7 for n=128
if n % m_tile != 0 or n % k_tile != 0:
raise ValueError(f"n ({n}) must be a multiple of m_tile ({m_tile}) and k_tile ({k_tile})")
k_blocks = n // k_tile # 2 for n=128, k_tile=64

Comment on lines +130 to +135
with pl.at(level=pl.Level.CORE_GROUP,
optimization=pl.chunked_loop_optimizer,
name_hint="cayley_y_snapshot"):
for mb in pl.parallel(0, n, m_tile, chunk=m_chunk):
y_row = pl.slice(Y_state, [m_tile, n], [mb, 0])
Y_temp = pl.assemble(Y_temp, y_row, [mb, 0])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The cayley_y_snapshot scope performs a full GM-to-GM copy of Y_state to Y_temp in every doubling iteration. This is inefficient as it consumes significant global memory bandwidth. A more efficient approach is to use double buffering: create two Y tensors and alternate between them as source and destination in each iteration. This avoids the explicit copy scope entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant