Add Cayley-Hamilton triangular inverse example by ChristosMatzoros · Pull Request #357 · hw-native-sys/pypto-lib

ChristosMatzoros · 2026-05-22T09:47:12Z

Add Cayley-Hamilton triangular inverse example

Summary

Adds examples/intermediate/tri_inverse.py — a new intermediate example that
computes inv(I - A) for strict-lower-triangular A via the Cayley-Hamilton
doubling algorithm. Targets the chunked-GDN triangular-inverse used in
Qwen3-Next gated delta-rule attention.

Algorithm

A is strict-lower-triangular, therefore nilpotent (A^n = 0), so the
doubling iteration terminates exactly in ceil(log2(n)) steps:

X_0 = I,  Y_0 = A
for k in range(ceil(log2(n))):
    X_{k+1} = X_k + X_k @ Y_k     # equivalently X_k (I + Y_k)
    Y_{k+1} = Y_k @ Y_k
# Result: X = inv(I - A)

For n = 128 this is 7 doubling steps → 14 matmuls + 7 adds, all on
[128, 128] FP32 tiles.

Sign convention solves inv(I - A), matching pypto2's Qwen inverse_pto in
models/qwen3_next/gated_delta_rule_impl.py.

Kernel structure

Each iteration is laid out as four pl.at(CORE_GROUP, chunked_loop_optimizer)
scopes, all M-parallel (pl.parallel(0, n, m_tile, chunk=m_chunk)) with
K-blocked pl.matmul + pl.matmul_acc inside each scope:

scope	reads	writes
`cayley_x_matmul`	`X_state`, `Y_state`	`X_acc`
`cayley_x_add`	`X_state`, `X_acc`	`X_state`
`cayley_y_snapshot`	`Y_state`	`Y_temp`
`cayley_y_square`	`Y_temp`	`Y_state`

State (X_state, Y_state, Y_temp, X_acc) lives in pl.create_tensor GM
buffers and flows across scopes.

Why X-update is split into matmul + add. Keeping x_row (read),
acc (matmul output), and x_new (add output) all alive in one pl.at puts
three [m_tile, n] tiles in the live set per scope. With the cube/Vec
pipeline doublers this just reaches the 192 KB AIV Vec budget; the Vec
allocator now spills the last M-tile and corrupts its output. Splitting
matmul into one scope (writes a fresh GM tensor X_acc) and add into a
second (reads x_row + X_acc, writes X_state) halves the live set per
scope and keeps the kernel valid with the same m_tile=32, k_tile=64 config.

Y snapshot. Y_temp = Y_state before Y = Y @ Y so the matmul reads
from a stable tensor while another parallel iter is writing its row slab to
Y_state.

Validation

Validates against torch.linalg.inv(I - A) on a real Ascend 910B2 NPU with
rtol = atol = 2e-2. Random A seeded for reproducibility with a local
torch.Generator(seed=0); init scale 1 / (4 * sqrt(n)) targets
||A||_op ≈ 0.5 so the intermediate-tile magnitudes through 7 doubling
iterations stay bounded.

Tolerance rationale: 14 chained matmuls on the 910B2 cube (FP16 multiply +
FP32 accumulate) compound to ~1e-2 relative error, and matmul reduction
ordering is not bit-deterministic across runs. 2e-2 keeps the test green
across 5+ consecutive invocations.

n	result
64	PASS
128	PASS
256	PASS
512	PASS

Test plan

Compile + run + validate on Ascend 910B2 at n=64, 128, 256, 512
Run with seed=0 5+ times — stable PASS within rtol=atol=2e-2
Reviewer runs python examples/intermediate/tri_inverse.py -p a2a3 -d <device>
Reviewer confirms script follows the canonical gemm.py style
(M-parallel pl.parallel inside pl.at(chunked_loop_optimizer),
K-blocked matmul/matmul_acc, GM state across pl.at scopes)

How to run

# On Ascend 910B2 device
python examples/intermediate/tri_inverse.py -p a2a3 -d <device_id>

# On simulator
python examples/intermediate/tri_inverse.py -p a2a3sim

# Other sizes (must be even and admit n // k_tile >= 1)
python examples/intermediate/tri_inverse.py -p a2a3 -d <device_id> -n 256

New intermediate example computing inv(I - A) for strict-lower-triangular A via the Cayley-Hamilton doubling algorithm (X = X + X@Y, Y = Y@Y for ceil(log2(n)) iterations). Targets the chunked-GDN tri-inverse used in Qwen3-Next. Validates against torch.linalg.inv(I - A) on Ascend 910B2 (n=128, FP32). Structure follows the canonical gemm.py pattern: M-parallel pl.parallel inside pl.at(chunked_loop_optimizer) with K-blocked pl.matmul + pl.matmul_acc to keep each InCore function inside the 192 KB AIV Vec budget. Y is snapshotted to Y_temp before the Y = Y@Y step so the matmul reads from a stable tensor while writing Y_state. Test setup: - Seeded random init (local torch.Generator(seed=0)) for reproducibility. - Init scale 1/(4*sqrt(n)) targets ||A||_op ~= 0.5, bounding the intermediate-tile magnitudes through the 7 doubling iterations. - Tolerance rtol = atol = 2e-2: 14 chained matmuls on the Ascend 910B2 cube (FP16 multiply + FP32 accumulate) compound to ~1e-2 relative error, and matmul reduction ordering is not bit-deterministic across runs. 2e-2 keeps the test green across 5+ consecutive invocations.

coderabbitai · 2026-05-22T09:47:20Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a7c1971f-c3c5-4b67-bd5b-a68e20ada545

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a new example script, examples/intermediate/tri_inverse.py, which implements the Cayley-Hamilton doubling algorithm to compute the inverse of a strict-lower-triangular matrix. The implementation utilizes tiled matrix multiplications and additions within a specialized program structure. Feedback highlights that the current logic assumes the matrix dimension is a multiple of the tile sizes, which should be explicitly validated to prevent out-of-bounds access or incorrect results. Additionally, the reviewer suggests using double buffering for the state tensors to avoid inefficient global memory copies during each doubling iteration.

gemini-code-assist · 2026-05-22T09:49:58Z

+    n_steps = max(1, (n - 1).bit_length())  # 7 for n=128
+    k_blocks = n // k_tile  # 2 for n=128, k_tile=64


The current implementation assumes that n is a multiple of both m_tile and k_tile. If n is not a multiple, the pl.slice operations will either access out-of-bounds memory or the k_blocks loop will miss the remainder of the K dimension, leading to incorrect results or runtime errors. It is recommended to add an explicit check or handle the remainder.

Suggested change

n_steps = max(1, (n - 1).bit_length()) # 7 for n=128

k_blocks = n // k_tile # 2 for n=128, k_tile=64

n_steps = max(1, (n - 1).bit_length()) # 7 for n=128

if n % m_tile != 0 or n % k_tile != 0:

raise ValueError(f"n ({n}) must be a multiple of m_tile ({m_tile}) and k_tile ({k_tile})")

k_blocks = n // k_tile # 2 for n=128, k_tile=64

gemini-code-assist · 2026-05-22T09:49:58Z

+                with pl.at(level=pl.Level.CORE_GROUP,
+                           optimization=pl.chunked_loop_optimizer,
+                           name_hint="cayley_y_snapshot"):
+                    for mb in pl.parallel(0, n, m_tile, chunk=m_chunk):
+                        y_row = pl.slice(Y_state, [m_tile, n], [mb, 0])
+                        Y_temp = pl.assemble(Y_temp, y_row, [mb, 0])


The cayley_y_snapshot scope performs a full GM-to-GM copy of Y_state to Y_temp in every doubling iteration. This is inefficient as it consumes significant global memory bandwidth. A more efficient approach is to use double buffering: create two Y tensors and alternate between them as source and destination in each iteration. This avoids the explicit copy scope entirely.

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Cayley-Hamilton triangular inverse example#357

Add Cayley-Hamilton triangular inverse example#357
ChristosMatzoros wants to merge 1 commit into
hw-native-sys:mainfrom
ChristosMatzoros:triangular_inverse

ChristosMatzoros commented May 22, 2026

Uh oh!

coderabbitai Bot commented May 22, 2026

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		n_steps = max(1, (n - 1).bit_length()) # 7 for n=128
		k_blocks = n // k_tile # 2 for n=128, k_tile=64

Conversation

ChristosMatzoros commented May 22, 2026

Add Cayley-Hamilton triangular inverse example

Summary

Algorithm

Kernel structure

Validation

Test plan

How to run

Uh oh!

coderabbitai Bot commented May 22, 2026

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant