[Performance] `--enable-insert-sync` on a dynamic GEMM example generates 10% slower kernel than manual sync version #226

### Summary

This issue records a practical use case where `ptoas --enable-insert-sync` still leaves about 10% room for performance improvement compared with a known manual-sync plan.

### Background

I wrote a dynamic-shape matmul that is 2x faster than the [original pto-isa `gemm_performance` example](https://gitcode.com/cann/pto-isa/tree/699931c6400b68bd19e3803eec61ad0268e2aeb0/kernels/manual/a2a3/gemm_performance) and reaches 0.9~1.1x the speed of [`aclnnMatmul` in CANN 8.5.0](https://www.hiascend.com/document/detail/zh/canncommercial/850/API/aolapi/context/ops-nn/aclnnMatmul.md). See [matmul_swizzle/simple_demo](https://github.com/huawei-csl/pto-dsl/tree/e7a68427ce867048ab2d43fd6847e320262bed51/examples/aot/matmul_swizzle/simple_demo) to reproduce.

The auto-sync version is [only ~100 lines of Python](https://github.com/huawei-csl/pto-dsl/blob/e7a68427ce867048ab2d43fd6847e320262bed51/examples/aot/matmul_swizzle/simple_demo/simple_matmul_builder.py#L227-L340), and reaching 90% of manual-sync performance is already quite decent. I just wonder whether the last 10% performance gap can be closed.

### Command line

```bash
ptoas --enable-insert-sync simple_matmul_auto_sync.pto -o simple_matmul_auto_sync.cpp
ptoas simple_matmul_manual_sync.pto -o simple_matmul_manual_sync.cpp
```

### Reproduction input

[pto_matmul.zip](https://github.com/user-attachments/files/25855588/pto_matmul.zip)

contains both inputs:
- `simple_matmul_auto_sync.pto`
- `simple_matmul_manual_sync.pto`

and outputs:
- `simple_matmul_auto_sync.cpp`
- `simple_matmul_manual_sync.cpp`
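
To see where the two generated kernels actually differ, the outputs can be diffed. Below is a minimal sketch using Python's standard `difflib`; it assumes the zip has been extracted so that both `.cpp` files are readable at the given paths, and the helper names `diff_generated` and `diff_files` are made up for illustration (they are not part of ptoas or pto-dsl).

```python
from __future__ import annotations

import difflib
from pathlib import Path


def diff_generated(auto_src: str, manual_src: str) -> list[str]:
    """Return unified-diff lines going from the manual-sync source to the auto-sync source."""
    return list(
        difflib.unified_diff(
            manual_src.splitlines(),
            auto_src.splitlines(),
            fromfile="simple_matmul_manual_sync.cpp",
            tofile="simple_matmul_auto_sync.cpp",
            lineterm="",
        )
    )


def diff_files(auto_path: str, manual_path: str) -> list[str]:
    """Read both generated files from disk (paths are assumptions) and diff them."""
    return diff_generated(Path(auto_path).read_text(), Path(manual_path).read_text())
```

For example, `print("\n".join(diff_files("simple_matmul_auto_sync.cpp", "simple_matmul_manual_sync.cpp")))` prints the differences; lines starting with `+` appear only in the auto-sync output.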

### Expected performance

Ideally, auto-sync should be as fast as the manual-sync version (or perhaps even discover a faster pipelining scheme).

### Actual performance

Auto-sync is 5~15% slower than manual-sync. See the PRs below for details; they contain the full code with kernel launch and on-device performance measurements on 910B2:

- Directly written in PTO-ISA C++: https://github.com/huawei-csl/pto-kernels/pull/26
- Python, manual-sync, same performance as the C++ version above: https://github.com/huawei-csl/pto-dsl/pull/72
- Python, using ptoas auto-sync, 10% slower than manual-sync: https://github.com/huawei-csl/pto-dsl/pull/73
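
For clarity on how the percentages relate, here is a small sketch of the arithmetic, assuming "slower" refers to kernel execution time. The function names and the sample timings are hypothetical and are not measurements from the PRs above.

```python
def slowdown_percent(auto_time_us: float, manual_time_us: float) -> float:
    """Extra execution time of the auto-sync kernel, as a percentage of manual-sync time."""
    return (auto_time_us / manual_time_us - 1.0) * 100.0


def relative_throughput(auto_time_us: float, manual_time_us: float) -> float:
    """Auto-sync throughput as a fraction of manual-sync throughput for the same problem."""
    return manual_time_us / auto_time_us


# Hypothetical timings in microseconds, for illustration only.
example_slowdown = slowdown_percent(110.0, 100.0)
example_throughput = relative_throughput(110.0, 100.0)
```

With these made-up numbers, a kernel that takes 10% longer runs at roughly 91% of the manual-sync throughput, which is in line with the "about 90% of manual-sync" figure quoted in the background.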

### Git commit

29ed536dedd4e0f057fde99473d25ed72fe53ba4
