[Performance] --enable-insert-sync on a dynamic GEMM example generates 10% slower kernel than manual sync version #226

@learning-chip

Summary

This issue records a practical use case where ptoas --enable-insert-sync still leaves ~10% room for performance improvement compared to a known manual-sync plan.

Background

I wrote a dynamic-shape matmul that is 2x faster than the original pto-isa gemm_performance example and performs at 0.9~1.1x of aclnnMatmul in CANN 8.5.0. See matmul_swizzle/simple_demo to reproduce.

The auto-sync version is only ~100 lines of Python, and reaching 90% of the manual-sync performance is quite decent. I just wonder whether the last 10% performance gap can be closed.
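To make the "90% of manual-sync" figure unambiguous, here is a minimal sketch of how the ratio and the slowdown can be derived from two measured kernel times. The function names and the timings are hypothetical and only illustrate the arithmetic; they are not measurements from this issue.

```python
def relative_speed(t_manual_us: float, t_auto_us: float) -> float:
    """Throughput of auto-sync relative to manual-sync (1.0 means equal speed)."""
    if t_manual_us <= 0 or t_auto_us <= 0:
        raise ValueError("kernel times must be positive")
    return t_manual_us / t_auto_us


def slowdown_percent(t_manual_us: float, t_auto_us: float) -> float:
    """Extra run time of auto-sync over manual-sync, in percent."""
    if t_manual_us <= 0 or t_auto_us <= 0:
        raise ValueError("kernel times must be positive")
    return (t_auto_us / t_manual_us - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical timings in microseconds, NOT taken from the issue.
    t_manual, t_auto = 100.0, 110.0
    print(f"relative speed: {relative_speed(t_manual, t_auto):.3f}")
    print(f"slowdown:       {slowdown_percent(t_manual, t_auto):.1f}%")
```

Note that the two numbers are not symmetric: a kernel that takes 10% longer runs at about 91% of the reference speed, while one at 90% of the reference speed takes about 11% longer.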

Command line

ptoas --enable-insert-sync simple_matmul_auto_sync.pto -o simple_matmul_auto_sync.cpp
ptoas simple_matmul_manual_sync.pto -o simple_matmul_manual_sync.cpp

Reproduction input

pto_matmul.zip

contains both inputs:

  • simple_matmul_auto_sync.pto
  • simple_matmul_manual_sync.pto

and outputs:

  • simple_matmul_auto_sync.cpp
  • simple_matmul_manual_sync.cpp
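As a first step in locating the gap, it may help to compare how many synchronization calls each generated file contains. The sketch below is only an illustration: the primitive names in `SYNC_NAMES` are assumptions and should be replaced with whatever the generated .cpp files actually use, and `count_sync_calls` and `compare_files` are hypothetical helpers, not part of ptoas.

```python
import os
import re
from collections import Counter

# ASSUMPTION: these names are placeholders for the sync primitives that
# appear in the generated code; adjust them after inspecting the .cpp files.
SYNC_NAMES = ("set_flag", "wait_flag", "pipe_barrier")


def count_sync_calls(source: str, names=SYNC_NAMES) -> Counter:
    """Count call sites of each named primitive in a source string."""
    alternatives = "|".join(re.escape(n) for n in names)
    pattern = re.compile(r"\b(" + alternatives + r")\s*\(")
    return Counter(m.group(1) for m in pattern.finditer(source))


def compare_files(auto_path: str, manual_path: str, names=SYNC_NAMES) -> dict:
    """Return {name: (count_in_auto, count_in_manual)} for two files."""
    with open(auto_path, encoding="utf-8") as f:
        auto = count_sync_calls(f.read(), names)
    with open(manual_path, encoding="utf-8") as f:
        manual = count_sync_calls(f.read(), names)
    return {n: (auto[n], manual[n]) for n in names}


if __name__ == "__main__":
    auto_cpp = "simple_matmul_auto_sync.cpp"
    manual_cpp = "simple_matmul_manual_sync.cpp"
    # Only run the comparison when both files from the zip are present.
    if os.path.exists(auto_cpp) and os.path.exists(manual_cpp):
        for name, (n_auto, n_manual) in compare_files(auto_cpp, manual_cpp).items():
            print(f"{name}: auto={n_auto} manual={n_manual}")
```

Raw counts say nothing about where the calls sit relative to the loop body, so a plain diff of the two files is still needed to see which dependencies the inserted sync is being conservative about.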

Expected performance

Ideally, auto-sync should be as fast as the manual-sync version (or might it even discover a faster pipelining scheme?).

Actual performance

Auto-sync is 5~15% slower than manual-sync. See the detailed PRs below, which contain the full code with kernel launch and on-device performance measurements on 910B2:

Git commit

29ed536
