[Bug] FA lit regression test with S1_TILE=512 crashes at runtime, splitting to S1_TILE=256 works #629

@MirkoDeVita98

Component

PTO Dialect / ODS (include/PTO/IR)

Description

The MLIR file fa_perf.pto provided in #609 uses:

Q_ROWS=3072
S1_TOTAL=8192
HEAD=128
S1_TILE=512
QK tile: 32x512xf32
P tile sent V2C: 16x512xf16
PV tile: 16x128xf32

PTOAS accepts the MLIR and bisheng builds the generated C++ shared object, but the first kernel launch crashes with an AICore exception before any correctness check or benchmark can run.

The smallest working change we found was to split the S1_TILE=512 protocol into two S1_TILE=256 tiles while keeping the same total problem size:

Q_ROWS=3072
S1_TOTAL=8192
HEAD=128
S1_TILE=256
NUM_TILES=32

This shrinks the vector/QK chunks from 16x512 to 16x256. It also changes every derived MLIR constant: the QK FIFO slot size, the GM per-block scratch size, the tile buffer shapes, the local scratch offsets, and the software-pipeline loop count.
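As a rough cross-check of those derived constants, the sketch below recomputes them from S1_TILE. The helper name `derive_constants` is hypothetical. It assumes that each FIFO slot holds one full 32-row tile, that every FIFO has 8 GM slots (the listing states `slot_num = 8` only for the two C2V pipes), and that the `pto.addptr` offsets are counted in f32 elements. Under those assumptions the S1_TILE=256 numbers match the patched listing below.

```python
F32_BYTES = 4
F16_BYTES = 2


def derive_constants(s1_tile, s1_total=8192, head=128, q_tile=32, slot_num=8):
    """Recompute the S1_TILE-dependent constants (hypothetical helper)."""
    qk_slot_bytes = q_tile * s1_tile * F32_BYTES  # one QK tile, f32
    pv_slot_bytes = q_tile * head * F32_BYTES     # one PV tile, f32
    p_slot_bytes = q_tile * s1_tile * F16_BYTES   # one P tile, f16

    # GM scratch offsets in f32 elements (assumption): QK FIFO first,
    # then the PV FIFO, then the P FIFO.
    pv_offset = qk_slot_bytes * slot_num // F32_BYTES
    p_offset = pv_offset + pv_slot_bytes * slot_num // F32_BYTES
    per_block = p_offset + p_slot_bytes * slot_num // F32_BYTES

    num_tiles = s1_total // s1_tile
    # Cube kernel: two QK tiles are issued before the loop, then two per trip.
    cube_loop_trips = (num_tiles - 2) // 2

    return {
        "qk_slot_bytes": qk_slot_bytes,
        "pv_slot_bytes": pv_slot_bytes,
        "p_slot_bytes": p_slot_bytes,
        "pv_fifo_offset_elems": pv_offset,
        "p_fifo_offset_elems": p_offset,
        "gm_per_block_elems": per_block,
        "num_tiles": num_tiles,
        "cube_loop_trips": cube_loop_trips,
    }


for name, value in derive_constants(256).items():
    print(f"{name} = {value}")
```

Passing 512 to the same helper gives a 65536-byte QK slot and 16 tiles. Whether the original fa_perf.pto uses exactly those values has not been checked against #609.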

Important isolation results:

Original fa_perf.pto, S1_TILE=512, Q=3072, KV=8192: crashes
Same S1_TILE=512 shape with P pipe local_slot_num=1: still crashes
S1_TILE=512 with smaller Q=2048, KV=4096: still crashes
S1_TILE=256 with Q=3072, KV=8192, NUM_TILES=32: passes

So the crash does not appear to be caused solely by total sequence length, block count, GM slot size, or the local slot count of the P V2C FIFO. It appears to be tied to the 512-wide QK/P tile protocol itself.
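The elimination argument above can be written down as a small consistency check. This is only a sketch: the run table copies the four results listed, `None` marks a value the report does not state, and `p_local_slot_num = 1` for the passing run is read from the patched listing below. The function name `separating_parameters` is hypothetical.

```python
RUNS = [
    {"s1_tile": 512, "q_rows": 3072, "kv": 8192, "p_local_slot_num": None, "ok": False},
    {"s1_tile": 512, "q_rows": 3072, "kv": 8192, "p_local_slot_num": 1, "ok": False},
    {"s1_tile": 512, "q_rows": 2048, "kv": 4096, "p_local_slot_num": None, "ok": False},
    {"s1_tile": 256, "q_rows": 3072, "kv": 8192, "p_local_slot_num": 1, "ok": True},
]


def separating_parameters(runs):
    """Return parameters whose known values never overlap between pass and crash."""
    params = [key for key in runs[0] if key != "ok"]
    separating = []
    for param in params:
        # Unknown values (None) are ignored rather than guessed.
        passed = {r[param] for r in runs if r["ok"] and r[param] is not None}
        failed = {r[param] for r in runs if not r["ok"] and r[param] is not None}
        if passed and failed and passed.isdisjoint(failed):
            separating.append(param)
    return separating


print(separating_parameters(RUNS))
```

With a single passing run this narrows the search rather than proving a cause. It shows that, among the recorded parameters, S1_TILE is the only one whose value differs between every crash and the pass.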

Right now the patched version performs poorly, at about 0.43x the speed of the fused version:

Q_ROWS=3072
S1_TOTAL=8192
HEAD=128
block_dim=24
kernel_us=438.92
kernel_tflops=29.356
fused_us=188.20
fused_tflops=68.462
speedup_vs_fused=0.43x
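The reported throughput figures can be checked with a short calculation. The sketch assumes the usual attention FLOP count of 4 * Q_ROWS * S1_TOTAL * HEAD (two matmuls, softmax not counted). The report does not state this convention, but it reproduces the numbers above to within rounding.

```python
def attention_flops(q_rows, s1_total, head):
    # Q @ K^T and P @ V each cost 2 * q_rows * s1_total * head FLOPs.
    return 4 * q_rows * s1_total * head


def tflops(flops, micros):
    # Convert a FLOP count and a runtime in microseconds to TFLOP/s.
    return flops / (micros * 1e-6) / 1e12


flops = attention_flops(q_rows=3072, s1_total=8192, head=128)
kernel_us, fused_us = 438.92, 188.20

print(f"kernel_tflops    = {tflops(flops, kernel_us):.3f}")
print(f"fused_tflops     = {tflops(flops, fused_us):.3f}")
print(f"speedup_vs_fused = {fused_us / kernel_us:.2f}x")
```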

Reproduction (minimal)

Use the fa_perf.pto in #609

Expected behavior

The patched version with S1_TILE=256 works:

module {
  func.func @cube_kernel(%arg0: !pto.ptr<f32>, %arg1: !pto.ptr<f16>, %arg2: !pto.ptr<f16>, %arg3: !pto.ptr<f16>) attributes {pto.kernel_kind = #pto.kernel_kind<cube>} {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c2 = arith.constant 2 : index
    %c32 = arith.constant 32 : index
    %c128 = arith.constant 128 : index
    %c256 = arith.constant 256 : index
    %c8192 = arith.constant 8192 : index
    %c32_0 = arith.constant 32 : index
    %c96 = arith.constant 96 : index
    %0 = pto.get_block_num
    %1 = arith.index_cast %0 : i64 to index
    %2 = pto.get_block_idx
    %3 = arith.index_cast %2 : i64 to index
    %4 = arith.divsi %c96, %1 : index
    %5 = arith.remsi %c96, %1 : index
    %6 = arith.addi %4, %c1 : index
    %7 = arith.muli %3, %6 : index
    %8 = arith.addi %4, %c1 : index
    %9 = arith.muli %5, %8 : index
    %10 = arith.subi %3, %5 : index
    %11 = arith.muli %10, %4 : index
    %12 = arith.addi %9, %11 : index
    %13 = arith.cmpi slt, %3, %5 : index
    %14 = arith.select %13, %7, %12 : index
    %15 = arith.cmpi slt, %3, %5 : index
    %16 = arith.addi %4, %c1 : index
    %17 = arith.select %15, %16, %4 : index
    %18 = arith.addi %14, %17 : index
    %c131072 = arith.constant 131072 : index
    %19 = arith.muli %3, %c131072 : index
    %20 = pto.addptr %arg0, %19 : <f32> -> <f32>
    %c0_1 = arith.constant 0 : index
    %21 = pto.addptr %20, %c0_1 : <f32> -> <f32>
    %c65536 = arith.constant 65536 : index
    %22 = pto.addptr %20, %c65536 : <f32> -> <f32>
    %c98304 = arith.constant 98304 : index
    %23 = pto.addptr %20, %c98304 : <f32> -> <f32>
    %24 = pto.import_reserved_buffer{name = "fa_qk_c2v_fifo", peer_func = @vector_kernel} -> i32
    %25 = pto.initialize_l2g2l_pipe{dir_mask = 1, slot_size = 32768, slot_num = 8, local_slot_num = 1} (%21 : !pto.ptr<f32>, %24 : i32) -> !pto.pipe
    %26 = pto.import_reserved_buffer{name = "fa_pv_c2v_fifo", peer_func = @vector_kernel} -> i32
    %27 = pto.initialize_l2g2l_pipe{dir_mask = 1, slot_size = 16384, slot_num = 8, local_slot_num = 1} (%22 : !pto.ptr<f32>, %26 : i32) -> !pto.pipe
    %28 = pto.reserve_buffer{name = "fa_p_v2c_fifo", size = 16384, location = <mat>, auto = false, base = 262144} -> i32
    %c0_i32 = arith.constant 0 : i32
    pto.aic_initialize_pipe {id = 30, dir_mask = 2, slot_size = 16384, local_slot_num = 1, nosplit = false}(gm_slot_buffer = %23 : !pto.ptr<f32>, c2v_consumer_buf = %c0_i32 : i32, v2c_consumer_buf = %28 : i32)
    %c0_i64 = arith.constant 0 : i64
    %c0_i64_2 = arith.constant 0 : i64
    %29 = pto.alloc_tile addr = %c0_i64_2 : !pto.tile_buf<mat, 32x128xf16, blayout=col_major, slayout=row_major>
    %c0_i64_3 = arith.constant 0 : i64
    %30 = pto.alloc_tile addr = %c0_i64_3 : !pto.tile_buf<left, 32x128xf16, slayout=row_major>
    %c8192_i64 = arith.constant 8192 : i64
    %31 = pto.alloc_tile addr = %c8192_i64 : !pto.tile_buf<mat, 128x256xf16, slayout=col_major>
    %32 = pto.alloc_tile addr = %c0_i64 : !pto.tile_buf<right, 128x256xf16, slayout=col_major>
    %c0_i64_4 = arith.constant 0 : i64
    %33 = pto.alloc_tile addr = %c0_i64_4 : !pto.tile_buf<acc, 32x256xf32, blayout=col_major, slayout=row_major, fractal=1024>
    %c73728_i64 = arith.constant 73728 : i64
    %34 = pto.alloc_tile addr = %c73728_i64 : !pto.tile_buf<mat, 32x256xf16, blayout=col_major, slayout=row_major>
    %c8192_i64_5 = arith.constant 8192 : i64
    %35 = pto.alloc_tile addr = %c8192_i64_5 : !pto.tile_buf<left, 32x256xf16, slayout=row_major>
    %c90112_i64 = arith.constant 90112 : i64
    %36 = pto.alloc_tile addr = %c90112_i64 : !pto.tile_buf<mat, 256x128xf16, blayout=col_major, slayout=row_major>
    %37 = pto.alloc_tile addr = %c0_i64 : !pto.tile_buf<right, 256x128xf16, slayout=col_major>
    %c32768_i64 = arith.constant 32768 : i64
    %38 = pto.alloc_tile addr = %c32768_i64 : !pto.tile_buf<acc, 32x128xf32, blayout=col_major, slayout=row_major, fractal=1024>
    %c3072 = arith.constant 3072 : index
    %39 = pto.make_tensor_view %arg1, shape = [%c3072, %c128], strides = [%c128, %c1] : !pto.tensor_view<?x?xf16>
    %40 = pto.make_tensor_view %arg2, shape = [%c128, %c8192], strides = [%c1, %c128] : !pto.tensor_view<?x?xf16>
    %41 = pto.make_tensor_view %arg3, shape = [%c8192, %c128], strides = [%c128, %c1] : !pto.tensor_view<?x?xf16>
    scf.for %arg4 = %14 to %18 step %c1 {
      %42 = arith.muli %arg4, %c32 : index
      %43 = pto.partition_view %39, offsets = [%42, %c0], sizes = [%c32, %c128] : !pto.tensor_view<?x?xf16>
      pto.tload ins(%43 : !pto.partition_tensor_view<32x128xf16>) outs(%29 : !pto.tile_buf<mat, 32x128xf16, blayout=col_major, slayout=row_major>)
      pto.tmov ins(%29 : !pto.tile_buf<mat, 32x128xf16, blayout=col_major, slayout=row_major>) outs(%30 : !pto.tile_buf<left, 32x128xf16, slayout=row_major>)
      %c0_6 = arith.constant 0 : index
      %44 = pto.partition_view %40, offsets = [%c0, %c0_6], sizes = [%c128, %c256] : !pto.tensor_view<?x?xf16>
      pto.tload ins(%44 : !pto.partition_tensor_view<128x256xf16>) outs(%31 : !pto.tile_buf<mat, 128x256xf16, slayout=col_major>)
      pto.tmov ins(%31 : !pto.tile_buf<mat, 128x256xf16, slayout=col_major>) outs(%32 : !pto.tile_buf<right, 128x256xf16, slayout=col_major>)
      pto.tmatmul ins(%30, %32 : !pto.tile_buf<left, 32x128xf16, slayout=row_major>, !pto.tile_buf<right, 128x256xf16, slayout=col_major>) outs(%33 : !pto.tile_buf<acc, 32x256xf32, blayout=col_major, slayout=row_major, fractal=1024>)
      pto.tpush(%33, %25 : !pto.tile_buf<acc, 32x256xf32, blayout=col_major, slayout=row_major, fractal=1024>, !pto.pipe) {split = 1}
      %c256_7 = arith.constant 256 : index
      %45 = pto.partition_view %40, offsets = [%c0, %c256_7], sizes = [%c128, %c256] : !pto.tensor_view<?x?xf16>
      pto.tload ins(%45 : !pto.partition_tensor_view<128x256xf16>) outs(%31 : !pto.tile_buf<mat, 128x256xf16, slayout=col_major>)
      pto.tmov ins(%31 : !pto.tile_buf<mat, 128x256xf16, slayout=col_major>) outs(%32 : !pto.tile_buf<right, 128x256xf16, slayout=col_major>)
      pto.tmatmul ins(%30, %32 : !pto.tile_buf<left, 32x128xf16, slayout=row_major>, !pto.tile_buf<right, 128x256xf16, slayout=col_major>) outs(%33 : !pto.tile_buf<acc, 32x256xf32, blayout=col_major, slayout=row_major, fractal=1024>)
      pto.tpush(%33, %25 : !pto.tile_buf<acc, 32x256xf32, blayout=col_major, slayout=row_major, fractal=1024>, !pto.pipe) {split = 1}
      %46 = pto.partition_view %41, offsets = [%c0, %c0], sizes = [%c256, %c128] : !pto.tensor_view<?x?xf16>
      pto.tload ins(%46 : !pto.partition_tensor_view<256x128xf16>) outs(%36 : !pto.tile_buf<mat, 256x128xf16, blayout=col_major, slayout=row_major>)
      %c15 = arith.constant 15 : index
      scf.for %arg5 = %c0 to %c15 step %c1 {
        %50 = arith.muli %arg5, %c2 : index
        %c2_8 = arith.constant 2 : index
        %51 = arith.addi %50, %c2_8 : index
        %52 = arith.muli %51, %c256 : index
        %53 = pto.partition_view %40, offsets = [%c0, %52], sizes = [%c128, %c256] : !pto.tensor_view<?x?xf16>
        pto.tload ins(%53 : !pto.partition_tensor_view<128x256xf16>) outs(%31 : !pto.tile_buf<mat, 128x256xf16, slayout=col_major>)
        %54 = pto.tpop_from_aiv {id = 30, split = 1} -> !pto.tile_buf<mat, 32x256xf16, blayout=col_major, slayout=row_major>
        pto.tmov ins(%54 : !pto.tile_buf<mat, 32x256xf16, blayout=col_major, slayout=row_major>) outs(%35 : !pto.tile_buf<left, 32x256xf16, slayout=row_major>)
        pto.tfree_from_aiv {id = 30, split = 1}
        pto.tmov ins(%36 : !pto.tile_buf<mat, 256x128xf16, blayout=col_major, slayout=row_major>) outs(%37 : !pto.tile_buf<right, 256x128xf16, slayout=col_major>)
        %55 = arith.addi %50, %c1 : index
        %56 = arith.muli %55, %c256 : index
        %57 = pto.partition_view %41, offsets = [%56, %c0], sizes = [%c256, %c128] : !pto.tensor_view<?x?xf16>
        pto.tload ins(%57 : !pto.partition_tensor_view<256x128xf16>) outs(%36 : !pto.tile_buf<mat, 256x128xf16, blayout=col_major, slayout=row_major>)
        pto.tmatmul ins(%35, %37 : !pto.tile_buf<left, 32x256xf16, slayout=row_major>, !pto.tile_buf<right, 256x128xf16, slayout=col_major>) outs(%38 : !pto.tile_buf<acc, 32x128xf32, blayout=col_major, slayout=row_major, fractal=1024>)
        pto.tpush(%38, %27 : !pto.tile_buf<acc, 32x128xf32, blayout=col_major, slayout=row_major, fractal=1024>, !pto.pipe) {split = 1}
        pto.tmov ins(%31 : !pto.tile_buf<mat, 128x256xf16, slayout=col_major>) outs(%32 : !pto.tile_buf<right, 128x256xf16, slayout=col_major>)
        pto.tmatmul ins(%30, %32 : !pto.tile_buf<left, 32x128xf16, slayout=row_major>, !pto.tile_buf<right, 128x256xf16, slayout=col_major>) outs(%33 : !pto.tile_buf<acc, 32x256xf32, blayout=col_major, slayout=row_major, fractal=1024>)
        pto.tpush(%33, %25 : !pto.tile_buf<acc, 32x256xf32, blayout=col_major, slayout=row_major, fractal=1024>, !pto.pipe) {split = 1}
        %58 = arith.muli %arg5, %c2 : index
        %59 = arith.addi %58, %c1 : index
        %c2_9 = arith.constant 2 : index
        %60 = arith.addi %59, %c2_9 : index
        %61 = arith.muli %60, %c256 : index
        %62 = pto.partition_view %40, offsets = [%c0, %61], sizes = [%c128, %c256] : !pto.tensor_view<?x?xf16>
        pto.tload ins(%62 : !pto.partition_tensor_view<128x256xf16>) outs(%31 : !pto.tile_buf<mat, 128x256xf16, slayout=col_major>)
        %63 = pto.tpop_from_aiv {id = 30, split = 1} -> !pto.tile_buf<mat, 32x256xf16, blayout=col_major, slayout=row_major>
        pto.tmov ins(%63 : !pto.tile_buf<mat, 32x256xf16, blayout=col_major, slayout=row_major>) outs(%35 : !pto.tile_buf<left, 32x256xf16, slayout=row_major>)
        pto.tfree_from_aiv {id = 30, split = 1}
        pto.tmov ins(%36 : !pto.tile_buf<mat, 256x128xf16, blayout=col_major, slayout=row_major>) outs(%37 : !pto.tile_buf<right, 256x128xf16, slayout=col_major>)
        %64 = arith.addi %59, %c1 : index
        %65 = arith.muli %64, %c256 : index
        %66 = pto.partition_view %41, offsets = [%65, %c0], sizes = [%c256, %c128] : !pto.tensor_view<?x?xf16>
        pto.tload ins(%66 : !pto.partition_tensor_view<256x128xf16>) outs(%36 : !pto.tile_buf<mat, 256x128xf16, blayout=col_major, slayout=row_major>)
        pto.tmatmul ins(%35, %37 : !pto.tile_buf<left, 32x256xf16, slayout=row_major>, !pto.tile_buf<right, 256x128xf16, slayout=col_major>) outs(%38 : !pto.tile_buf<acc, 32x128xf32, blayout=col_major, slayout=row_major, fractal=1024>)
        pto.tpush(%38, %27 : !pto.tile_buf<acc, 32x128xf32, blayout=col_major, slayout=row_major, fractal=1024>, !pto.pipe) {split = 1}
        pto.tmov ins(%31 : !pto.tile_buf<mat, 128x256xf16, slayout=col_major>) outs(%32 : !pto.tile_buf<right, 128x256xf16, slayout=col_major>)
        pto.tmatmul ins(%30, %32 : !pto.tile_buf<left, 32x128xf16, slayout=row_major>, !pto.tile_buf<right, 128x256xf16, slayout=col_major>) outs(%33 : !pto.tile_buf<acc, 32x256xf32, blayout=col_major, slayout=row_major, fractal=1024>)
        pto.tpush(%33, %25 : !pto.tile_buf<acc, 32x256xf32, blayout=col_major, slayout=row_major, fractal=1024>, !pto.pipe) {split = 1}
      }
      %47 = pto.tpop_from_aiv {id = 30, split = 1} -> !pto.tile_buf<mat, 32x256xf16, blayout=col_major, slayout=row_major>
      pto.tmov ins(%47 : !pto.tile_buf<mat, 32x256xf16, blayout=col_major, slayout=row_major>) outs(%35 : !pto.tile_buf<left, 32x256xf16, slayout=row_major>)
      pto.tfree_from_aiv {id = 30, split = 1}
      pto.tmov ins(%36 : !pto.tile_buf<mat, 256x128xf16, blayout=col_major, slayout=row_major>) outs(%37 : !pto.tile_buf<right, 256x128xf16, slayout=col_major>)
      %c7936 = arith.constant 7936 : index
      %48 = pto.partition_view %41, offsets = [%c7936, %c0], sizes = [%c256, %c128] : !pto.tensor_view<?x?xf16>
      pto.tload ins(%48 : !pto.partition_tensor_view<256x128xf16>) outs(%36 : !pto.tile_buf<mat, 256x128xf16, blayout=col_major, slayout=row_major>)
      pto.tmatmul ins(%35, %37 : !pto.tile_buf<left, 32x256xf16, slayout=row_major>, !pto.tile_buf<right, 256x128xf16, slayout=col_major>) outs(%38 : !pto.tile_buf<acc, 32x128xf32, blayout=col_major, slayout=row_major, fractal=1024>)
      pto.tpush(%38, %27 : !pto.tile_buf<acc, 32x128xf32, blayout=col_major, slayout=row_major, fractal=1024>, !pto.pipe) {split = 1}
      %49 = pto.tpop_from_aiv {id = 30, split = 1} -> !pto.tile_buf<mat, 32x256xf16, blayout=col_major, slayout=row_major>
      pto.tmov ins(%49 : !pto.tile_buf<mat, 32x256xf16, blayout=col_major, slayout=row_major>) outs(%35 : !pto.tile_buf<left, 32x256xf16, slayout=row_major>)
      pto.tfree_from_aiv {id = 30, split = 1}
      pto.tmov ins(%36 : !pto.tile_buf<mat, 256x128xf16, blayout=col_major, slayout=row_major>) outs(%37 : !pto.tile_buf<right, 256x128xf16, slayout=col_major>)
      pto.tmatmul ins(%35, %37 : !pto.tile_buf<left, 32x256xf16, slayout=row_major>, !pto.tile_buf<right, 256x128xf16, slayout=col_major>) outs(%38 : !pto.tile_buf<acc, 32x128xf32, blayout=col_major, slayout=row_major, fractal=1024>)
      pto.tpush(%38, %27 : !pto.tile_buf<acc, 32x128xf32, blayout=col_major, slayout=row_major, fractal=1024>, !pto.pipe) {split = 1}
    }
    return
  }
  func.func @vector_kernel(%arg0: !pto.ptr<f32>, %arg1: !pto.ptr<f32>) attributes {pto.kernel_kind = #pto.kernel_kind<vector>} {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c2 = arith.constant 2 : index
    %c32 = arith.constant 32 : index
    %c16 = arith.constant 16 : index
    %c128 = arith.constant 128 : index
    %c32_0 = arith.constant 32 : index
    %c96 = arith.constant 96 : index
    %0 = pto.get_block_num
    %1 = arith.index_cast %0 : i64 to index
    %2 = pto.get_block_idx
    %3 = arith.index_cast %2 : i64 to index
    %4 = arith.divsi %c96, %1 : index
    %5 = arith.remsi %c96, %1 : index
    %6 = arith.addi %4, %c1 : index
    %7 = arith.muli %3, %6 : index
    %8 = arith.addi %4, %c1 : index
    %9 = arith.muli %5, %8 : index
    %10 = arith.subi %3, %5 : index
    %11 = arith.muli %10, %4 : index
    %12 = arith.addi %9, %11 : index
    %13 = arith.cmpi slt, %3, %5 : index
    %14 = arith.select %13, %7, %12 : index
    %15 = arith.cmpi slt, %3, %5 : index
    %16 = arith.addi %4, %c1 : index
    %17 = arith.select %15, %16, %4 : index
    %18 = arith.addi %14, %17 : index
    %c131072 = arith.constant 131072 : index
    %19 = arith.muli %3, %c131072 : index
    %20 = pto.addptr %arg0, %19 : <f32> -> <f32>
    %c0_1 = arith.constant 0 : index
    %21 = pto.addptr %20, %c0_1 : <f32> -> <f32>
    %c65536 = arith.constant 65536 : index
    %22 = pto.addptr %20, %c65536 : <f32> -> <f32>
    %c98304 = arith.constant 98304 : index
    %23 = pto.addptr %20, %c98304 : <f32> -> <f32>
    %24 = pto.reserve_buffer{name = "fa_qk_c2v_fifo", size = 32768, location = <vec>, auto = false, base = 0} -> i32
    %25 = pto.initialize_l2g2l_pipe{dir_mask = 1, slot_size = 32768, slot_num = 8, local_slot_num = 1} (%21 : !pto.ptr<f32>, %24 : i32) -> !pto.pipe
    %26 = pto.reserve_buffer{name = "fa_pv_c2v_fifo", size = 16384, location = <vec>, auto = false, base = 32768} -> i32
    %27 = pto.initialize_l2g2l_pipe{dir_mask = 1, slot_size = 16384, slot_num = 8, local_slot_num = 1} (%22 : !pto.ptr<f32>, %26 : i32) -> !pto.pipe
    %28 = pto.import_reserved_buffer{name = "fa_p_v2c_fifo", peer_func = @cube_kernel} -> i32
    %c0_i32 = arith.constant 0 : i32
    pto.aiv_initialize_pipe {id = 30, dir_mask = 2, slot_size = 16384, local_slot_num = 1, nosplit = false}(gm_slot_buffer = %23 : !pto.ptr<f32>, c2v_consumer_buf = %c0_i32 : i32, v2c_consumer_buf = %28 : i32)
    %29 = pto.get_subblock_idx
    %30 = arith.index_cast %29 : i64 to index
    %31 = arith.muli %30, %c16 : index
    %c49152_i64 = arith.constant 49152 : i64
    %32 = pto.alloc_tile addr = %c49152_i64 : !pto.tile_buf<vec, 16x256xf32>
    %c65536_i64 = arith.constant 65536 : i64
    %33 = pto.alloc_tile addr = %c65536_i64 : !pto.tile_buf<vec, 16x256xf32>
    %c81920_i64 = arith.constant 81920 : i64
    %34 = pto.alloc_tile addr = %c81920_i64 : !pto.tile_buf<vec, 16x256xf16>
    %c90112_i64 = arith.constant 90112 : i64
    %35 = pto.alloc_tile addr = %c90112_i64 : !pto.tile_buf<vec, 16x128xf32>
    %c98304_i64 = arith.constant 98304 : i64
    %36 = pto.alloc_tile addr = %c98304_i64 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>
    %c98816_i64 = arith.constant 98816 : i64
    %37 = pto.alloc_tile addr = %c98816_i64 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>
    %c99328_i64 = arith.constant 99328 : i64
    %38 = pto.alloc_tile addr = %c99328_i64 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>
    %c99840_i64 = arith.constant 99840 : i64
    %39 = pto.alloc_tile addr = %c99840_i64 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>
    %c100352_i64 = arith.constant 100352 : i64
    %40 = pto.alloc_tile addr = %c100352_i64 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>
    %c100864_i64 = arith.constant 100864 : i64
    %41 = pto.alloc_tile addr = %c100864_i64 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>
    %cst = arith.constant 0.0883883461 : f32
    %cst_2 = arith.constant 1.000000e+00 : f32
    %c3072 = arith.constant 3072 : index
    %42 = pto.make_tensor_view %arg1, shape = [%c3072, %c128], strides = [%c128, %c1] : !pto.tensor_view<?x?xf32>
    scf.for %arg2 = %14 to %18 step %c1 {
      %43 = arith.muli %arg2, %c32 : index
      %c101376_i64 = arith.constant 101376 : i64
      %44 = pto.alloc_tile addr = %c101376_i64 : !pto.tile_buf<vec, 16x256xf32>
      pto.tpop(%44, %25 : !pto.tile_buf<vec, 16x256xf32>, !pto.pipe) {split = 1}
      pto.tmuls ins(%44, %cst : !pto.tile_buf<vec, 16x256xf32>, f32) outs(%44 : !pto.tile_buf<vec, 16x256xf32>)
      pto.trowmax ins(%44, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
      %45 = pto.treshape %37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %46 = pto.treshape %36 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %47 = pto.treshape %40 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %48 = pto.treshape %38 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %49 = pto.treshape %39 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      pto.trowexpandsub ins(%44, %37 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
      pto.tmuls ins(%45, %cst_2 : !pto.tile_buf<vec, 1x16xf32>, f32) outs(%46 : !pto.tile_buf<vec, 1x16xf32>)
      pto.texp ins(%33 : !pto.tile_buf<vec, 16x256xf32>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
      pto.trowsum ins(%33, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%38 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
      pto.tcvt ins(%33 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 16x256xf32>) outs(%34 : !pto.tile_buf<vec, 16x256xf16>)
      pto.tpush_to_aic(%34 : !pto.tile_buf<vec, 16x256xf16>) {id = 30, split = 1}
      pto.tfree(%25 : !pto.pipe) {split = 1}
      %c101376_i64_3 = arith.constant 101376 : i64
      %50 = pto.alloc_tile addr = %c101376_i64_3 : !pto.tile_buf<vec, 16x256xf32>
      pto.tpop(%50, %25 : !pto.tile_buf<vec, 16x256xf32>, !pto.pipe) {split = 1}
      pto.tmuls ins(%50, %cst : !pto.tile_buf<vec, 16x256xf32>, f32) outs(%50 : !pto.tile_buf<vec, 16x256xf32>)
      pto.trowmax ins(%50, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
      %51 = pto.treshape %37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %52 = pto.treshape %36 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %53 = pto.treshape %41 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %54 = pto.treshape %38 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %55 = pto.treshape %39 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      pto.tmax ins(%51, %52 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%51 : !pto.tile_buf<vec, 1x16xf32>)
      pto.tsub ins(%52, %51 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%53 : !pto.tile_buf<vec, 1x16xf32>)
      pto.tmuls ins(%51, %cst_2 : !pto.tile_buf<vec, 1x16xf32>, f32) outs(%52 : !pto.tile_buf<vec, 1x16xf32>)
      pto.trowexpandsub ins(%50, %37 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
      pto.texp ins(%53 : !pto.tile_buf<vec, 1x16xf32>) outs(%53 : !pto.tile_buf<vec, 1x16xf32>)
      pto.texp ins(%33 : !pto.tile_buf<vec, 16x256xf32>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
      pto.tmul ins(%54, %53 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%54 : !pto.tile_buf<vec, 1x16xf32>)
      pto.trowsum ins(%33, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%39 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
      pto.tadd ins(%54, %55 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%54 : !pto.tile_buf<vec, 1x16xf32>)
      pto.tcvt ins(%33 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 16x256xf32>) outs(%34 : !pto.tile_buf<vec, 16x256xf16>)
      pto.tpush_to_aic(%34 : !pto.tile_buf<vec, 16x256xf16>) {id = 30, split = 1}
      pto.tfree(%25 : !pto.pipe) {split = 1}
      %c101376_i64_4 = arith.constant 101376 : i64
      %56 = pto.alloc_tile addr = %c101376_i64_4 : !pto.tile_buf<vec, 16x128xf32>
      pto.tpop(%56, %27 : !pto.tile_buf<vec, 16x128xf32>, !pto.pipe) {split = 1}
      pto.tmov ins(%56 : !pto.tile_buf<vec, 16x128xf32>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
      pto.tfree(%27 : !pto.pipe) {split = 1}
      %c101376_i64_5 = arith.constant 101376 : i64
      %57 = pto.alloc_tile addr = %c101376_i64_5 : !pto.tile_buf<vec, 16x256xf32>
      pto.tpop(%57, %25 : !pto.tile_buf<vec, 16x256xf32>, !pto.pipe) {split = 1}
      pto.tmuls ins(%57, %cst : !pto.tile_buf<vec, 16x256xf32>, f32) outs(%57 : !pto.tile_buf<vec, 16x256xf32>)
      pto.trowmax ins(%57, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
      %58 = pto.treshape %37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %59 = pto.treshape %36 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %60 = pto.treshape %40 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %61 = pto.treshape %38 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %62 = pto.treshape %39 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      pto.tmax ins(%58, %59 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%58 : !pto.tile_buf<vec, 1x16xf32>)
      pto.tsub ins(%59, %58 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%60 : !pto.tile_buf<vec, 1x16xf32>)
      pto.tmuls ins(%58, %cst_2 : !pto.tile_buf<vec, 1x16xf32>, f32) outs(%59 : !pto.tile_buf<vec, 1x16xf32>)
      pto.trowexpandsub ins(%57, %37 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
      pto.texp ins(%60 : !pto.tile_buf<vec, 1x16xf32>) outs(%60 : !pto.tile_buf<vec, 1x16xf32>)
      pto.texp ins(%33 : !pto.tile_buf<vec, 16x256xf32>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
      pto.tmul ins(%61, %60 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%61 : !pto.tile_buf<vec, 1x16xf32>)
      pto.trowsum ins(%33, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%39 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
      pto.tadd ins(%61, %62 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%61 : !pto.tile_buf<vec, 1x16xf32>)
      pto.tcvt ins(%33 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 16x256xf32>) outs(%34 : !pto.tile_buf<vec, 16x256xf16>)
      pto.tpush_to_aic(%34 : !pto.tile_buf<vec, 16x256xf16>) {id = 30, split = 1}
      pto.tfree(%25 : !pto.pipe) {split = 1}
      %c101376_i64_6 = arith.constant 101376 : i64
      %63 = pto.alloc_tile addr = %c101376_i64_6 : !pto.tile_buf<vec, 16x128xf32>
      pto.tpop(%63, %27 : !pto.tile_buf<vec, 16x128xf32>, !pto.pipe) {split = 1}
      pto.trowexpandmul ins(%35, %41 : !pto.tile_buf<vec, 16x128xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
      pto.tadd ins(%35, %63 : !pto.tile_buf<vec, 16x128xf32>, !pto.tile_buf<vec, 16x128xf32>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
      pto.tfree(%27 : !pto.pipe) {split = 1}
      %c101376_i64_7 = arith.constant 101376 : i64
      %64 = pto.alloc_tile addr = %c101376_i64_7 : !pto.tile_buf<vec, 16x256xf32>
      pto.tpop(%64, %25 : !pto.tile_buf<vec, 16x256xf32>, !pto.pipe) {split = 1}
      pto.tmuls ins(%64, %cst : !pto.tile_buf<vec, 16x256xf32>, f32) outs(%64 : !pto.tile_buf<vec, 16x256xf32>)
      pto.trowmax ins(%64, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
      %65 = pto.treshape %37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %66 = pto.treshape %36 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %67 = pto.treshape %41 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %68 = pto.treshape %38 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      %69 = pto.treshape %39 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
      pto.tmax ins(%65, %66 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%65 : !pto.tile_buf<vec, 1x16xf32>)
      pto.tsub ins(%66, %65 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%67 : !pto.tile_buf<vec, 1x16xf32>)
      pto.tmuls ins(%65, %cst_2 : !pto.tile_buf<vec, 1x16xf32>, f32) outs(%66 : !pto.tile_buf<vec, 1x16xf32>)
      pto.trowexpandsub ins(%64, %37 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
      pto.texp ins(%67 : !pto.tile_buf<vec, 1x16xf32>) outs(%67 : !pto.tile_buf<vec, 1x16xf32>)
      pto.texp ins(%33 : !pto.tile_buf<vec, 16x256xf32>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
      pto.tmul ins(%68, %67 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%68 : !pto.tile_buf<vec, 1x16xf32>)
      pto.trowsum ins(%33, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%39 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
      pto.tadd ins(%68, %69 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%68 : !pto.tile_buf<vec, 1x16xf32>)
      pto.tcvt ins(%33 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 16x256xf32>) outs(%34 : !pto.tile_buf<vec, 16x256xf16>)
      pto.tpush_to_aic(%34 : !pto.tile_buf<vec, 16x256xf16>) {id = 30, split = 1}
      pto.tfree(%25 : !pto.pipe) {split = 1}
      %c15 = arith.constant 15 : index
      scf.for %arg3 = %c1 to %c15 step %c1 {
        %c101376_i64_10 = arith.constant 101376 : i64
        %74 = pto.alloc_tile addr = %c101376_i64_10 : !pto.tile_buf<vec, 16x128xf32>
        pto.tpop(%74, %27 : !pto.tile_buf<vec, 16x128xf32>, !pto.pipe) {split = 1}
        pto.trowexpandmul ins(%35, %40 : !pto.tile_buf<vec, 16x128xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
        pto.tadd ins(%35, %74 : !pto.tile_buf<vec, 16x128xf32>, !pto.tile_buf<vec, 16x128xf32>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
        pto.tfree(%27 : !pto.pipe) {split = 1}
        %c101376_i64_11 = arith.constant 101376 : i64
        %75 = pto.alloc_tile addr = %c101376_i64_11 : !pto.tile_buf<vec, 16x256xf32>
        pto.tpop(%75, %25 : !pto.tile_buf<vec, 16x256xf32>, !pto.pipe) {split = 1}
        pto.tmuls ins(%75, %cst : !pto.tile_buf<vec, 16x256xf32>, f32) outs(%75 : !pto.tile_buf<vec, 16x256xf32>)
        pto.trowmax ins(%75, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
        %76 = pto.treshape %37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
        %77 = pto.treshape %36 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
        %78 = pto.treshape %40 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
        %79 = pto.treshape %38 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
        %80 = pto.treshape %39 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
        pto.tmax ins(%76, %77 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%76 : !pto.tile_buf<vec, 1x16xf32>)
        pto.tsub ins(%77, %76 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%78 : !pto.tile_buf<vec, 1x16xf32>)
        pto.tmuls ins(%76, %cst_2 : !pto.tile_buf<vec, 1x16xf32>, f32) outs(%77 : !pto.tile_buf<vec, 1x16xf32>)
        pto.trowexpandsub ins(%75, %37 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
        pto.texp ins(%78 : !pto.tile_buf<vec, 1x16xf32>) outs(%78 : !pto.tile_buf<vec, 1x16xf32>)
        pto.texp ins(%33 : !pto.tile_buf<vec, 16x256xf32>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
        pto.tmul ins(%79, %78 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%79 : !pto.tile_buf<vec, 1x16xf32>)
        pto.trowsum ins(%33, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%39 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
        pto.tadd ins(%79, %80 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%79 : !pto.tile_buf<vec, 1x16xf32>)
        pto.tcvt ins(%33 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 16x256xf32>) outs(%34 : !pto.tile_buf<vec, 16x256xf16>)
        pto.tpush_to_aic(%34 : !pto.tile_buf<vec, 16x256xf16>) {id = 30, split = 1}
        pto.tfree(%25 : !pto.pipe) {split = 1}
        %c101376_i64_12 = arith.constant 101376 : i64
        %81 = pto.alloc_tile addr = %c101376_i64_12 : !pto.tile_buf<vec, 16x128xf32>
        pto.tpop(%81, %27 : !pto.tile_buf<vec, 16x128xf32>, !pto.pipe) {split = 1}
        pto.trowexpandmul ins(%35, %41 : !pto.tile_buf<vec, 16x128xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
        pto.tadd ins(%35, %81 : !pto.tile_buf<vec, 16x128xf32>, !pto.tile_buf<vec, 16x128xf32>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
        pto.tfree(%27 : !pto.pipe) {split = 1}
        %c101376_i64_13 = arith.constant 101376 : i64
        %82 = pto.alloc_tile addr = %c101376_i64_13 : !pto.tile_buf<vec, 16x256xf32>
        pto.tpop(%82, %25 : !pto.tile_buf<vec, 16x256xf32>, !pto.pipe) {split = 1}
        pto.tmuls ins(%82, %cst : !pto.tile_buf<vec, 16x256xf32>, f32) outs(%82 : !pto.tile_buf<vec, 16x256xf32>)
        pto.trowmax ins(%82, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
        %83 = pto.treshape %37 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
        %84 = pto.treshape %36 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
        %85 = pto.treshape %41 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
        %86 = pto.treshape %38 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
        %87 = pto.treshape %39 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x16xf32>
        pto.tmax ins(%83, %84 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%83 : !pto.tile_buf<vec, 1x16xf32>)
        pto.tsub ins(%84, %83 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%85 : !pto.tile_buf<vec, 1x16xf32>)
        pto.tmuls ins(%83, %cst_2 : !pto.tile_buf<vec, 1x16xf32>, f32) outs(%84 : !pto.tile_buf<vec, 1x16xf32>)
        pto.trowexpandsub ins(%82, %37 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
        pto.texp ins(%85 : !pto.tile_buf<vec, 1x16xf32>) outs(%85 : !pto.tile_buf<vec, 1x16xf32>)
        pto.texp ins(%33 : !pto.tile_buf<vec, 16x256xf32>) outs(%33 : !pto.tile_buf<vec, 16x256xf32>)
        pto.tmul ins(%86, %85 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%86 : !pto.tile_buf<vec, 1x16xf32>)
        pto.trowsum ins(%33, %32 : !pto.tile_buf<vec, 16x256xf32>, !pto.tile_buf<vec, 16x256xf32>) outs(%39 : !pto.tile_buf<vec, 16x1xf32, blayout=col_major>)
        pto.tadd ins(%86, %87 : !pto.tile_buf<vec, 1x16xf32>, !pto.tile_buf<vec, 1x16xf32>) outs(%86 : !pto.tile_buf<vec, 1x16xf32>)
        pto.tcvt ins(%33 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 16x256xf32>) outs(%34 : !pto.tile_buf<vec, 16x256xf16>)
        pto.tpush_to_aic(%34 : !pto.tile_buf<vec, 16x256xf16>) {id = 30, split = 1}
        pto.tfree(%25 : !pto.pipe) {split = 1}
      }
      %c101376_i64_8 = arith.constant 101376 : i64
      %70 = pto.alloc_tile addr = %c101376_i64_8 : !pto.tile_buf<vec, 16x128xf32>
      pto.tpop(%70, %27 : !pto.tile_buf<vec, 16x128xf32>, !pto.pipe) {split = 1}
      pto.trowexpandmul ins(%35, %40 : !pto.tile_buf<vec, 16x128xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
      pto.tadd ins(%35, %70 : !pto.tile_buf<vec, 16x128xf32>, !pto.tile_buf<vec, 16x128xf32>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
      pto.tfree(%27 : !pto.pipe) {split = 1}
      %c101376_i64_9 = arith.constant 101376 : i64
      %71 = pto.alloc_tile addr = %c101376_i64_9 : !pto.tile_buf<vec, 16x128xf32>
      pto.tpop(%71, %27 : !pto.tile_buf<vec, 16x128xf32>, !pto.pipe) {split = 1}
      pto.trowexpandmul ins(%35, %41 : !pto.tile_buf<vec, 16x128xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
      pto.tadd ins(%35, %71 : !pto.tile_buf<vec, 16x128xf32>, !pto.tile_buf<vec, 16x128xf32>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
      pto.tfree(%27 : !pto.pipe) {split = 1}
      pto.trowexpanddiv ins(%35, %38 : !pto.tile_buf<vec, 16x128xf32>, !pto.tile_buf<vec, 16x1xf32, blayout=col_major>) outs(%35 : !pto.tile_buf<vec, 16x128xf32>)
      %72 = arith.addi %43, %31 : index
      %73 = pto.partition_view %42, offsets = [%72, %c0], sizes = [%c16, %c128] : !pto.tensor_view<?x?xf32>
      pto.tstore ins(%35 : !pto.tile_buf<vec, 16x128xf32>) outs(%73 : !pto.partition_tensor_view<16x128xf32>)
    }
    return
  }
  func.func @call_both(%arg0: memref<256xi64>, %arg1: !pto.ptr<f32>, %arg2: !pto.ptr<f16>, %arg3: !pto.ptr<f16>, %arg4: !pto.ptr<f16>, %arg5: !pto.ptr<f32>) attributes {pto.entry} {
    pto.set_ffts %arg0 : memref<256xi64>
    call @cube_kernel(%arg1, %arg2, %arg3, %arg4) : (!pto.ptr<f32>, !pto.ptr<f16>, !pto.ptr<f16>, !pto.ptr<f16>) -> ()
    call @vector_kernel(%arg1, %arg5) : (!pto.ptr<f32>, !pto.ptr<f32>) -> ()
    return
  }
}
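
For readers tracing the vector kernel, the loop body above appears to follow the usual streaming (online) softmax update. It takes the row max of the popped QK tile, merges it with the running max, rescales the running sum and the PV accumulator by exp(old_max - new_max), and divides by the running sum once at the end (`pto.trowexpanddiv`). The following is a minimal pure-Python sketch of that recurrence for a single query row, not a translation of the MLIR. The function names are hypothetical, and the mapping onto the listing is my reading of it rather than something stated in this issue. The sketch omits the score scaling (`pto.tmuls` by `%cst`), the f32 to f16 conversion of P, and the software pipelining.

```python
import math

def streaming_attention_row(scores, values, tile):
    """Softmax(scores) @ values for one query row, consuming `tile` keys at a time."""
    running_max = float("-inf")       # running row max
    running_sum = 0.0                 # running softmax denominator
    acc = [0.0] * len(values[0])      # unnormalized output accumulator
    for start in range(0, len(scores), tile):
        s = scores[start:start + tile]
        v = values[start:start + tile]
        new_max = max(running_max, max(s))
        # Rescale factor for everything accumulated under the old max.
        alpha = math.exp(running_max - new_max)
        p = [math.exp(x - new_max) for x in s]
        running_sum = running_sum * alpha + sum(p)
        acc = [a * alpha + sum(pj * vj[d] for pj, vj in zip(p, v))
               for d, a in enumerate(acc)]
        running_max = new_max
    # Single normalization after the last tile.
    return [a / running_sum for a in acc]

def reference_attention_row(scores, values):
    """Unfused reference: full softmax over all keys, then the weighted sum."""
    m = max(scores)
    p = [math.exp(x - m) for x in scores]
    total = sum(p)
    return [sum(pj * vj[d] for pj, vj in zip(p, values)) / total
            for d in range(len(values[0]))]
```

In exact arithmetic the result does not depend on `tile`, so splitting one 512-wide tile into two 256-wide tiles should change the output only by rounding. That is consistent with the S1_TILE=256 variant being a valid workaround rather than a different computation.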

Actual behavior / error logs

PTOAS and bisheng both complete successfully, but the first device synchronization after the kernel launch fails:


RuntimeError: npuSynchronizeDevice: ... AclrtSynchronizeDeviceWithTimeout, error code is 507015
[Error]: The aicore execution is abnormal.
EZ9999: ... exception of fftsplus aicore error
errorStr: CCU instruction address check error
retCode=0x26, [aicore exception]
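
As a triage aid, the local-buffer footprint of the tiles in the listing can be tabulated with plain arithmetic. The sketch below assumes densely packed tiles with no padding or alignment rounding, which this issue does not confirm. The helper names are hypothetical. The base address 101376 is taken from the `pto.alloc_tile` ops in the S1_TILE=256 listing; the corresponding address in the original S1_TILE=512 file is not shown here, so only sizes are compared for that case. This does not establish a cause for the address check error above.

```python
# Bytes per element for the two element types that appear in the listing.
ELEMENT_BYTES = {"f32": 4, "f16": 2}

def tile_bytes(rows, cols, dtype):
    """Size of a rows x cols tile, assuming dense packing with no padding."""
    return rows * cols * ELEMENT_BYTES[dtype]

def tile_extent(base, rows, cols, dtype):
    """Half-open byte range [base, end) covered by a tile placed at `base`."""
    return base, base + tile_bytes(rows, cols, dtype)

BASE = 101376  # addr operand of the pto.alloc_tile ops in the S1_TILE=256 listing

# Tiles allocated at BASE in the S1_TILE=256 listing.
for rows, cols, dtype in [(16, 256, "f32"), (16, 128, "f32")]:
    start, end = tile_extent(BASE, rows, cols, dtype)
    print(f"{rows}x{cols}x{dtype}: [{start}, {end})")

# Size-only comparison of the QK vector tile for the two S1_TILE settings.
for s1_tile in (256, 512):
    print(f"S1_TILE={s1_tile}: QK tile {tile_bytes(16, s1_tile, 'f32')} bytes, "
          f"P tile {tile_bytes(16, s1_tile, 'f16')} bytes")
```

Under these assumptions, going from S1_TILE=256 to S1_TILE=512 doubles every S1-wide tile, while the 16x128 PV tile is unchanged.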

Git commit

150abd8

Host platform

Linux (aarch64)

Target Ascend arch (if relevant)

a3

PTOAS build level (if relevant)

level3
