[Issue]: Model run fails with segfault in libmigraphx.so when one dynamic dimension is specified #4845

@mferencevic

Problem description

Related to #4844: MIGraphX successfully compiles a simple model when one dynamic dimension is specified, but running the compiled model then fails with a segfault in libmigraphx.so.

The MIGRAPHX_ENABLE_FULL_DYNAMIC environment variable makes no difference: the segfault occurs whether or not it is set to 1.

Steps to reproduce

import math
import migraphx
import torch

DEVICE = "cuda:0"
EMBEDDING_COUNT = 32
EMBEDDING_DIM = 16
BATCH_SIZE = 4

torch.inference_mode(True)
torch.cuda.set_device(DEVICE)

_TORCH_TYPE_MAPPING = {
    torch.int64: "int64_type",
    torch.float32: "float_type",
}

def _convert_tensor_to_argument(tensor):
    assert str(tensor.device) == DEVICE
    assert tensor.is_contiguous()
    return migraphx.argument_from_pointer(
        migraphx.shape(
            type=_TORCH_TYPE_MAPPING[tensor.dtype],
            lens=list(tensor.size()),
            strides=list(tensor.stride()),
        ), tensor.data_ptr())

model = torch.nn.Embedding(EMBEDDING_COUNT, EMBEDDING_DIM)
model.eval()
input_batch = torch.arange(math.ceil(EMBEDDING_COUNT / 2)).repeat(BATCH_SIZE, 1).contiguous()

torch.onnx.export(
    model,
    (input_batch,),
    "model.onnx",
    external_data=False,
    dynamo=True,
    dynamic_shapes=[
        {0: torch.export.Dim.DYNAMIC, 1: torch.export.Dim.DYNAMIC},
    ],
)

migraphx_model = migraphx.parse_onnx("model.onnx", map_dyn_input_dims={
    "input": [
        migraphx.shape.dynamic_dimension(BATCH_SIZE, BATCH_SIZE, {BATCH_SIZE}),
        migraphx.shape.dynamic_dimension(1, 64, {1}),
    ],
})
migraphx_model.compile(migraphx.get_target("gpu"), offload_copy=False)

input_batch = input_batch.to(DEVICE)
output = torch.empty(
    (*input_batch.shape, EMBEDDING_DIM), dtype=torch.float32, device=DEVICE)
torch.cuda.synchronize(DEVICE)

migraphx_model.run({
    "input": _convert_tensor_to_argument(input_batch),
    "main:#output_0": _convert_tensor_to_argument(output),
})

We have observed the same issue with larger models; this reproducer script uses a single node to make analysis simpler.

Compared with the script in #4844, this script makes the first dimension (BATCH_SIZE) static and leaves only the second dimension dynamic. The segfault also occurs in the opposite configuration, where the first dimension (BATCH_SIZE) stays dynamic and the second dimension is made static.
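For reference, the two configurations described above could be generated with a small helper like the one below. This is only a sketch. The names SEQ_LEN, fixed, input_dim_specs and to_migraphx_dims are hypothetical. The (1, 64, {1}) range is copied from the script, and reusing it for the first dimension in the swapped variant is an assumption, since the exact bounds used there are not stated. The only MIGraphX call is migraphx.shape.dynamic_dimension, used with the same three arguments as in the script.

```python
import math

EMBEDDING_COUNT = 32
BATCH_SIZE = 4
# Second dimension of input_batch in the script: ceil(32 / 2) == 16.
SEQ_LEN = math.ceil(EMBEDDING_COUNT / 2)


def fixed(n):
    """Describe a dimension pinned to n as a (min, max, optimals) triple."""
    return (n, n, {n})


def input_dim_specs(dynamic_axis):
    """Return the (min, max, optimals) triples for the "input" tensor.

    dynamic_axis=1 mirrors the script in this issue.
    dynamic_axis=0 is the swapped variant (dynamic batch, static second dim).
    The (1, 64, {1}) range is taken from the script; applying it to axis 0
    is an assumption made for illustration.
    """
    specs = [fixed(BATCH_SIZE), fixed(SEQ_LEN)]
    specs[dynamic_axis] = (1, 64, {1})
    return specs


def to_migraphx_dims(specs):
    """Convert the plain triples into MIGraphX dynamic_dimension objects."""
    # Imported lazily so the helpers above can be used without MIGraphX.
    import migraphx

    return [migraphx.shape.dynamic_dimension(lo, hi, opt) for lo, hi, opt in specs]
```

The swapped variant would then be parsed with map_dyn_input_dims={"input": to_migraphx_dims(input_dim_specs(0))}, leaving the rest of the script unchanged.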

Environment

OS: Debian GNU/Linux 12 (bookworm)
CPU: AMD Ryzen 9 9950X
GPU: AMD Radeon AI PRO R9700
ROCm version: 7.2.1
MIGraphX version: 2.16.0.dev+20250912-17-406-gb91f1c0c0
