
Need to catch when code attempts to run on a device ID that doesn't exist #2711

@ggorman


There are cases where the runtime environment is malformed and Devito is asked to run CUDA code on a device ID that does not exist.

Example: when using docker run, you cannot pass both --env CUDA_VISIBLE_DEVICES and --gpus "device=${CUDA_VISIBLE_DEVICES:-all}". To see why, suppose the host has export CUDA_VISIBLE_DEVICES=1. Passing --env CUDA_VISIBLE_DEVICES to docker run means the container environment will contain CUDA_VISIBLE_DEVICES=1. However, passing --gpus "device=${CUDA_VISIBLE_DEVICES:-all}" makes the docker runtime expose only GPU 1 to the container, renumbered as device 0. Therefore, CUDA code running inside the container sees a single GPU with device ID 0, while CUDA_VISIBLE_DEVICES points to device ID 1, and you get an (uncaught) exception.

This leads to a very opaque failure that is difficult for users to understand and debug, e.g.:

tests/test_gpu_openacc.py ...............
Error: Process completed with exit code 1.

Therefore, when running on GPUs, we need to sanity check that we can execute on the target device and, if not, emit an informative user message. Alternatively, it might be that we are not checking the error code of a CUDA call that would provide the same message.
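
A minimal sketch of what such a sanity check could look like, in Python. This is not existing Devito code: the function name check_visible_devices is hypothetical, and device_count is assumed to be obtained separately (for instance by counting the GPUs the container actually exposes). It only validates integer entries of CUDA_VISIBLE_DEVICES and skips UUID-style entries.

```python
def check_visible_devices(visible, device_count):
    """Raise RuntimeError if `visible` names a device index that does not exist.

    visible: value of CUDA_VISIBLE_DEVICES as a string, or None if unset.
    device_count: number of GPUs present in this environment (assumed to be
        determined by the caller; how it is obtained is outside this sketch).
    """
    if visible is None:
        return  # variable unset: nothing to validate

    for entry in visible.split(","):
        entry = entry.strip()
        if not entry.isdigit():
            continue  # UUID-style or empty entries are not checked here
        if int(entry) >= device_count:
            raise RuntimeError(
                f"CUDA_VISIBLE_DEVICES={visible!r} refers to device {entry}, "
                f"but only {device_count} device(s) exist in this environment. "
                "Note that `docker run --gpus device=N` renumbers the "
                "selected GPU as device 0 inside the container."
            )
```

For the docker scenario described above, calling check_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"), 1) with CUDA_VISIBLE_DEVICES=1 would raise a RuntimeError carrying an explanatory message, instead of the opaque exit code 1.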
