Need to catch when code attempts to run on a device ID that doesn't exist #2711

There are cases where the runtime environment is misconfigured and Devito is asked to run CUDA code on a device ID that does not exist.

Example: with `docker run` you cannot pass both `--env CUDA_VISIBLE_DEVICES` and `--gpus "device=${CUDA_VISIBLE_DEVICES:-all}"`. To see why, suppose the host has `export CUDA_VISIBLE_DEVICES=1`. Passing `--env CUDA_VISIBLE_DEVICES` to `docker run` means the container environment will contain `CUDA_VISIBLE_DEVICES=1`. However, `--gpus "device=${CUDA_VISIBLE_DEVICES:-all}"` exposes only GPU 1 to the container and renumbers it as device 0. CUDA code running inside the container therefore sees a single GPU with device ID 0, while `CUDA_VISIBLE_DEVICES` still points at device ID 1, which does not exist there, and you get an (uncaught) exception.
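The mismatch can be reproduced on paper with a small model. The sketch below is not Devito or CUDA code; `container_visible_devices` is a hypothetical helper that only handles integer indices (not GPU UUIDs) and assumes the documented CUDA behaviour that an invalid entry in `CUDA_VISIBLE_DEVICES` hides itself and every entry after it.

```python
def container_visible_devices(cuda_visible_devices, container_gpu_count):
    """Model which container-local device indices CUDA would expose.

    Hypothetical helper for illustration only. `container_gpu_count` is the
    number of GPUs that `--gpus` made available, renumbered from 0.
    """
    visible = []
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        # Assumption: an invalid index hides itself and all later entries.
        if not token.isdigit() or int(token) >= container_gpu_count:
            break
        visible.append(int(token))
    return visible


# Host: export CUDA_VISIBLE_DEVICES=1
# Container: --gpus "device=1" exposes one GPU, renumbered to 0,
# while --env CUDA_VISIBLE_DEVICES still forwards the value "1".
print(container_visible_devices("1", container_gpu_count=1))  # []

# Without the stale variable pointing at index 1, the single GPU is usable.
print(container_visible_devices("0", container_gpu_count=1))  # [0]
```

Under this model the container ends up with no usable device at all, which matches the failure described above.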

This leads to a very opaque failure that is difficult for users to understand and debug, e.g.:
```
tests/test_gpu_openacc.py ...............
Error: Process completed with exit code 1.
```

Therefore, when running on GPUs, we need to sanity check that we can execute on the target device and, if not, emit an informative user message. Alternatively, it might be that we are not checking the error code of a CUDA call that would provide the same message.
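One possible shape for such a check is sketched below. It is only an illustration: `DeviceNotFoundError` and `check_device_id` are hypothetical names, not existing Devito API, and the sketch assumes the device count has already been obtained from the runtime (for example via `cudaGetDeviceCount` or `acc_get_num_devices`).

```python
import os


class DeviceNotFoundError(RuntimeError):
    """Hypothetical error type for a request to run on a missing device."""


def check_device_id(device_id, device_count, env=None):
    """Return `device_id` if it is valid, else raise an informative error.

    `device_count` is assumed to come from the GPU runtime; `env` defaults
    to the process environment and is a parameter only to ease testing.
    """
    env = os.environ if env is None else env
    if 0 <= device_id < device_count:
        return device_id

    # Include the environment variable in the message, since a stale or
    # forwarded value is a likely cause of the mismatch.
    hint = env.get("CUDA_VISIBLE_DEVICES")
    hint_text = "unset" if hint is None else f"CUDA_VISIBLE_DEVICES={hint}"
    raise DeviceNotFoundError(
        f"Requested device ID {device_id}, but the runtime reports "
        f"{device_count} visible device(s) ({hint_text}). "
        "Check the container and environment configuration."
    )


# Usage: one visible device, but device ID 1 is requested.
try:
    check_device_id(1, device_count=1, env={"CUDA_VISIBLE_DEVICES": "1"})
except DeviceNotFoundError as err:
    print(err)
```

Running a check like this before launching any kernel would turn the opaque `exit code 1` into a message that names the requested ID and the offending environment variable.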
