cudaErrorIllegalAddress when running with Trilinos+Belos+MueLu on GPUs
Created by: georgepaw
Hi,
I have recently been looking into the performance of the TeaLeaf mini-app (https://github.com/georgepaw/TeaLeaf_Trilinos/tree/kokkos_gpu), which has been ported to Trilinos and runs on multiple GPUs. This implementation uses the Belos package for the CG solve and MueLu for the multigrid preconditioners.
Using the instructions from here:
https://trilinos.org/about/documentation/building-with-cuda-support/
I have been able to build Trilinos (commit 8f5c62f) with Kokkos and CUDA support, using gcc 4.8.5, OpenMPI 1.10.3 (with CUDA support enabled) and CUDA 7.5.18. The build was configured with https://github.com/georgepaw/TeaLeaf_Trilinos/blob/kokkos_gpu/do-configure.sh. With this build I have been able to run TeaLeaf with a CG solve and no preconditioner on multiple nodes with multiple GPUs (NVIDIA K40), using https://gist.github.com/georgepaw/ee7a3dc9fcbb904dee11860e1e4d47b2 as Options.xml.
I am now trying to use the multigrid preconditioners from MueLu (using https://gist.github.com/georgepaw/a22360106f7bc40738686427a432a3b2 and https://gist.github.com/georgepaw/ccf4d6a4138285f1f37f0856c9c03ffe as Options.xml). Both of these, MueLu-GS and MueLu-damped Jacobi, work fine on a single GPU and on 2 GPUs (either 2 GPUs on 1 node, or 2 nodes with 1 GPU each).
When running on more than 2 GPUs, I get the following error for both MueLu-GS and MueLu-damped Jacobi:
terminate called after throwing an instance of 'std::runtime_error'
  what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /path/to/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:122
(cleaned up trace in https://gist.github.com/georgepaw/8a1c3e28f930c1228de4c2f8e47ca627)
For MueLu-GS I have also tried running multiple ranks on the same GPU, since I expect there might be some host-only computations. I have been able to run on 1 GPU with 2 MPI ranks (1 CPU core per rank). However, running more than 2 ranks on a single GPU, or more than 1 rank per GPU on 2 GPUs, results in the same illegal memory access error as above.
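To make the rank/GPU layouts above concrete, here is a minimal sketch of a round-robin mapping from node-local MPI rank to CUDA device. The round-robin scheme is an assumption for illustration only; I have not confirmed that this is how Kokkos or TeaLeaf assigns devices in this build. `device_for_rank` and `local_rank_from_env` are hypothetical helper names, and `OMPI_COMM_WORLD_LOCAL_RANK` is the environment variable Open MPI sets for the node-local rank.

```cpp
#include <cstdlib>
#include <stdexcept>

// Hypothetical helper: round-robin mapping of a node-local MPI rank onto the
// GPUs visible on that node (assumed scheme, not taken from Kokkos/TeaLeaf).
int device_for_rank(int local_rank, int num_devices) {
    if (num_devices <= 0) {
        throw std::invalid_argument("num_devices must be positive");
    }
    if (local_rank < 0) {
        throw std::invalid_argument("local_rank must be non-negative");
    }
    return local_rank % num_devices;
}

// Hypothetical helper: read the node-local rank exported by Open MPI.
// Returns `fallback` when the variable is not set (e.g. a non-MPI run).
int local_rank_from_env(int fallback = 0) {
    const char* value = std::getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    return value ? std::atoi(value) : fallback;
}
```

Under this assumed mapping, 4 ranks on a node with 2 GPUs would place ranks 0 and 2 on device 0 and ranks 1 and 3 on device 1, which is one of the layouts that fails for me.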