Tpetra_ASSUME_CUDA_AWARE_MPI affecting test performance on white
Created by: kddevin
@trilinos/tpetra @trilinos/zoltan2 @ndellingwood @kyungjoo-kim @nmhamster @crtrott
Next Action Status
PR #3500, merged to 'develop' on 12/6/2018, enables Tpetra_ASSUME_CUDA_AWARE_MPI=ON by default for ATDM (and all) CUDA Trilinos builds. Next, watch the ATDM Trilinos builds and the EMPIRE builds over the next few days to see what happens.
Using devpack/20180521/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88 on white, the behavior of tests differs depending on the setting of Tpetra_ASSUME_CUDA_AWARE_MPI.
With Tpetra_ASSUME_CUDA_AWARE_MPI=ON, there are segfaults in MPI_Send called from Tpetra::Distributor::doPosts.
With Tpetra_ASSUME_CUDA_AWARE_MPI=OFF, the tests run fine. (Note that ATDM testing uses this setting.)
I don't know whether this problem is caused by OpenMPI 3 (see #3356), by a problem in Tpetra, or by something that I'm doing wrong in my configure/build. @crtrott wrote the Tpetra test script that I am using (see info below), and it worked without errors in March with the now-obsolete devpack/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44.
I've had difficulty getting Trilinos to configure and build with devpacks using OpenMPI 2. I will continue to try to get an OpenMPI 2 version to work. I welcome an OpenMPI 2 build script that works on white; please share if you have one. Thanks.
Motivation and Context
We need Tpetra to work for Zoltan2 testing. The segfaults from the Tpetra::Distributor's MPI_Send occur in Zoltan2 testing as well; setting Tpetra_ASSUME_CUDA_AWARE_MPI=OFF allows Zoltan2 tests to run.
Steps to Reproduce
I am running the Tpetra test script described at https://github.com/trilinos/Trilinos/wiki/Tpetra-test-script. I changed the devpack in the script to devpack/20180521/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88, because the original devpack in the script (devpack/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44) is no longer available on white. I compared tests with -DTpetra_ASSUME_CUDA_AWARE_MPI=ON and OFF.
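The two configurations compared above differ only in one CMake cache option. As a minimal sketch (the helper name `cuda_aware_flag` is hypothetical and is not part of the Tpetra test script), the snippet below prints that option for each setting. It is a dry run and does not invoke cmake, because the remaining configure options live in the wiki script and are not reproduced here.

```shell
#!/bin/sh
# Hypothetical helper (not from the Tpetra test script): print the CMake
# cache option that selects the CUDA-aware-MPI assumption in Tpetra.
cuda_aware_flag() {
  case "$1" in
    ON|OFF) printf '%s\n' "-DTpetra_ASSUME_CUDA_AWARE_MPI=$1" ;;
    *) echo "usage: cuda_aware_flag ON|OFF" >&2; return 1 ;;
  esac
}

# Dry run: show the single option that distinguishes the two builds.
for setting in ON OFF; do
  echo "configure with: $(cuda_aware_flag "$setting")"
done
```

In an actual run, the printed option would be appended to the script's existing cmake invocation, with one build directory per setting so the two sets of test results can be compared side by side.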
On white, in my home directory, Trilinos/Obj_white/Test_white_2018_09_06_22.34.48 has tests with Tpetra_ASSUME_CUDA_AWARE_MPI=OFF, and Trilinos/Obj_white/Test_white_2018_09_06_15.35.24 has tests with Tpetra_ASSUME_CUDA_AWARE_MPI=ON.
In each directory, the usual Testing/Temporary/LastTest.log shows the segfaults for the case with Tpetra_ASSUME_CUDA_AWARE_MPI=ON and the passing tests for the case with it OFF.