Tpetra_ASSUME_CUDA_AWARE_MPI affecting test performance on white
Created by: kddevin
@trilinos/tpetra @trilinos/zoltan2 @ndellingwood @kyungjoo-kim @nmhamster @crtrott
Next Action Status
PR #3500, merged to 'develop' on 12/6/2018, enables Tpetra_ASSUME_CUDA_AWARE_MPI=ON by default for ATDM (and all) CUDA Trilinos builds.
Next, watch the ATDM Trilinos builds and the EMPIRE builds over the next few days to see what happens ...
Description
Using devpack/20180521/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88 on white, the behavior of tests differs depending on the setting of Tpetra_ASSUME_CUDA_AWARE_MPI.
With Tpetra_ASSUME_CUDA_AWARE_MPI=ON, there are segfaults in MPI_Send called from Tpetra::Distributor::doPosts.
With Tpetra_ASSUME_CUDA_AWARE_MPI=OFF, the tests run fine. (Note that ATDM testing uses this setting.)
I don't know whether this problem is caused by OpenMPI 3 (see #3356), by a problem with Tpetra, or by something that I'm doing wrong with my configure/build. @crtrott wrote the Tpetra test script that I am using (see info below), and it worked without errors in March with the now-obsolete devpack/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44.
I've had difficulty getting Trilinos to configure and build with devpacks using OpenMPI 2. I will continue to try to get an OpenMPI 2 version to work. I welcome an OpenMPI 2 build script that works on white; please share if you have one. Thanks.
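One way to narrow down whether the MPI installation itself is involved is to ask Open MPI whether it was built with CUDA support. The sketch below is not something that was run for this report. It assumes the loaded devpack puts `ompi_info` on the PATH and that the build exposes the `mpi_built_with_cuda_support` parameter; the helper name `cuda_aware_from_ompi_info` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `ompi_info --parsable --all` output on stdin
# and prints the value of mpi_built_with_cuda_support ("true" or "false"),
# or "unknown" if the parameter is not listed.
cuda_aware_from_ompi_info() {
    value=$(grep 'mpi_built_with_cuda_support:value:' | head -n 1 | sed 's/.*:value://')
    if [ -n "$value" ]; then
        echo "$value"
    else
        echo "unknown"
    fi
}

# Query the real installation only when ompi_info is available.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --parsable --all | cuda_aware_from_ompi_info
fi
```

A result of `false` or `unknown` would suggest that Tpetra_ASSUME_CUDA_AWARE_MPI=ON is not safe with that MPI build. A result of `true` only reports how Open MPI was compiled; it does not rule out a runtime problem in the MPI library or in Tpetra.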
Motivation and Context
We need Tpetra to work for Zoltan2 testing. The segfaults from the Tpetra::Distributor's MPI_Send occur in Zoltan2 testing as well; setting Tpetra_ASSUME_CUDA_AWARE_MPI=OFF allows Zoltan2 tests to run.
Steps to Reproduce
I am running the Tpetra test script described at https://github.com/trilinos/Trilinos/wiki/Tpetra-test-script. I changed the devpack in the script to devpack/20180521/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88, because the original devpack in the script (devpack/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44) is no longer available on white. I compared tests with -DTpetra_ASSUME_CUDA_AWARE_MPI=ON and OFF.
On white, in my home directory, Trilinos/Obj_white/Test_white_2018_09_06_22.34.48 has tests with Tpetra_ASSUME_CUDA_AWARE_MPI=OFF, and Trilinos/Obj_white/Test_white_2018_09_06_15.35.24 has tests with Tpetra_ASSUME_CUDA_AWARE_MPI=ON.
The usual Testing/Temporary/LastTest.log shows the segfaults for the case with Tpetra_ASSUME_CUDA_AWARE_MPI=ON, and the passing tests with it OFF.
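To compare the two runs quickly, the logs can be scanned for segfault messages. This is a rough sketch with two assumptions: the crash text in the log contains "segmentation fault" or "segfault" (the exact wording depends on the MPI and CTest versions), and each test directory holds its log at Testing/Temporary/LastTest.log. The helper name `count_segfault_lines` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints how many lines of the given log mention a
# segfault (case-insensitive). Prints 0 when there are no matches.
count_segfault_lines() {
    grep -i -c -E 'segmentation fault|segfault' "$1" || true
}

# Assumed layout, run from the home directory on white:
# <test dir>/Testing/Temporary/LastTest.log
for dir in Test_white_2018_09_06_15.35.24 Test_white_2018_09_06_22.34.48; do
    log="Trilinos/Obj_white/$dir/Testing/Temporary/LastTest.log"
    if [ -f "$log" ]; then
        echo "$dir: $(count_segfault_lines "$log") segfault line(s)"
    fi
done
```

If the pattern matches the actual log text, the 15.35.24 directory (ON) should report a nonzero count and the 22.34.48 directory (OFF) should report zero.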