Tpetra: Possible OpenMPI bug; don't enable CUDA-aware MPI support
Created by: mhoemmen
@trilinos/tpetra @trilinos/framework
When debugging #3500, @rppawlo found a possible bug in OpenMPI 2.1.2 with CUDA-aware MPI enabled. When doing a `Tpetra::Export` between two MultiVectors with the `AbsMax` CombineMode, the target MultiVector sometimes doesn't get the right result. Roger found that on unpack, the receive buffer doesn't always have the right values in it. It's not consistently reproducible -- Roger sometimes has to run the test three times to get this to happen -- but it does happen fairly often. Roger thinks that MPI might not have finished writing to the buffer on a process (Proc 0) that has no sends but some receives, even though Proc 0 should have done an `MPI_Waitall` on the receives (which are always posted with `MPI_Irecv`). The results are always correct when running with CUDA-aware MPI disabled (that's a Tpetra option, not an MPI option).
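For context, here is a minimal sketch of the kind of Export that hits this code path. This is not the failing test from #3500; the Maps, vector counts, and values below are made up for illustration (every process duplicates global index 0 in the source Map, so Proc 0 is a receive-heavy target, loosely like the situation Roger describes):

```c++
#include <Tpetra_Core.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_MultiVector.hpp>
#include <Tpetra_Export.hpp>
#include <Teuchos_Array.hpp>
#include <Teuchos_OrdinalTraits.hpp>

int main (int argc, char* argv[]) {
  Tpetra::ScopeGuard tpetraScope (&argc, &argv);
  {
    using map_type = Tpetra::Map<>;
    using mv_type  = Tpetra::MultiVector<>;
    using GO = map_type::global_ordinal_type;
    auto comm = Tpetra::getDefaultComm ();
    const int myRank = comm->getRank ();
    const int numProcs = comm->getSize ();

    // Target Map: one-to-one, one entry per process.
    const Tpetra::global_size_t numGlobal = numProcs;
    auto targetMap = Teuchos::rcp (new map_type (numGlobal, 0, comm));

    // Source Map: overlapping -- every process also "owns" global index 0,
    // so Proc 0 has no sends but receives from every other process.
    Teuchos::Array<GO> myGids;
    myGids.push_back (static_cast<GO> (myRank));
    if (myRank != 0) {
      myGids.push_back (static_cast<GO> (0));
    }
    const auto INV = Teuchos::OrdinalTraits<Tpetra::global_size_t>::invalid ();
    auto sourceMap = Teuchos::rcp (new map_type (INV, myGids (), 0, comm));

    mv_type X_src (sourceMap, 1);
    mv_type X_tgt (targetMap, 1);
    X_src.putScalar (-2.0); // AbsMax should make every target entry 2.0
    X_tgt.putScalar (1.0);

    Tpetra::Export<> exporter (sourceMap, targetMap);
    // The symptom: with CUDA-aware MPI enabled, unpack on the receiving
    // process sometimes sees stale receive-buffer contents, so the AbsMax
    // result in X_tgt is intermittently wrong.
    X_tgt.doExport (X_src, exporter, Tpetra::ABSMAX);
  }
  return 0;
}
```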
Next steps:

- Make sure this isn't an issue with `Kokkos::atomic_assign` in `AbsMax`
- Test with different OpenMPI versions
- Write a short reproducer, to make sure this isn't a Tpetra issue (start perhaps with @crtrott 's CG solve; see the sketch after this list)
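A hypothetical starting point for such a reproducer, independent of Tpetra: a receive-only rank posts `MPI_Irecv` directly into device memory (which CUDA-aware MPI permits), waits with `MPI_Waitall`, and checks whether the buffer ever comes back stale. The buffer sizes and message pattern below are invented; a real reproducer would want to mimic the Export's actual communication pattern.

```c++
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main (int argc, char* argv[]) {
  MPI_Init (&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  if (size < 2) {
    if (rank == 0) { std::fprintf (stderr, "Run with at least 2 ranks\n"); }
    MPI_Finalize ();
    return 1;
  }

  const int n = 1024;
  double* devBuf = nullptr;

  if (rank == 0) {
    // Receive-only process: one MPI_Irecv per sender, straight into
    // device memory, then a single MPI_Waitall -- the pattern under
    // suspicion in this issue.
    cudaMalloc (&devBuf, n * (size - 1) * sizeof (double));
    std::vector<MPI_Request> reqs (size - 1);
    for (int src = 1; src < size; ++src) {
      MPI_Irecv (devBuf + (src - 1) * n, n, MPI_DOUBLE, src, 0,
                 MPI_COMM_WORLD, &reqs[src - 1]);
    }
    MPI_Waitall (size - 1, reqs.data (), MPI_STATUSES_IGNORE);

    // If the hypothesis is right, the device buffer may intermittently
    // hold stale values here, even though MPI_Waitall has returned.
    std::vector<double> host (n * (size - 1));
    cudaMemcpy (host.data (), devBuf, host.size () * sizeof (double),
                cudaMemcpyDeviceToHost);
    int stale = 0;
    for (double x : host) { if (x != 1.0) { ++stale; } }
    std::printf ("rank 0: %d stale entries\n", stale);
  }
  else {
    // Senders: fill a device buffer with a known value and send it.
    cudaMalloc (&devBuf, n * sizeof (double));
    std::vector<double> host (n, 1.0);
    cudaMemcpy (devBuf, host.data (), n * sizeof (double),
                cudaMemcpyHostToDevice);
    MPI_Send (devBuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
  }

  cudaFree (devBuf);
  MPI_Finalize ();
  return 0;
}
```

Run it repeatedly (e.g., in a loop) under the same Open MPI build and node allocation as the failing test, since the failure is intermittent.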
We recommend disabling Tpetra's CUDA-aware MPI option for now.
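For reference, a sketch of how to check the setting at run time. If I recall correctly, Tpetra reads the environment variable `TPETRA_ASSUME_CUDA_AWARE_MPI` at start-up (so `export TPETRA_ASSUME_CUDA_AWARE_MPI=0` disables the CUDA-aware MPI code path) and exposes the result through `Tpetra::Details::Behavior`; please verify both names against your Tpetra version, since this is from memory:

```c++
#include <Tpetra_Core.hpp>
#include <Tpetra_Details_Behavior.hpp>
#include <iostream>

int main (int argc, char* argv[]) {
  Tpetra::ScopeGuard tpetraScope (&argc, &argv);
  // Reports whether Tpetra will pass device pointers directly to MPI.
  // Controlled by the TPETRA_ASSUME_CUDA_AWARE_MPI environment variable
  // (names assumed here; check your Tpetra version's documentation).
  std::cout << "Tpetra assumes CUDA-aware MPI: "
            << (Tpetra::Details::Behavior::assumeMpiIsCudaAware () ? "yes" : "no")
            << std::endl;
  return 0;
}
```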