Segfault with Spectrum MPI when TPETRA_ASSUME_CUDA_AWARE_MPI=1
Created by: jhux2
I am running the mini-app "MiniEM" in Panzer on the LLNL machine rzansel, which uses IBM Spectrum MPI. When I set export TPETRA_ASSUME_CUDA_AWARE_MPI=1, the job crashes with a segmentation fault. When this variable is not set, the job runs to completion.
The version of Spectrum MPI is 10.2.0.11rtm2. Spectrum MPI is based on OpenMPI, and there are known problems with certain versions of OpenMPI; see #3405 (closed).
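For anyone trying to reproduce this, the on/off comparison described above can be scripted. The following is only a minimal sketch: the helper name run_case is hypothetical, and the launcher and executable arguments are not stated in this report, so they have to be supplied by whoever runs it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Trilinos): run a command with
# TPETRA_ASSUME_CUDA_AWARE_MPI either set to 1 ("on") or unset ("off"),
# then report the command's exit status.
# Usage: run_case on|off command [args...]
run_case() {
    mode=$1
    shift
    if [ "$mode" = "on" ]; then
        # Subshell keeps the variable from leaking into the caller's environment.
        ( export TPETRA_ASSUME_CUDA_AWARE_MPI=1; "$@" )
    else
        ( unset TPETRA_ASSUME_CUDA_AWARE_MPI; "$@" )
    fi
    status=$?
    echo "TPETRA_ASSUME_CUDA_AWARE_MPI=$mode: exit status $status"
    return "$status"
}
```

It would be called once with "off" and once with "on", each time followed by the same launch command used for PanzerMiniEM_BlockPrec.exe. A nonzero exit status only in the "on" case would match the behavior reported here.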
Error log
[rzansel46:108415] *** Process received signal ***
[rzansel46:108415] Signal: Segmentation fault (11)
[rzansel46:108415] Signal code: Address not mapped (1)
[rzansel46:108415] Failing at address: 0x20007f210280
[rzansel46:108415] [ 0] [0x2000000504d8]
[rzansel46:108415] [ 1] [0x0]
[rzansel46:108415] [ 2] /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/pami_port/libpami.so.3(_ZN4PAMI4Fifo8WrapFifoINS0_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EE18producePacket_implINS_6Device5Shmem17PacketIovecWriterILj2EEEEEbRT_+0xb8)[0x200018db0d18]
[rzansel46:108415] [ 3] /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE1EE11simple_implEP11pami_send_t+0x498)[0x200018db1258]
[rzansel46:108415] [ 4] /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send5EagerINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEENS3_3IBV11PacketModelINSN_6DeviceELb1EEEE9EagerImplILNS1_15configuration_tE1ELb1EE6simpleEP11pami_send_t+0x2c)[0x200018db16fc]
[rzansel46:108415] [ 5] /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/pami_port/libpami.so.3(PAMI_Send+0x58)[0x200018ce1478]
[rzansel46:108415] [ 6] /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/spectrum_mpi/mca_pml_pami.so(pml_pami_send+0x628)[0x200013e3ec98]
[rzansel46:108415] [ 7] /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/spectrum_mpi/mca_pml_pami.so(mca_pml_pami_send+0x568)[0x200013e3fe58]
[rzansel46:108415] [ 8] /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so.3(PMPI_Send+0x148)[0x200011a5f918]
[rzansel46:108415] [ 9] ./PanzerMiniEM_BlockPrec.exe[0x3504ec34]
[rzansel46:108415] [10] ./PanzerMiniEM_BlockPrec.exe[0x3504eef8]
[rzansel46:108415] [11] ./PanzerMiniEM_BlockPrec.exe[0x33bff3b0]
[rzansel46:108415] [12] ./PanzerMiniEM_BlockPrec.exe[0x33c02558]
[rzansel46:108415] [13] ./PanzerMiniEM_BlockPrec.exe[0x33c048a8]
[rzansel46:108415] [14] ./PanzerMiniEM_BlockPrec.exe[0x33c86d2c]
[rzansel46:108415] [15] ./PanzerMiniEM_BlockPrec.exe[0x33c9b86c]
[rzansel46:108415] [16] ./PanzerMiniEM_BlockPrec.exe[0x33cb9314]
[rzansel46:108415] [17] ./PanzerMiniEM_BlockPrec.exe[0x180c028c]
[rzansel46:108415] [18] ./PanzerMiniEM_BlockPrec.exe[0x180fce30]
[rzansel46:108415] [19] ./PanzerMiniEM_BlockPrec.exe[0x180584e8]
[rzansel46:108415] [20] ./PanzerMiniEM_BlockPrec.exe[0x18058b34]
[rzansel46:108415] [21] ./PanzerMiniEM_BlockPrec.exe[0x17ffa384]
[rzansel46:108415] [22] ./PanzerMiniEM_BlockPrec.exe[0x14e8e80c]
[rzansel46:108415] [23] ./PanzerMiniEM_BlockPrec.exe[0x11660714]
[rzansel46:108415] [24] ./PanzerMiniEM_BlockPrec.exe[0x117ec69c]
[rzansel46:108415] [25] /lib64/libc.so.6(+0x25100)[0x200011da5100]
[rzansel46:108415] [26] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200011da52f4]
[rzansel46:108415] *** End of error message ***
ERROR: One or more process (first noticed rank 3) terminated with signal 11 (core dumped)
This issue is meant to record the problem and to find out whether there are any known fixes.
Module environment
Currently Loaded Modules:
1) cuda/9.2.148
2) StdEnv
3) xl/2018.11.26
4) spectrum-mpi/rolling-release
5) lapack/3.8.0-P9-xl-2018.08.24
6) cmake/3.12.1
7) hdf5-parallel/1.10.4
@trilinos/tpetra @nmhamster