TpetraCore_Issue1454_MPI_4 test failing on new cuda 9.2 ATDM build on white/ride
Created by: fryeguy52
CC: @trilinos/tpetra, @kddevin (Trilinos Data Services Product Lead), @bartlettroscoe
Next Action Status
PR #3318 was merged on 8/18/2018, and this test started passing in these cuda-9.2
builds on 8/19/2018.
Description
As shown in this query, the test:
- TpetraCore_Issue1454_MPI_4
is failing in the builds:
- Trilinos-atdm-white-ride-cuda-9.2-opt
- Trilinos-atdm-white-ride-cuda-9.2-debug
The test times out after 10 minutes, whereas it takes at most 30 seconds in other builds of the ATDM configuration.
Some of the test output:
[white27:41375] mca_base_component_repository_open: unable to open mca_coll_hcoll: libsharp_coll.so.2: cannot open shared object file: No such file or directory (ignored)
[white27:41376] mca_base_component_repository_open: unable to open mca_coll_hcoll: libsharp_coll.so.2: cannot open shared object file: No such file or directory (ignored)
[white27:41378] mca_base_component_repository_open: unable to open mca_coll_hcoll: libsharp_coll.so.2: cannot open shared object file: No such file or directory (ignored)
[white27:41377] mca_base_component_repository_open: unable to open mca_coll_hcoll: libsharp_coll.so.2: cannot open shared object file: No such file or directory (ignored)
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white27 and rank 0!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white27 and rank 1!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white27 and rank 2!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white27 and rank 3!
***
*** Unit test suite ...
***
Sorting tests by group name then by the order they were added ... (time = 2.18e-07)
Running unit tests ...
0. Distributor_Issue1454_UnitTest ... [white27:41378:0] Caught signal 11 (Segmentation fault)
[white27:41375:0] Caught signal 11 (Segmentation fault)
[white27:41376:0] Caught signal 11 (Segmentation fault)
[white27:41377:0] Caught signal 11 (Segmentation fault)
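To confirm from a saved log that every MPI rank hit the segfault (rather than only one rank crashing while the others hang), one can count the signal lines. This is only a rough sketch: the temporary log file is a stand-in for wherever the test output was actually saved, and the sample lines are copied from the output above.

```shell
# Stand-in for a saved copy of the test output (hypothetical file location).
log=$(mktemp)
cat > "$log" <<'EOF'
0. Distributor_Issue1454_UnitTest ... [white27:41378:0] Caught signal 11 (Segmentation fault)
[white27:41375:0] Caught signal 11 (Segmentation fault)
[white27:41376:0] Caught signal 11 (Segmentation fault)
[white27:41377:0] Caught signal 11 (Segmentation fault)
EOF

# grep -c prints the number of matching lines; each rank reports on its own line.
segv_count=$(grep -c 'Caught signal 11' "$log")
echo "ranks reporting a segfault: $segv_count"
rm -f "$log"
```

For this 4-rank test, a count equal to the number of ranks suggests the crash happens on all processes inside Distributor_Issue1454_UnitTest.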
Steps to Reproduce
One should be able to reproduce this failure on the machine white
as described in:
More specifically, the commands given for the system white
are provided at:
The exact commands to reproduce this issue should be:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-opt
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Tpetra=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
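Since only one test is failing, it may be quicker to rerun just that test once the build has finished. The sketch below assumes the same queue and `bsub` options as the command above and uses the standard `ctest` options `-R` (select tests by regex) and `--output-on-failure`; the variable names are only illustrative.

```shell
# Run from the build directory after the configure and build steps above.
test_name="TpetraCore_Issue1454_MPI_4"

# Anchor the regex with ^ and $ so that only this exact test is selected.
rerun_cmd="bsub -x -Is -q rhel7F -n 16 ctest -R ^${test_name}\$ --output-on-failure"

if command -v bsub >/dev/null 2>&1; then
  # On a machine with LSF (such as white), submit the single-test run.
  $rerun_cmd
else
  # Elsewhere, just show the command that would be run.
  echo "bsub not found; on white, run: $rerun_cmd"
fi
```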