Failing TeuchosCore tests in ATDM cuda 9.2 builds
Created by: fryeguy52
CC: @trilinos/teuchos, @jwillenbring (Trilinos Framework Product Lead), @bartlettroscoe
Next Action Status
Caused by upgraded SLURM/MPI correctly implementing --mca orte_abort_on_non_zero_status 0 for srun on white/ride. PR #3292, merged on 8/13/2018, removes --mca orte_abort_on_non_zero_status 0, and the tests passed on 8/14/2018.
Description
As shown in this query, the tests:
- TeuchosCore_testTeuchosTestForTermination
- TeuchosCore_testTeuchosTestForTermination_0_MPI_4
- TeuchosCore_testTeuchosTestForTermination_1_MPI_4
- TeuchosCore_testTeuchosTestForTermination_2_MPI_4
- TeuchosCore_testTeuchosTestForTermination_3_MPI_4
are failing in the builds:
- Trilinos-atdm-white-ride-cuda-9.2-opt
- Trilinos-atdm-white-ride-cuda-9.2-debug
All of the failures are timeouts, and these tests have never passed in these builds. On the other ATDM builds, the longest that any of these tests takes is about 30 seconds.
Steps to Reproduce
One should be able to reproduce this failure on the machine as described in:
More specifically, the commands given for the system are provided at:
The exact commands to reproduce this issue should be:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-opt
$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Teuchos=ON \
  $TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
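Since only the TeuchosCore_testTeuchosTestForTermination tests are affected, it may be quicker to rerun just those tests rather than the full suite. The sketch below assembles such a command and prints it for inspection before it is submitted. The variable names are hypothetical, and the 60-second timeout is an assumption chosen as roughly twice the 30 seconds these tests take on the other ATDM builds. It relies only on the standard ctest options -R, --timeout and --output-on-failure; note that --timeout sets the default limit and may not override a timeout set on an individual test.

```shell
# Regex matching the serial test and the four MPI variants listed above.
TEST_REGEX='TeuchosCore_testTeuchosTestForTermination'

# Assumed limit in seconds (not from the original report).
TIMEOUT_SECS=60

# ctest invocation restricted to the failing tests, showing output on failure.
CTEST_CMD="ctest -R ${TEST_REGEX} --timeout ${TIMEOUT_SECS} --output-on-failure"

# Same bsub options as the full reproduction command above.
FULL_CMD="bsub -x -Is -q rhel7F -n 16 ${CTEST_CMD}"

# Print the command so it can be reviewed, then copied and run from the build directory.
echo "${FULL_CMD}"
```

Running the printed command from the build directory should exercise only the five tests named in the description.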