Created by: bartlettroscoe
CC: @trilinos/teuchos, @trilinos/kokkos, @fryeguy52
Description
Removes --mca orte_abort_on_non_zero_status 0 from the srun command (i.e. a one-line change).
Motivation and Context
This srun option was causing tests that call abort() on purpose to not terminate when abort() is called. The option was set on the recommendation of the Test Bed team to address the bsub crashes, but it did not seem to help with that (see TrIL-198). It seems that the old MPI in that env ignored the option but the new MPI in this env does not :-)
This option was causing some Teuchos tests (see #3287 (closed)) and Kokkos tests (see #3291 (closed)) that call abort() on purpose to hang instead of terminating the MPI job.
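For readers less familiar with this failure mode, here is a minimal sketch of why a test that calls abort() on purpose shows up as a timeout when the job is kept alive. It is not the Teuchos or Kokkos test code and does not involve MPI at all. The helper name run_abort_test, the two example commands, and the timeout values are hypothetical, and the SIGABRT check assumes a POSIX system.

```python
import signal
import subprocess
import sys


def run_abort_test(cmd, timeout_sec=5.0):
    """Run cmd and classify how it ended: 'aborted', 'timeout', or 'exited'.

    Hypothetical helper for illustration only. On POSIX, subprocess reports
    a child killed by a signal as a negative return code.
    """
    try:
        proc = subprocess.run(
            cmd,
            timeout=timeout_sec,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # The child never terminated on its own; this is what the test
        # harness sees when the job hangs.
        return "timeout"
    if proc.returncode == -signal.SIGABRT:
        # The child died from abort(), which is the outcome the test expects.
        return "aborted"
    return "exited"


# Stand-in for a test that calls abort() on purpose and actually terminates.
ABORTING_CMD = [sys.executable, "-c", "import os; os.abort()"]

# Stand-in for the same test when the job is kept alive after abort().
HANGING_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]
```

Calling run_abort_test(ABORTING_CMD) corresponds to the behavior after this change, while run_abort_test(HANGING_CMD, timeout_sec=1.0) mimics the hang that the test harness reported as a timeout.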
How Has This Been Tested?
Before this change I tested the current 'develop' branch of Trilinos on 'white' with:
$ bsub -x -Is -q rhel7F -n 16 \
./checkin-test-atdm.sh cuda-9.2-opt-Power8-Kepler37 --enable-packages=Kokkos,Teuchos --local-do-all
and it reproduced the failing tests described in #3287 (closed) and #3297 as:
96% tests passed, 6 tests failed out of 156
Subproject Time Summary:
Kokkos = 860.38 sec*proc (27 tests)
Teuchos = 10285.22 sec*proc (129 tests)
Total Test time (real) = 1809.41 sec
The following tests FAILED:
6 - KokkosCore_UnitTest_PushFinalizeHook_terminate (Timeout)
49 - TeuchosCore_testTeuchosTestForTermination (Timeout)
50 - TeuchosCore_testTeuchosTestForTermination_0_MPI_4 (Timeout)
51 - TeuchosCore_testTeuchosTestForTermination_1_MPI_4 (Timeout)
52 - TeuchosCore_testTeuchosTestForTermination_2_MPI_4 (Timeout)
53 - TeuchosCore_testTeuchosTestForTermination_3_MPI_4 (Timeout)
Errors while running CTest
After the commit, I ran it again and this time it produced:
100% tests passed, 0 tests failed out of 156
Subproject Time Summary:
Kokkos = 287.70 sec*proc (27 tests)
Teuchos = 149.60 sec*proc (129 tests)
Total Test time (real) = 125.07 sec
Checklist
- My commit messages mention the appropriate GitHub issue numbers.