Remove --mca orte_abort_on_non_zero_status 0 from srun on white/ride (#3287, #3291)

Created by: bartlettroscoe

CC: @trilinos/teuchos, @trilinos/kokkos, @fryeguy52

Description

Removes --mca orte_abort_on_non_zero_status 0 from the srun command (a one-line change).

Motivation and Context

This srun option prevented tests that call abort() on purpose from terminating when abort() was called. The option was set on the recommendation of the Test Bed team to address the bsub crashes, but it did not seem to help with that (see TrIL-198). It seems that the old MPI in that env ignored the option, but the new MPI in this env does not :-)

This option caused some Teuchos tests (see #3287 (closed)) and Kokkos tests (see #3291 (closed)) that call abort() on purpose to hang instead of terminating the MPI job.
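
To illustrate the failure mode, here is a minimal sketch of how a harness can tell "terminated on abort()" apart from "hung until the timeout". It is not the actual Teuchos or Kokkos test code, and it does not involve MPI or srun; the function name classify_run and the stand-in child commands are hypothetical and exist only for this example.

```python
import subprocess
import sys


def classify_run(cmd, timeout_sec):
    """Run cmd and report whether it exited cleanly, terminated abnormally, or hung."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_sec,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return "hang"
    # On POSIX a negative return code means the child was killed by a signal
    # (for example SIGABRT); any non-zero code counts as termination here.
    return "exited_ok" if result.returncode == 0 else "terminated"


if __name__ == "__main__":
    # Stand-in for a test that aborts on purpose (assumption: not a real Trilinos test).
    aborting = [sys.executable, "-c", "import os; os.abort()"]
    # Stand-in for the hang that CTest reported as (Timeout).
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(classify_run(aborting, 10))
    print(classify_run(hanging, 1))
```

With the option in place, the aborting tests behaved like the second case: the process never exited, so the only signal the harness got was the timeout.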

How Has This Been Tested?

Before this change I tested the current 'develop' branch of Trilinos on 'white' with:

$ bsub -x -Is -q rhel7F -n 16 \
  ./checkin-test-atdm.sh cuda-9.2-opt-Power8-Kepler37 --enable-packages=Kokkos,Teuchos --local-do-all

and it reproduced the failing tests described in #3287 (closed) and #3291 (closed) as:

  96% tests passed, 6 tests failed out of 156

  Subproject Time Summary:
  Kokkos     = 860.38 sec*proc (27 tests)
  Teuchos    = 10285.22 sec*proc (129 tests)

  Total Test time (real) = 1809.41 sec

  The following tests FAILED:
          6 - KokkosCore_UnitTest_PushFinalizeHook_terminate (Timeout)
         49 - TeuchosCore_testTeuchosTestForTermination (Timeout)
         50 - TeuchosCore_testTeuchosTestForTermination_0_MPI_4 (Timeout)
         51 - TeuchosCore_testTeuchosTestForTermination_1_MPI_4 (Timeout)
         52 - TeuchosCore_testTeuchosTestForTermination_2_MPI_4 (Timeout)
         53 - TeuchosCore_testTeuchosTestForTermination_3_MPI_4 (Timeout)
  Errors while running CTest

After the commit, I ran the same command again, and this time it produced:

  100% tests passed, 0 tests failed out of 156
  
  Subproject Time Summary:
  Kokkos     = 287.70 sec*proc (27 tests)
  Teuchos    = 149.60 sec*proc (129 tests)
  
  Total Test time (real) = 125.07 sec

Checklist

  • My commit messages mention the appropriate GitHub issue numbers.
