Test Stratimikos_test_aztecoo_thyra_driver_MPI_1 timing out in Trilinos-atdm-hansen-shiller-gnu-debug-serial build since 5/30/2018
Created by: bartlettroscoe
CC: @trilinos/stratimikos, @fryeguy52
Next Action Status
Test was disabled for these two builds on 'hansen' in commit 73ae19cf pushed on 6/12/2018 and this test disappeared in these builds on 6/13/2018.
Description
As shown in this query, the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 has been timing out in the builds Trilinos-atdm-hansen-shiller-gnu-debug-serial and Trilinos-atdm-hansen-shiller-gnu-opt-serial since 5/30/2018. (That query also shows this is the only Stratimikos test that has failed in any of the promoted "ATDM" builds since 5/20/2018.)
This query shows that the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 went from passing in under 21s every day to timing out at 10 minutes every day since 5/29/2018 (it did pass once, taking 9m 56s 930ms on 6/8/2018, the only time it has not timed out since 5/29/2018).
What changed from 5/29/2018 to 5/30/2018? Looking at the updates pulled in for the build Trilinos-atdm-hansen-shiller-gnu-debug-serial with build stamp 20180530-0400-ATDM shown at:
it seems like the only commits that could have impacted this were:
c9ccf7d: Switch from srun to salloc on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:35:16 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/README.md
c840658: Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:12:42 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/shiller/environment.sh
There are no other commits that I could see that could impact this AztecOO test. So it looks like moving to CMake/CTest 3.11.2 and to the all-at-once approach triggered this large increase in runtime for the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 for the build Trilinos-atdm-hansen-shiller-gnu-debug-serial. This may have been a result of having more tests running while this Stratimikos test is running.
Looking at this query, we can see that the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 timed out in the build Trilinos-atdm-hansen-shiller-gnu-debug-serial yesterday, 6/10/2018, and that it took upwards of 2.5 to 3.5 minutes to run in the CUDA builds. Otherwise, this test did not take any longer than 22s to run in any of the other ATDM builds of Trilinos. What is also interesting is that the query shows this test passed in 4s 460ms for the build Trilinos-atdm-hansen-shiller-intel-debug-serial, which also runs on 'hansen'. How can the same test pass in an intel-debug-serial build in under 5 seconds but then time out at 10 minutes in a gnu-debug-serial build on the same hardware with the same MPI implementation and settings?
For that matter, this query shows that, other than the CUDA builds of Trilinos and the yet-to-be-cleaned-up 'mutrino' build Trilinos-atdm-mutrino-intel-debug-openmp, this test did not take any longer than 22s to run in any of the 46 Trilinos builds where this test ran yesterday. On some platforms, this test completed in less than 2s!
This is very strange behavior for a test. There must be some type of machine or system usage issue going on here. But why would it impact a gnu-debug-serial build but not an intel-debug-serial build on the same machine?
Steps to reproduce
Following the instructions at:
one can log on to 'hansen' or 'shiller', clone Trilinos and get on to the 'develop' branch, and then do:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-openmp
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stratimikos=ON \
$TRILINOS_DIR
$ make NP=16
$ salloc ctest -j16
I did this on 'shiller' but unfortunately all of the Stratimikos tests passed:
100% tests passed, 0 tests failed out of 40
Subproject Time Summary:
Stratimikos = 256.50 sec*proc (40 tests)
Total Test time (real) = 20.84 sec
I was therefore not able to reproduce this behavior on 'shiller', which again points to some type of system issue.
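One way to probe the suspicion that other tests running at the same time are what slow this test down would be to time it alone and then alongside the rest of the Stratimikos tests. The following is only a minimal sketch: the helper elapsed_seconds is hypothetical (it is not part of Trilinos or the ATDM scripts), and the commented usage assumes the build directory and salloc setup from the steps above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Trilinos): print the whole number of
# wall-clock seconds that a command takes, discarding the command's output.
elapsed_seconds() {
  local start end
  start=$(date +%s)
  # Ignore the exit status so that a failing or timed-out run is still timed.
  "$@" >/dev/null 2>&1 || true
  end=$(date +%s)
  echo $(( end - start ))
}

# Sketch of use from the build directory on 'hansen' (assumed, not verified):
#   # The test by itself, with nothing else running:
#   elapsed_seconds salloc ctest -j1 -R Stratimikos_test_aztecoo_thyra_driver_MPI_1
#   # All of the Stratimikos tests at the same parallel level as above:
#   elapsed_seconds salloc ctest -j16 -R Stratimikos
```

If the isolated run finishes in seconds while the -j16 run approaches the 10 minute limit, that would support the concurrent-load explanation. If both are slow, the cause is more likely specific to the gnu-debug-serial environment on that machine.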