Test Stratimikos_test_aztecoo_thyra_driver_MPI_1 timing out in Trilinos-atdm-hansen-shiller-gnu-debug-serial build since 5/30/2018
Created by: bartlettroscoe
CC: @trilinos/stratimikos, @fryeguy52
Next Action Status
Test was disabled for these two builds on 'hansen' in commit 73ae19cf pushed on 6/12/2018 and this test disappeared in these builds on 6/13/2018.
As shown in this query, the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 has been timing out in the builds Trilinos-atdm-hansen-shiller-gnu-opt-serial since 5/30/2018. (That query also shows this is the only Stratimikos test that has failed in any of the promoted "ATDM" builds since 5/20/2018.)
This query shows that the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 went from passing in under 21s every day to timing out at 10 minutes every day since 5/29/2018 (it did pass once, taking 9m 56s 930ms on 6/8/2018, the only time it did not time out since 5/29/2018).
What changed from 5/29/2018 to 5/30/2018? Looking at the updates pulled in for the build Trilinos-atdm-hansen-shiller-gnu-debug-serial with build stamp 20180530-0400-ATDM shown at:
it seems like the only commits that could have impacted this were:
c9ccf7d: Switch from srun to salloc on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <email@example.com>
Date: Tue May 29 08:35:16 2018 -0600

M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/README.md

c840658: Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <firstname.lastname@example.org>
Date: Tue May 29 08:12:42 2018 -0600

M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/shiller/environment.sh
There are no other commits that I could see that could impact this AztecOO test. So it looks like moving to CMake/CTest 3.11.2 and to the all-at-once approach triggered this large increase in runtime for the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 for the build Trilinos-atdm-hansen-shiller-gnu-debug-serial. This may have been a result of having more tests running while this Stratimikos test is running.
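One way to probe the contention hypothesis is to time the test by itself and then as part of the full parallel run. The following is only a sketch: the helper names run_isolated and run_loaded are hypothetical, the 600 second timeout mirrors the 10 minute limit mentioned above, and it assumes the current directory is an already configured and built Trilinos build directory.

```shell
# Sketch: compare the runtime of one test run alone vs. under full load.
# Assumption: the current directory is a configured and built build dir.
TEST_NAME="Stratimikos_test_aztecoo_thyra_driver_MPI_1"

# Run only the named test, one at a time, so no other test competes with it.
run_isolated() {
  ctest -j1 -R "^${TEST_NAME}\$" --timeout 600 --output-on-failure
}

# Run every enabled test at the given parallel level (default 16).
run_loaded() {
  ctest -j"${1:-16}" --timeout 600
}
```

If the isolated run finishes in seconds while the loaded run times out, that would point at contention between concurrently running tests rather than at the test itself. On 'hansen' or 'shiller' these commands would presumably need to run inside a salloc allocation.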
Looking in this query, we can see that the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 timed out in the build Trilinos-atdm-hansen-shiller-gnu-debug-serial yesterday (6/10/2018), but it took upwards of 2.5 to 3.5 minutes to run in the CUDA builds. Otherwise, this test did not take any longer than 22s to run in all of the other ATDM builds of Trilinos. What is also interesting is that the query showed that this test passed in 4s 460ms for the build Trilinos-atdm-hansen-shiller-intel-debug-serial, also run on 'hansen'. How can the same test pass on an intel-debug-serial build in under 5 seconds but then time out at 10 minutes for a gnu-debug-serial build on the same hardware with the same MPI implementation and settings?
For that matter, this query shows that other than the CUDA builds of Trilinos and the yet-to-be-cleaned-up 'mutrino' build Trilinos-atdm-mutrino-intel-debug-openmp, this test did not take any longer than 22s to run in any of the 46 Trilinos builds where this test ran yesterday. On some platforms, this test completed in less than 2s!
This is very strange behavior for a test. There must be some type of machine or system usage issue going on here. But why would it impact a gnu-debug-serial build but not an intel-debug-serial build on the same machine?
Steps to reproduce
Following the instructions at:
one can log on to 'hansen' or 'shiller', clone Trilinos and get on to the 'develop' branch, and then do:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-openmp
$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stratimikos=ON \
  $TRILINOS_DIR
$ make NP=16
$ salloc ctest -j16
I did this on 'shiller' but unfortunately all of the Stratimikos tests passed:
100% tests passed, 0 tests failed out of 40

Subproject Time Summary:
Stratimikos = 256.50 sec*proc (40 tests)

Total Test time (real) = 20.84 sec
Therefore, I was not able to reproduce this behavior on 'shiller', which suggests that this is some type of system issue.
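Since the test did pass once on 6/8/2018, a single clean run may not say much. A possible follow-up, sketched below, is to rerun only this test several times in the same build directory. The helper name repeat_test and the default of 10 repetitions are hypothetical. Note also that the reproduction above loaded the intel-opt-openmp environment, while the timeouts were reported for the gnu-debug-serial build, so it may be worth repeating the steps with that build name passed to load-env.sh (an assumption about how the script is meant to be used, not something verified here).

```shell
# Sketch: rerun only the timing-out test several times in a row.
# Assumption: the current directory is a configured and built build dir.
TEST_NAME="Stratimikos_test_aztecoo_thyra_driver_MPI_1"

# Repeat the test up to N times (default 10), stopping at the first failure.
repeat_test() {
  ctest -R "^${TEST_NAME}\$" --repeat-until-fail "${1:-10}" \
    --timeout 600 --output-on-failure
}
```

On 'hansen' or 'shiller' this would presumably be launched under salloc, as in the steps above.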