Issue #2925

Created Jun 11, 2018 by James Willenbring (@jmwille, Maintainer)

Test Stratimikos_test_aztecoo_thyra_driver_MPI_1 timing out in Trilinos-atdm-hansen-shiller-gnu-debug-serial build since 5/30/2018

Created by: bartlettroscoe

CC: @trilinos/stratimikos, @fryeguy52

Next Action Status

Test was disabled for these two builds on 'hansen' in commit 73ae19cf pushed on 6/12/2018 and this test disappeared in these builds on 6/13/2018.

Description

As shown in this query, the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 has been timing out in the builds Trilinos-atdm-hansen-shiller-gnu-debug-serial and Trilinos-atdm-hansen-shiller-gnu-opt-serial since 5/30/2018. (That query also shows this is the only Stratimikos test that has failed in any of the promoted "ATDM" builds since 5/20/2018.)

This query shows that the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 went from passing in under 21s every day to timing out at 10 minutes every day since 5/29/2018 (it did pass once, taking 9m 56s 930ms on 6/8/2018, the only time it has not timed out since 5/29/2018).
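To see how close that one passing run on 6/8/2018 came to the limit, it helps to put the reported durations in seconds. The snippet below is only a sketch: `to_seconds` is a hypothetical helper (not part of Trilinos or CDash), and it assumes durations are written the way they are quoted in this issue (`9m 56s 930ms`, `4s 460ms`).

```shell
# Hypothetical helper: convert a duration such as "9m 56s 930ms" to seconds.
# Assumes whitespace-separated fields ending in "m", "s" or "ms".
to_seconds() {
  printf '%s\n' "$1" | awk '{
    total = 0
    for (i = 1; i <= NF; i++) {
      # Check "ms" before "m" and "s" so that milliseconds are not misread.
      if ($i ~ /^[0-9]+ms$/)     { sub(/ms$/, "", $i); total += $i / 1000 }
      else if ($i ~ /^[0-9]+m$/) { sub(/m$/, "", $i);  total += $i * 60 }
      else if ($i ~ /^[0-9]+s$/) { sub(/s$/, "", $i);  total += $i }
    }
    printf "%.3f\n", total
  }'
}

# Usage:
#   to_seconds "9m 56s 930ms"   # prints 596.930
#   to_seconds "4s 460ms"       # prints 4.460
```

At 596.930s, the passing run finished only about 3s under the 10-minute (600s) timeout, so it reflects the same slowdown rather than a return to the earlier runtime of under 21s.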

What changed from 5/29/2018 to 5/30/2018? Looking at the updates pulled in for the build Trilinos-atdm-hansen-shiller-gnu-debug-serial with build stamp 20180530-0400-ATDM shown at:

  • https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3558860#!#note0

it seems like the only commits that could have impacted this were:

c9ccf7d:  Switch from srun to salloc on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Tue May 29 08:35:16 2018 -0600

M	cmake/ctest/drivers/atdm/shiller/local-driver.sh
M	cmake/std/atdm/README.md

c840658:  Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Tue May 29 08:12:42 2018 -0600

M	cmake/ctest/drivers/atdm/shiller/local-driver.sh
M	cmake/std/atdm/shiller/environment.sh

There are no other commits that I could see that could impact this AztecOO test. So it looks like moving to CMake/CTest 3.11.2 and to the all-at-once approach triggered this large increase in runtime for the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 for the build Trilinos-atdm-hansen-shiller-gnu-debug-serial. This may have been a result of having more tests running while this Stratimikos test is running.
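One way to probe the idea that concurrently running tests cause the slowdown would be to run this test by itself and then as part of the full suite, from the same build directory on 'hansen'. The following is a minimal sketch under stated assumptions: `ctest_cmd` and `DRY_RUN` are hypothetical names introduced here, the `salloc` wrapper follows the reproduction steps in this issue, and `-j`, `-R` and `--timeout` are standard ctest options (`--timeout` only sets the default for tests that do not define their own timeout).

```shell
# Hypothetical helper: print (or run) ctest at a given parallel level.
# Assumes it is called from a configured Trilinos build directory.
TEST_NAME=Stratimikos_test_aztecoo_thyra_driver_MPI_1

ctest_cmd() {
  # $1 = parallel level passed to ctest -j
  # $2 = optional test-name regex for ctest -R (empty means all tests)
  jobs=$1
  regex=${2:-}
  cmd="salloc ctest -j${jobs} --timeout 600"
  if [ -n "$regex" ]; then
    cmd="$cmd -R ^${regex}\$"
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$cmd"   # dry run (the default): only show the command
  else
    $cmd                   # run it; relies on word splitting of $cmd
  fi
}

# Usage (set DRY_RUN=0 to actually execute):
#   ctest_cmd 1 "$TEST_NAME"    # the test alone, nothing else running
#   ctest_cmd 16                # the full suite, 16 tests at a time
```

If the test finishes in seconds when run alone but approaches the 600s limit under `-j16`, that would point toward contention with other tests rather than a problem in the test itself.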

Looking at this query, we can see that the test Stratimikos_test_aztecoo_thyra_driver_MPI_1 timed out in the build Trilinos-atdm-hansen-shiller-gnu-debug-serial yesterday, 6/10/2018, and that it took 2.5 to 3.5 minutes to run in the CUDA builds. Otherwise, this test did not take any longer than 22s to run in any of the other ATDM builds of Trilinos. What is also interesting is that the query shows this test passed in 4s 460ms for the build Trilinos-atdm-hansen-shiller-intel-debug-serial, also run on 'hansen'. How can the same test pass in an intel-debug-serial build in under 5 seconds but then time out at 10 minutes in a gnu-debug-serial build on the same hardware with the same MPI implementation and settings?

For that matter, this query shows that other than the CUDA builds of Trilinos and the yet-to-be-cleaned-up 'mutrino' build Trilinos-atdm-mutrino-intel-debug-openmp, this test did not take any longer than 22s to run in any of the 46 Trilinos builds where this test ran yesterday. On some platforms, this test completed in less than 2s!

This is very strange behavior for a test. There must be some type of machine or system usage issue going on here. But why would it impact a gnu-debug-serial build but not an intel-debug-serial build on the same machine?

Steps to reproduce

Following the instructions at:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#shillerhansen

one can log on to 'hansen' or 'shiller', clone Trilinos and get on to the 'develop' branch, and then do:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-openmp

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stratimikos=ON \
  $TRILINOS_DIR

$ make NP=16

$ salloc ctest -j16

I did this on 'shiller' but unfortunately all of the Stratimikos tests passed:

  100% tests passed, 0 tests failed out of 40
  
  Subproject Time Summary:
  Stratimikos    = 256.50 sec*proc (40 tests)
  
  Total Test time (real) =  20.84 sec

Therefore, I was not able to reproduce this behavior on 'shiller', so this must be some type of system issue.
