Skip to content

GitLab

  • Menu
Projects Groups Snippets
    • Loading...
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in
  • T Trilinos
  • Project information
    • Project information
    • Activity
    • Labels
    • Planning hierarchy
    • Members
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
  • Issues 936
    • Issues 936
    • List
    • Boards
    • Service Desk
    • Milestones
  • Merge requests 22
    • Merge requests 22
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
  • Deployments
    • Deployments
    • Environments
    • Releases
  • Monitor
    • Monitor
    • Incidents
  • Analytics
    • Analytics
    • Value stream
    • CI/CD
    • Repository
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
Collapse sidebar
  • James Willenbring
  • Trilinos
  • Issues
  • #2919

Closed
Open
Created Jun 08, 2018 by James Willenbring@jmwilleMaintainer

Belos_rcg_hb_MPI_4 timing out in several ATDM Trilinos builds on 'hansen' since 5/29/2018

Created by: bartlettroscoe

CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solves Project Lead)

Next Action Status

Test was disabled in these builds on 'hansen' in the commit 8850c648 pushed on 6/12/2018 and was shown to be disabled in the builds on CDash 6/13/2018

Description

As shown in this large query the test Belos_rcg_hb_MPI_4 looks to be consistently timing out in the builds:

  • Trilinos-atdm-hansen-shiller-cuda-8.0-debug
  • Trilinos-atdm-hansen-shiller-cuda-8.0-opt
  • Trilinos-atdm-hansen-shiller-cuda-9.0-debug
  • Trilinos-atdm-hansen-shiller-cuda-9.0-opt
  • Trilinos-atdm-hansen-shiller-gnu-debug-serial
  • Trilinos-atdm-hansen-shiller-gnu-opt-serial

all on 'hansen' starting on 5/29/201 or 5/30/2018. (Since the these builds are pulling directly from the 'develop' branch, they may be testing different versions on the same day and this is UTC time so they may be on the same testing day in Mountain time.)

That same query shows that that test has been consistently passing in every other promoted build on every other ATDM Trilinos testing machine.

What that query also shows is that in those same builds that are now timing out, the test was taking upwards of 6+ minutes to complete before it started timing out at 10 minutes on 5/29/201 or 5/30/2018 as shown in the last non-timing-out builds:

  • Trilinos-atdm-hansen-shiller-cuda-8.0-debug: 6m 26s 280ms
  • Trilinos-atdm-hansen-shiller-cuda-8.0-opt: 6m 25s 680ms
  • Trilinos-atdm-hansen-shiller-cuda-9.0-debug: 6m 22s 810ms
  • Trilinos-atdm-hansen-shiller-cuda-9.0-opt: 6m 22s 440ms
  • Trilinos-atdm-hansen-shiller-gnu-debug-serial: 6m 13s 150ms
  • Trilinos-atdm-hansen-shiller-gnu-opt-serial: 5m 58s 960ms

But the other builds that are not showing any timeouts, that test completes very fast (in under 30 seconds in about every case). Some of the recent test times shown in that query for the various builds that don't have timeouts now are:

  • Trilinos-atdm-hansen-shiller-gnu-debug-openmp: 23s 850ms
  • Trilinos-atdm-hansen-shiller-gnu-opt-openmp: 8s 650ms
  • Trilinos-atdm-hansen-shiller-intel-debug-openmp: 7s 720ms
  • Trilinos-atdm-hansen-shiller-intel-debug-serial: 7s 950ms
  • Trilinos-atdm-hansen-shiller-intel-opt-openmp: 6s 150ms
  • Trilinos-atdm-hansen-shiller-intel-opt-serial: 5s 910ms
  • Trilinos-atdm-rhel6-gnu-debug-openmp: 6s 840ms
  • Trilinos-atdm-rhel6-gnu-debug-serial: 5s 340ms
  • Trilinos-atdm-rhel6-gnu-opt-openmp: 5s 180ms
  • Trilinos-atdm-rhel6-gnu-opt-serial: 4s 250ms
  • Trilinos-atdm-rhel6-intel-opt-openmp: 3s 740ms
  • Trilinos-atdm-sems-gcc-7-2-0: 5s 290ms
  • Trilinos-atdm-white-ride-cuda-debug: 9s 430ms
  • Trilinos-atdm-white-ride-gnu-debug-openmp: 9s 90ms

So this seems pretty crazy. How can the same test take over 6 minutes to complete for a CUDA 8.0 and 9.0 optimized build on 'hansen' and only take 9 seconds for a CUDA debug on 'white'? And this test takes a very long time (and are now timing out) for the gnu-debug-serial and gnu-opt-serial builds as well on 'hansen' but is fast for the intel-debug-serial and intel-opt-serial builds on the same machine. How can that be the case?

To try to get more insight about this test we can look at the test output for a case where it takes a long time to run (and is timing out currently) and compare that to the test output for a case that completes very quickly.

First, lets look at the last time this test passed for the Trilinos-atdm-hansen-shiller-gnu-debug-serial build on 'hansen' which took 6m 13s 150ms to complate and pass on 2018-05-29T06:41:19 UTC with output shown at:

  • https://testing-vm.sandia.gov/cdash/testDetails.php?test=47454651&build=3555977

which shows:

Passed.......OR Combination -> 
  OK...........Number of Iterations = 2206 < 4000
  Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
               residual [ 0 ] = 9.56537e-07 < 1e-06
               residual [ 1 ] = 9.4486e-07 < 1e-06
               residual [ 2 ] = 9.24543e-07 < 1e-06
               residual [ 3 ] = 9.44363e-07 < 1e-06
               residual [ 4 ] = 9.64382e-07 < 1e-06
               residual [ 5 ] = 9.14533e-07 < 1e-06
               residual [ 6 ] = 9.50517e-07 < 1e-06
               residual [ 7 ] = 8.31671e-07 < 1e-06
               residual [ 8 ] = 9.59686e-07 < 1e-06
               residual [ 9 ] = 9.74218e-07 < 1e-06

==================================================================================================================================

                                              TimeMonitor results over 4 processors

Timer Name                                MinOverProcs         MeanOverProcs        MaxOverProcs         MeanOverCallCounts       
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x                     1.489 (2.114e+04)    1.582 (2.114e+04)    1.668 (2.114e+04)    7.483e-05 (2.114e+04)    
Belos: Operation Prec*x                   0 (0)                0 (0)                0 (0)                0 (0)                    
Belos: RCGSolMgr total solve time         365.4 (1)            365.4 (1)            365.4 (1)            365.4 (1)                
Epetra_CrsMatrix::Multiply(TransA,X,Y)    1.45 (2.114e+04)     1.542 (2.114e+04)    1.629 (2.114e+04)    7.295e-05 (2.114e+04)    
==================================================================================================================================

And let's compare this to the test output for the build Trilinos-atdm-hansen-shiller-intel-debug-serial on 'hansen' which took 6s 740ms to complete and pass on 2018-05-29T14:52:35 UTC shown at:

  • https://testing-vm.sandia.gov/cdash/testDetails.php?test=47482010&build=3557186

which shows:

Passed.......OR Combination -> 
  OK...........Number of Iterations = 2131 < 4000
  Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
               residual [ 0 ] = 9.5909e-07 < 1e-06
               residual [ 1 ] = 9.65321e-07 < 1e-06
               residual [ 2 ] = 8.59334e-07 < 1e-06
               residual [ 3 ] = 9.55053e-07 < 1e-06
               residual [ 4 ] = 9.97094e-07 < 1e-06
               residual [ 5 ] = 7.53902e-07 < 1e-06
               residual [ 6 ] = 8.46489e-07 < 1e-06
               residual [ 7 ] = 9.64082e-07 < 1e-06
               residual [ 8 ] = 9.92318e-07 < 1e-06
               residual [ 9 ] = 9.92263e-07 < 1e-06

==================================================================================================================================

                                              TimeMonitor results over 4 processors

Timer Name                                MinOverProcs         MeanOverProcs        MaxOverProcs         MeanOverCallCounts       
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x                     2.026 (2.109e+04)    2.179 (2.109e+04)    2.403 (2.109e+04)    0.0001033 (2.109e+04)    
Belos: Operation Prec*x                   0 (0)                0 (0)                0 (0)                0 (0)                    
Belos: RCGSolMgr total solve time         5.945 (1)            5.946 (1)            5.946 (1)            5.946 (1)                
Epetra_CrsMatrix::Multiply(TransA,X,Y)    1.975 (2.109e+04)    2.116 (2.109e+04)    2.316 (2.109e+04)    0.0001003 (2.109e+04)    
==================================================================================================================================

The times for the individual operations is not that different but "Belos: RCGSolMgr total solve time" at 365.4 vs. 5.946 is the real problem. The final results show that the test is doing different computations in these two builds but the total number of operations is not radically different (e,g, 2.114e+04 vs. 2.109e+04 mat-vecs). So what is going on here to cause the huge increase in wall clock time for a serial Kokkos threading test?

Looking at the new commits pulled in when this started to fail for the build Trilinos-atdm-hansen-shiller-gnu-opt-serial on 2018-05-29 14:05:09 shown at:

  • https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3560199#!#note0

it is hard to tell what might have caused these tests to start timing out. I would guess that the most likely trigger was:

c840658:  Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Tue May 29 08:12:42 2018 -0600

M	cmake/ctest/drivers/atdm/shiller/local-driver.sh
M	cmake/std/atdm/shiller/environment.sh

That will increase the number of tests running on the machine and could result in single tests taking longer to run.

But the fact that the same test takes 6 minutes GCC but only takes 7 seconds with Intel is a major problem, in my opinion and that has to be investigated.

Someone is going to need to add some more timers to account for where the time is going.

Steps to reproduce

One should be able to follow the instructions at:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

to reproduce this behavior on 'hansen' or 'shiller'. To avoid needing to run on a compute node, one could use the gnu-debug-serial build and do:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-serial

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
  $TRILINOS_DIR

$ make NP=16 

$ ctest -VV -R Belos_rcg_hb_MPI_4
Assignee
Assign to
Time tracking