Belos_rcg_hb_MPI_4 timing out in several ATDM Trilinos builds on 'hansen' since 5/29/2018
Created by: bartlettroscoe
CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solves Project Lead)
Next Action Status
Test was disabled in these builds on 'hansen' in the commit 8850c648 pushed on 6/12/2018 and was shown to be disabled in the builds on CDash 6/13/2018
Description
As shown in this large query the test Belos_rcg_hb_MPI_4
looks to be consistently timing out in the builds:
- Trilinos-atdm-hansen-shiller-cuda-8.0-debug
- Trilinos-atdm-hansen-shiller-cuda-8.0-opt
- Trilinos-atdm-hansen-shiller-cuda-9.0-debug
- Trilinos-atdm-hansen-shiller-cuda-9.0-opt
- Trilinos-atdm-hansen-shiller-gnu-debug-serial
- Trilinos-atdm-hansen-shiller-gnu-opt-serial
all on 'hansen' starting on 5/29/201 or 5/30/2018. (Since the these builds are pulling directly from the 'develop' branch, they may be testing different versions on the same day and this is UTC time so they may be on the same testing day in Mountain time.)
That same query shows that that test has been consistently passing in every other promoted build on every other ATDM Trilinos testing machine.
What that query also shows is that in those same builds that are now timing out, the test was taking upwards of 6+ minutes to complete before it started timing out at 10 minutes on 5/29/201 or 5/30/2018 as shown in the last non-timing-out builds:
- Trilinos-atdm-hansen-shiller-cuda-8.0-debug: 6m 26s 280ms
- Trilinos-atdm-hansen-shiller-cuda-8.0-opt: 6m 25s 680ms
- Trilinos-atdm-hansen-shiller-cuda-9.0-debug: 6m 22s 810ms
- Trilinos-atdm-hansen-shiller-cuda-9.0-opt: 6m 22s 440ms
- Trilinos-atdm-hansen-shiller-gnu-debug-serial: 6m 13s 150ms
- Trilinos-atdm-hansen-shiller-gnu-opt-serial: 5m 58s 960ms
But the other builds that are not showing any timeouts, that test completes very fast (in under 30 seconds in about every case). Some of the recent test times shown in that query for the various builds that don't have timeouts now are:
- Trilinos-atdm-hansen-shiller-gnu-debug-openmp: 23s 850ms
- Trilinos-atdm-hansen-shiller-gnu-opt-openmp: 8s 650ms
- Trilinos-atdm-hansen-shiller-intel-debug-openmp: 7s 720ms
- Trilinos-atdm-hansen-shiller-intel-debug-serial: 7s 950ms
- Trilinos-atdm-hansen-shiller-intel-opt-openmp: 6s 150ms
- Trilinos-atdm-hansen-shiller-intel-opt-serial: 5s 910ms
- Trilinos-atdm-rhel6-gnu-debug-openmp: 6s 840ms
- Trilinos-atdm-rhel6-gnu-debug-serial: 5s 340ms
- Trilinos-atdm-rhel6-gnu-opt-openmp: 5s 180ms
- Trilinos-atdm-rhel6-gnu-opt-serial: 4s 250ms
- Trilinos-atdm-rhel6-intel-opt-openmp: 3s 740ms
- Trilinos-atdm-sems-gcc-7-2-0: 5s 290ms
- Trilinos-atdm-white-ride-cuda-debug: 9s 430ms
- Trilinos-atdm-white-ride-gnu-debug-openmp: 9s 90ms
So this seems pretty crazy. How can the same test take over 6 minutes to complete for a CUDA 8.0 and 9.0 optimized build on 'hansen' and only take 9 seconds for a CUDA debug on 'white'? And this test takes a very long time (and are now timing out) for the gnu-debug-serial
and gnu-opt-serial
builds as well on 'hansen' but is fast for the intel-debug-serial
and intel-opt-serial
builds on the same machine. How can that be the case?
To try to get more insight about this test we can look at the test output for a case where it takes a long time to run (and is timing out currently) and compare that to the test output for a case that completes very quickly.
First, lets look at the last time this test passed for the Trilinos-atdm-hansen-shiller-gnu-debug-serial
build on 'hansen' which took 6m 13s 150ms to complate and pass on 2018-05-29T06:41:19 UTC with output shown at:
which shows:
Passed.......OR Combination ->
OK...........Number of Iterations = 2206 < 4000
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.56537e-07 < 1e-06
residual [ 1 ] = 9.4486e-07 < 1e-06
residual [ 2 ] = 9.24543e-07 < 1e-06
residual [ 3 ] = 9.44363e-07 < 1e-06
residual [ 4 ] = 9.64382e-07 < 1e-06
residual [ 5 ] = 9.14533e-07 < 1e-06
residual [ 6 ] = 9.50517e-07 < 1e-06
residual [ 7 ] = 8.31671e-07 < 1e-06
residual [ 8 ] = 9.59686e-07 < 1e-06
residual [ 9 ] = 9.74218e-07 < 1e-06
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 1.489 (2.114e+04) 1.582 (2.114e+04) 1.668 (2.114e+04) 7.483e-05 (2.114e+04)
Belos: Operation Prec*x 0 (0) 0 (0) 0 (0) 0 (0)
Belos: RCGSolMgr total solve time 365.4 (1) 365.4 (1) 365.4 (1) 365.4 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 1.45 (2.114e+04) 1.542 (2.114e+04) 1.629 (2.114e+04) 7.295e-05 (2.114e+04)
==================================================================================================================================
And let's compare this to the test output for the build Trilinos-atdm-hansen-shiller-intel-debug-serial
on 'hansen' which took 6s 740ms to complete and pass on 2018-05-29T14:52:35 UTC shown at:
which shows:
Passed.......OR Combination ->
OK...........Number of Iterations = 2131 < 4000
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.5909e-07 < 1e-06
residual [ 1 ] = 9.65321e-07 < 1e-06
residual [ 2 ] = 8.59334e-07 < 1e-06
residual [ 3 ] = 9.55053e-07 < 1e-06
residual [ 4 ] = 9.97094e-07 < 1e-06
residual [ 5 ] = 7.53902e-07 < 1e-06
residual [ 6 ] = 8.46489e-07 < 1e-06
residual [ 7 ] = 9.64082e-07 < 1e-06
residual [ 8 ] = 9.92318e-07 < 1e-06
residual [ 9 ] = 9.92263e-07 < 1e-06
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 2.026 (2.109e+04) 2.179 (2.109e+04) 2.403 (2.109e+04) 0.0001033 (2.109e+04)
Belos: Operation Prec*x 0 (0) 0 (0) 0 (0) 0 (0)
Belos: RCGSolMgr total solve time 5.945 (1) 5.946 (1) 5.946 (1) 5.946 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 1.975 (2.109e+04) 2.116 (2.109e+04) 2.316 (2.109e+04) 0.0001003 (2.109e+04)
==================================================================================================================================
The times for the individual operations is not that different but "Belos: RCGSolMgr total solve time" at 365.4 vs. 5.946 is the real problem. The final results show that the test is doing different computations in these two builds but the total number of operations is not radically different (e,g, 2.114e+04 vs. 2.109e+04 mat-vecs). So what is going on here to cause the huge increase in wall clock time for a serial Kokkos threading test?
Looking at the new commits pulled in when this started to fail for the build Trilinos-atdm-hansen-shiller-gnu-opt-serial
on 2018-05-29 14:05:09 shown at:
it is hard to tell what might have caused these tests to start timing out. I would guess that the most likely trigger was:
c840658: Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:12:42 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/shiller/environment.sh
That will increase the number of tests running on the machine and could result in single tests taking longer to run.
But the fact that the same test takes 6 minutes GCC but only takes 7 seconds with Intel is a major problem, in my opinion and that has to be investigated.
Someone is going to need to add some more timers to account for where the time is going.
Steps to reproduce
One should be able to follow the instructions at:
to reproduce this behavior on 'hansen' or 'shiller'. To avoid needing to run on a compute node, one could use the gnu-debug-serial
build and do:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-serial
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -VV -R Belos_rcg_hb_MPI_4