Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 tests failing due to max iterations limit seemingly randomly in the `Trilinos-atdm-white-ride-cuda-debug` build on 'white'
Created by: bartlettroscoe
CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solvers Product Lead)
Next Action Status
Disabled in build Trilinos-atdm-white-ride-cuda-debug in commit cc7fff28, pushed on 6/12/2018; the tests showed as disabled and missing on CDash on 6/13/2018. PR #3546, merged on 10/2/2018, re-enabled the tests, which should have been fixed by PR #3050, merged earlier. No new failures as of 12/3/2018!
Description
As shown in this rather complex query, which lists all failing Belos tests other than Belos_rcg_hb_MPI_4 in all promoted ATDM builds since 5/10/2018, the tests:
- Belos_pseudo_stochastic_pcg_hb_0_MPI_4
- Belos_pseudo_stochastic_pcg_hb_1_MPI_4
failed 5 times in total and appear to be failing randomly in the Trilinos-atdm-white-ride-cuda-debug build. (The other failing test shown was Belos_pseudo_pcg_hb_1_MPI_4, also in the Trilinos-atdm-white-ride-cuda-debug build, but it failed only once, yesterday, so we will ignore it for now.) (The test Belos_rcg_hb_MPI_4 was excluded from the above query because it is addressed in #2919.)
Looking at the testing history for the tests Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 from 5/10/2018 through today, 6/8/2018, in this less complex query, one can see that these tests complete in about the same time, under 2 seconds, whether they pass or fail.
The output when these tests fail (such as that shown here for the test Belos_pseudo_stochastic_pcg_hb_1_MPI_4 yesterday, 6/7/2018) looks like:
Belos::StatusTestGeneralOutput: Passed
(Num calls,Mod test,State test): (104, 1, Passed)
Passed.......OR Combination ->
Failed.......Number of Iterations = 100 == 100
Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 8.95881e-09 < 1e-08
residual [ 1 ] = 1.21989e-08 > 1e-08
residual [ 2 ] = 6.84374e-09 < 1e-08
residual [ 3 ] = 9.15804e-09 < 1e-08
residual [ 4 ] = 7.2567e-09 < 1e-08
Passed.......OR Combination ->
Failed.......Number of Iterations = 100 == 100
Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 8.95881e-09 < 1e-08
residual [ 1 ] = 1.21989e-08 > 1e-08
residual [ 2 ] = 6.84374e-09 < 1e-08
residual [ 3 ] = 9.15804e-09 < 1e-08
residual [ 4 ] = 7.2567e-09 < 1e-08
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name                                             MinOverProcs     MeanOverProcs    MaxOverProcs     MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x                                  0.06571 (101)    0.07122 (101)    0.07694 (101)    0.0007051 (101)
Belos: Operation Prec*x                                0.1014 (104)     0.108 (104)      0.1151 (104)     0.001039 (104)
Belos: PseudoBlockStochasticCGSolMgr total solve time  0.2159 (1)       0.216 (1)        0.2162 (1)       0.216 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y)                 0.0665 (102)     0.07206 (102)    0.07777 (102)    0.0007065 (102)
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y)      0.101 (210)      0.1076 (210)     0.1147 (210)     0.0005122 (210)
==================================================================================================================================
---------- Actual Residuals (normalized) ----------
Problem 0 : 8.95881e-09
Problem 1 : 1.21989e-08
Problem 2 : 6.84374e-09
Problem 3 : 9.15804e-09
Problem 4 : 7.2567e-09
End Result: TEST FAILED
This shows that the test fails because the maximum iteration limit of 100 is reached before all residuals drop below the desired tolerance of 1e-08 (residual 1 is still at 1.21989e-08). The other failures for the tests Belos_pseudo_stochastic_pcg_hb_0_MPI_4 and Belos_pseudo_stochastic_pcg_hb_1_MPI_4 all appear to max out the number of iterations at 100 as well.
When the test Belos_pseudo_stochastic_pcg_hb_1_MPI_4 passed the day before, on 6/6/2018 (as shown here), it showed output like:
Belos::StatusTestGeneralOutput: Passed
(Num calls,Mod test,State test): (89, 1, Passed)
Passed.......OR Combination ->
OK...........Number of Iterations = 87 < 100
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 5.02551e-09 < 1e-08
residual [ 1 ] = 5.92159e-09 < 1e-08
residual [ 2 ] = 6.61897e-09 < 1e-08
residual [ 3 ] = 8.2598e-09 < 1e-08
residual [ 4 ] = 3.67011e-09 < 1e-08
Passed.......OR Combination ->
OK...........Number of Iterations = 87 < 100
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 5.02551e-09 < 1e-08
residual [ 1 ] = 5.92159e-09 < 1e-08
residual [ 2 ] = 6.61897e-09 < 1e-08
residual [ 3 ] = 8.2598e-09 < 1e-08
residual [ 4 ] = 3.67011e-09 < 1e-08
=================================================================================================================================
TimeMonitor results over 4 processors
Timer Name                                             MinOverProcs     MeanOverProcs    MaxOverProcs     MeanOverCallCounts
---------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x                                  0.0652 (88)      0.06892 (88)     0.07251 (88)     0.0007831 (88)
Belos: Operation Prec*x                                0.09675 (89)     0.1009 (89)      0.1101 (89)      0.001134 (89)
Belos: PseudoBlockStochasticCGSolMgr total solve time  0.195 (1)        0.195 (1)        0.195 (1)        0.195 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y)                 0.06596 (89)     0.06969 (89)     0.07333 (89)     0.0007831 (89)
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y)      0.09635 (180)    0.1006 (180)     0.1098 (180)     0.0005587 (180)
=================================================================================================================================
---------- Actual Residuals (normalized) ----------
Problem 0 : 5.02551e-09
Problem 1 : 5.92159e-09
Problem 2 : 6.61897e-09
Problem 3 : 8.2598e-09
Problem 4 : 3.67011e-09
End Result: TEST PASSED
which shows it converged in 87 iterations. I looked at several other instances when these tests passed, and they all appear to converge in 87 iterations.
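To check the 87-versus-100 pattern across many saved runs, the iteration count can be pulled out of each log. This is a minimal sketch, assuming the test output has been saved to a text file and contains lines of the form quoted above; the function name `belos_iteration_count` and the sample file are hypothetical and not part of Trilinos.

```shell
# Print the last iteration count reported in a saved Belos test log.
# Assumes lines of the form "Number of Iterations = 87 < 100", as in
# the output quoted above. The function name is illustrative only.
belos_iteration_count() {
  grep -o 'Number of Iterations = [0-9][0-9]*' "$1" \
    | tail -n 1 \
    | awk '{ print $NF }'
}

# Demo on a small sample log written to a temporary file.
sample_log=$(mktemp)
printf '%s\n' \
  'Passed.......OR Combination ->' \
  'OK...........Number of Iterations = 87 < 100' \
  'Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)' > "$sample_log"
belos_iteration_count "$sample_log"   # prints 87
rm -f "$sample_log"
```

Running this over a directory of logs downloaded from CDash would show at a glance whether every passing run really stops at 87 and every failing run at 100.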
Is this non-deterministic behavior due to the fact that this is "stochastic" code, so the behavior is truly random? Is it due to the random seed not being set consistently? Or is it due to non-deterministic behavior in the accumulations with the CUDA 8.0 threaded Kokkos implementation on this machine? The fact that the test consistently converges in 87 iterations when it passes suggests that this is not purposeful random behavior but the result of some other, unintended non-determinism.
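The accumulation hypothesis rests on floating-point addition not being associative: a reduction that combines the same values in a different order can return a different result. The sketch below only illustrates that mechanism with an exaggerated example in double precision; it does not show that this is what happens in these tests, and the helper name `sum_in_order` is hypothetical.

```shell
# Sum the arguments left to right in double precision using awk.
sum_in_order() {
  printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.17g\n", s }'
}

# Same three values, two different summation orders.
sum_in_order 10000000000000000 -10000000000000000 1   # prints 1
sum_in_order 1 10000000000000000 -10000000000000000   # prints 0
```

In an iterative solver, differences from reordered sums are normally far smaller than this, but they feed into every inner product, so a run that sits close to the tolerance could plausibly end up on either side of the iteration limit.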
Steps to reproduce
Following the instructions at:
one might be able to reproduce this behavior on 'white' or 'ride' by cloning the Trilinos github repo, getting on the 'develop' branch and then doing:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug
$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
  $TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
But given that this test looks to be randomly failing, it may be hard to reproduce this behavior locally.
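Since a single run may well pass, one option is to rerun the test many times and count the failures. This is a minimal sketch; the helper `count_failures` is hypothetical, and the demo lines use `true` and `false` as stand-ins for the real test command.

```shell
# Run a command N times and print how many runs exited non-zero.
# Usage: count_failures N command [args...]
count_failures() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures"
}

# Stand-in commands: 'false' always fails, 'true' never does.
count_failures 5 false   # prints 5
count_failures 5 true    # prints 0
```

From the build directory, inside an allocation on 'white' or 'ride', something like `count_failures 50 ctest -R Belos_pseudo_stochastic_pcg_hb` would then give a rough failure rate. The `--repeat-until-fail` option of ctest is an alternative that stops at the first failure.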