Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 tests failing due to max iterations limit seemingly randomly in the `Trilinos-atdm-white-ride-cuda-debug` build on 'white'

Created by: bartlettroscoe

CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solvers Product Lead)

Next Action Status

Tests disabled in the build Trilinos-atdm-white-ride-cuda-debug in commit cc7fff28, pushed on 6/12/2018, and shown as disabled and missing on CDash on 6/13/2018. PR #3546, merged on 10/2/2018, re-enabled the tests, which should have been fixed by the earlier-merged PR #3050. No new failures as of 12/3/2018!

Description

As shown in this rather complex query of all failing Belos tests (other than Belos_rcg_hb_MPI_4) in all promoted ATDM builds since 5/10/2018, the tests:

Belos_pseudo_stochastic_pcg_hb_0_MPI_4
Belos_pseudo_stochastic_pcg_hb_1_MPI_4

failed 5 times in total and appear to be failing randomly in the Trilinos-atdm-white-ride-cuda-debug build. (The other failing test shown was Belos_pseudo_pcg_hb_1_MPI_4, also in the Trilinos-atdm-white-ride-cuda-debug build, but it failed only once, yesterday, so we will ignore it for now.) (The test Belos_rcg_hb_MPI_4 was excluded from the above query because it is addressed in #2919.)

Looking at the testing history for the tests Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 from 5/10/2018 through today, 6/8/2018, in this less complex query, one can see that these tests complete in about the same time, under 2 seconds, whether they pass or fail.

The output when these tests fail (such as shown for the test Belos_pseudo_stochastic_pcg_hb_1_MPI_4 yesterday on 6/7/2018 here) looks like:

Belos::StatusTestGeneralOutput: Passed
  (Num calls,Mod test,State test): (104, 1, Passed)
   Passed.......OR Combination -> 
     Failed.......Number of Iterations = 100 == 100
     Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
                  residual [ 0 ] = 8.95881e-09 < 1e-08
                  residual [ 1 ] = 1.21989e-08 > 1e-08
                  residual [ 2 ] = 6.84374e-09 < 1e-08
                  residual [ 3 ] = 9.15804e-09 < 1e-08
                  residual [ 4 ] = 7.2567e-09 < 1e-08

Passed.......OR Combination -> 
  Failed.......Number of Iterations = 100 == 100
  Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
               residual [ 0 ] = 8.95881e-09 < 1e-08
               residual [ 1 ] = 1.21989e-08 > 1e-08
               residual [ 2 ] = 6.84374e-09 < 1e-08
               residual [ 3 ] = 9.15804e-09 < 1e-08
               residual [ 4 ] = 7.2567e-09 < 1e-08

==================================================================================================================================

                                              TimeMonitor results over 4 processors

Timer Name                                               MinOverProcs     MeanOverProcs    MaxOverProcs     MeanOverCallCounts    
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x                                    0.06571 (101)    0.07122 (101)    0.07694 (101)    0.0007051 (101)       
Belos: Operation Prec*x                                  0.1014 (104)     0.108 (104)      0.1151 (104)     0.001039 (104)        
Belos: PseudoBlockStochasticCGSolMgr total solve time    0.2159 (1)       0.216 (1)        0.2162 (1)       0.216 (1)             
Epetra_CrsMatrix::Multiply(TransA,X,Y)                   0.0665 (102)     0.07206 (102)    0.07777 (102)    0.0007065 (102)       
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y)        0.101 (210)      0.1076 (210)     0.1147 (210)     0.0005122 (210)       
==================================================================================================================================
---------- Actual Residuals (normalized) ----------

Problem 0 : 	8.95881e-09
Problem 1 : 	1.21989e-08
Problem 2 : 	6.84374e-09
Problem 3 : 	9.15804e-09
Problem 4 : 	7.2567e-09

End Result: TEST FAILED

So this shows that the test fails because the max iteration limit of 100 is reached before the desired residual tolerance of 1e-08 is met: residual 1 is still at 1.21989e-08, while the other four residuals are below the tolerance. The other failures for the tests Belos_pseudo_stochastic_pcg_hb_0_MPI_4 and Belos_pseudo_stochastic_pcg_hb_1_MPI_4 all appear to max out the number of iterations at 100 as well.
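The failure signature above can also be checked mechanically across many CDash logs. The following is a minimal sketch, assuming the status-test output keeps the line format quoted in this issue; the function name `summarize_belos_status` is hypothetical and is not part of Belos or Trilinos.

```python
import re

# Patterns assume the line format of the Belos status-test output quoted in this issue.
ITER_RE = re.compile(r"Number of Iterations = (\d+) (==|<) (\d+)")
RES_RE = re.compile(r"residual \[ (\d+) \] = (\S+) ([<>]) (\S+)")


def summarize_belos_status(text):
    """Return (iterations, max_iterations, unconverged).

    `unconverged` maps a right-hand-side index to its relative residual
    for every residual that is still above the tolerance.
    """
    iters = max_iters = None
    unconverged = {}
    for line in text.splitlines():
        m = ITER_RE.search(line)
        if m:
            iters, max_iters = int(m.group(1)), int(m.group(3))
            continue
        m = RES_RE.search(line)
        if m and float(m.group(2)) > float(m.group(4)):
            unconverged[int(m.group(1))] = float(m.group(2))
    return iters, max_iters, unconverged


# Excerpt copied from the failing output above.
sample = """\
     Failed.......Number of Iterations = 100 == 100
     Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
                  residual [ 0 ] = 8.95881e-09 < 1e-08
                  residual [ 1 ] = 1.21989e-08 > 1e-08
"""
iters, max_iters, unconverged = summarize_belos_status(sample)
print(iters, max_iters, unconverged)
```

For the excerpt above, this reports that the iteration limit was reached and that only right-hand side 1 missed the tolerance.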

When the test Belos_pseudo_stochastic_pcg_hb_1_MPI_4 passed the day before, on 6/6/2018, as shown here, it showed output like:

Belos::StatusTestGeneralOutput: Passed
  (Num calls,Mod test,State test): (89, 1, Passed)
   Passed.......OR Combination -> 
     OK...........Number of Iterations = 87 < 100
     Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
                  residual [ 0 ] = 5.02551e-09 < 1e-08
                  residual [ 1 ] = 5.92159e-09 < 1e-08
                  residual [ 2 ] = 6.61897e-09 < 1e-08
                  residual [ 3 ] = 8.2598e-09 < 1e-08
                  residual [ 4 ] = 3.67011e-09 < 1e-08

Passed.......OR Combination -> 
  OK...........Number of Iterations = 87 < 100
  Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
               residual [ 0 ] = 5.02551e-09 < 1e-08
               residual [ 1 ] = 5.92159e-09 < 1e-08
               residual [ 2 ] = 6.61897e-09 < 1e-08
               residual [ 3 ] = 8.2598e-09 < 1e-08
               residual [ 4 ] = 3.67011e-09 < 1e-08

=================================================================================================================================

                                              TimeMonitor results over 4 processors

Timer Name                                               MinOverProcs     MeanOverProcs    MaxOverProcs    MeanOverCallCounts    
---------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x                                    0.0652 (88)      0.06892 (88)     0.07251 (88)    0.0007831 (88)        
Belos: Operation Prec*x                                  0.09675 (89)     0.1009 (89)      0.1101 (89)     0.001134 (89)         
Belos: PseudoBlockStochasticCGSolMgr total solve time    0.195 (1)        0.195 (1)        0.195 (1)       0.195 (1)             
Epetra_CrsMatrix::Multiply(TransA,X,Y)                   0.06596 (89)     0.06969 (89)     0.07333 (89)    0.0007831 (89)        
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y)        0.09635 (180)    0.1006 (180)     0.1098 (180)    0.0005587 (180)       
=================================================================================================================================
---------- Actual Residuals (normalized) ----------

Problem 0 : 	5.02551e-09
Problem 1 : 	5.92159e-09
Problem 2 : 	6.61897e-09
Problem 3 : 	8.2598e-09
Problem 4 : 	3.67011e-09

End Result: TEST PASSED

which shows it converged in 87 iterations. I looked at several other instances when these tests passed and they all look to be converging in 87 iterations.

Is this non-deterministic behavior due to the fact that this is "stochastic" code and therefore the behavior is truly random, or due to the random seed not being set consistently, or due to non-deterministic behavior in the accumulations with the CUDA 8.0 threaded Kokkos implementation on this machine? The fact that the test seems to converge in exactly 87 iterations whenever it passes suggests that this is not purposeful random behavior but the result of some other undesired and unintended randomness.
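The hypothesis about non-deterministic accumulations rests on the fact that floating-point addition is not associative, so a threaded reduction that combines partial sums in a varying order can return slightly different inner products from run to run. The sketch below illustrates only that arithmetic effect in plain Python; it does not use Kokkos or CUDA, and the values are chosen purely for illustration.

```python
def sum_in_order(values, order):
    """Accumulate values left to right, following the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Illustrative values only: a large pair that cancels, plus a small term.
values = [1e16, 1.0, -1e16]

a = sum_in_order(values, [0, 1, 2])  # (1e16 + 1.0) - 1e16: the 1.0 is rounded away
b = sum_in_order(values, [0, 2, 1])  # (1e16 - 1e16) + 1.0: the 1.0 survives
print(a, b, a == b)
```

In a CG iteration, differences of this kind in the inner products perturb the step lengths, which could plausibly shift the iteration count from 87 to 100 or more. This issue has not established that this is the actual cause.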

Steps to reproduce

Following the instructions at:

https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

one might be able to reproduce this behavior on 'white' or 'ride' by cloning the Trilinos github repo, getting on the 'develop' branch and then doing:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
  $TRILINOS_DIR

$ make NP=16 

$ bsub -x -Is -q rhel7F -n 16 ctest -j16

But given that this test looks to be randomly failing, it may be hard to reproduce this behavior locally.
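Because the failure is intermittent, one option is to rerun just these tests many times and count the failures. The following is a rough sketch: the helper `count_failures` is hypothetical, and the `ctest -R` invocation is an assumption that would need to be run from the build directory inside the same kind of allocation used above.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return the number of nonzero exit codes."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Assumed invocation (not executed here); -R selects tests by regular expression.
CTEST_CMD = ["ctest", "-R", "Belos_pseudo_stochastic_pcg_hb"]

# Self-check with a stand-in command that always exits with status 1.
always_fails = [sys.executable, "-c", "import sys; sys.exit(1)"]
print(count_failures(always_fails, runs=3))
```

Passing `CTEST_CMD` with a larger `runs` value instead of the stand-in command would give a rough failure rate for the two tests.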