Tests Belos_pseudo_pcg_hb_[0,1]_MPI_4 seem to fail randomly by hitting max iterations in some builds on white/ride
Created by: bartlettroscoe
CC: @trilinos/belos, @fryeguy52, @srajama1 (Trilinos Linear Solvers Product Lead)
Next Action Status
PR #3546 merged on 10/2/2018 re-enabled these tests on 'white' and 'ride'. No new failures as of 12/3/2018.
Description
As shown in this query, the tests Belos_pseudo_pcg_hb_0_MPI_4
and Belos_pseudo_pcg_hb_1_MPI_4
appear to be failing randomly (24 times in total over the last 16 weeks) in the following builds on 'white' and 'ride':
- Trilinos-atdm-white-ride-cuda-debug
- Trilinos-atdm-white-ride-cuda-debug-all-at-once
- Trilinos-atdm-white-ride-cuda-opt
- Trilinos-atdm-white-ride-gnu-debug-openmp
- Trilinos-atdm-white-ride-gnu-opt-openmp
The output from the most recent failure of the test Belos_pseudo_pcg_hb_1_MPI_4
in the build Trilinos-atdm-white-ride-cuda-debug
on 'ride' earlier today (shown here) is:
...
Belos::StatusTestGeneralOutput: Passed
(Num calls,Mod test,State test): (104, 1, Passed)
Passed.......OR Combination ->
Failed.......Number of Iterations = 100 == 100
Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 8.95881e-09 < 1e-08
residual [ 1 ] = 1.21989e-08 > 1e-08
residual [ 2 ] = 6.84374e-09 < 1e-08
residual [ 3 ] = 9.15804e-09 < 1e-08
residual [ 4 ] = 7.2567e-09 < 1e-08
Passed.......OR Combination ->
Failed.......Number of Iterations = 100 == 100
Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 8.95881e-09 < 1e-08
residual [ 1 ] = 1.21989e-08 > 1e-08
residual [ 2 ] = 6.84374e-09 < 1e-08
residual [ 3 ] = 9.15804e-09 < 1e-08
residual [ 4 ] = 7.2567e-09 < 1e-08
==============================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 0.06529 (101) 0.06993 (101) 0.07256 (101) 0.0006924 (101)
Belos: Operation Prec*x 0.09817 (104) 0.1055 (104) 0.1237 (104) 0.001015 (104)
Belos: PseudoBlockCGSolMgr total solve time 0.2098 (1) 0.2099 (1) 0.21 (1) 0.2099 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 0.06611 (102) 0.07081 (102) 0.0734 (102) 0.0006942 (102)
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y) 0.09769 (210) 0.1051 (210) 0.1233 (210) 0.0005004 (210)
==============================================================================================================================
---------- Actual Residuals (normalized) ----------
Problem 0 : 8.95881e-09
Problem 1 : 1.21989e-08
Problem 2 : 6.84374e-09
Problem 3 : 9.15804e-09
Problem 4 : 7.2567e-09
End Result: TEST FAILED
Note that the solver hit the maximum number of iterations while the residual for problem 1 (1.21989e-08) was still above the 1e-08 tolerance:
Failed.......Number of Iterations = 100 == 100
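To triage other failing instances quickly, the test output can be scanned for this signature. The following is only a sketch: the helper names `hit_max_iters` and `unconverged_residuals` are hypothetical (they are not part of Trilinos or Belos), and the patterns assume the output format shown in the log above.

```shell
# Hypothetical helpers (not part of Trilinos): scan Belos test output on stdin.

# Succeeds if the status test reports that the iteration cap was reached,
# i.e. a line like "Failed.......Number of Iterations = 100 == 100".
# The back-reference \1 requires both numbers on the line to be identical.
hit_max_iters() {
  grep -q 'Failed\.*Number of Iterations = \([0-9][0-9]*\) == \1[[:space:]]*$'
}

# Prints the distinct residual lines that are still above the tolerance.
# The status test is printed more than once, so duplicates are removed.
unconverged_residuals() {
  grep 'residual \[ *[0-9][0-9]* *\] = .* > ' | sort -u
}

# Example (assumes the test output was saved to a file named test.log):
#   hit_max_iters < test.log && unconverged_residuals < test.log
```

Applied to the failing output above, this should flag the run and report only the residual for problem 1.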
The previous day, this same test in this same build on 'ride' passed, as shown here:
Belos::StatusTestGeneralOutput: Passed
(Num calls,Mod test,State test): (89, 1, Passed)
Passed.......OR Combination ->
OK...........Number of Iterations = 87 < 100
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 5.02551e-09 < 1e-08
residual [ 1 ] = 5.92159e-09 < 1e-08
residual [ 2 ] = 6.61897e-09 < 1e-08
residual [ 3 ] = 8.2598e-09 < 1e-08
residual [ 4 ] = 3.67011e-09 < 1e-08
Passed.......OR Combination ->
OK...........Number of Iterations = 87 < 100
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 5.02551e-09 < 1e-08
residual [ 1 ] = 5.92159e-09 < 1e-08
residual [ 2 ] = 6.61897e-09 < 1e-08
residual [ 3 ] = 8.2598e-09 < 1e-08
residual [ 4 ] = 3.67011e-09 < 1e-08
=============================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
-----------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 0.06567 (88) 0.07594 (88) 0.08703 (88) 0.0008629 (88)
Belos: Operation Prec*x 0.09592 (89) 0.1201 (89) 0.15 (89) 0.00135 (89)
Belos: PseudoBlockCGSolMgr total solve time 0.2488 (1) 0.2489 (1) 0.249 (1) 0.2489 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 0.06652 (89) 0.07686 (89) 0.08802 (89) 0.0008635 (89)
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y) 0.09555 (180) 0.1198 (180) 0.1497 (180) 0.0006653 (180)
=============================================================================================================================
---------- Actual Residuals (normalized) ----------
Problem 0 : 5.02551e-09
Problem 1 : 5.92159e-09
Problem 2 : 6.61897e-09
Problem 3 : 8.2598e-09
Problem 4 : 3.67011e-09
End Result: TEST PASSED
In that passing run, the solver converged before reaching the iteration limit:
OK...........Number of Iterations = 87 < 100
The other failing instances I looked at from the above query all hit the maximum number of iterations as well:
Failed.......Number of Iterations = 100 == 100
This looks nearly identical to the behavior of the randomly failing tests Belos_pseudo_stochastic_pcg_hb_0_MPI_4
and Belos_pseudo_stochastic_pcg_hb_1_MPI_4
described in issue #2920 (closed), which were disabled. My guess is that these tests suffer from the same random generator problem described in https://github.com/trilinos/Trilinos/issues/2920#issuecomment-398109326.
Steps to Reproduce
One could try to reproduce this on 'white' or 'ride' as described at:
using something like:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug # or cuda-opt, gnu-debug-openmp, etc.
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
But since this is a randomly failing test that usually passes, it may be hard to reproduce this behavior locally.
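One way to improve the odds of catching a failure is to rerun only these two tests many times. ctest has a `--repeat-until-fail <n>` option for this. As an alternative, the sketch below wraps an arbitrary command in a loop. The helper name `repeat_until_fail` is hypothetical, and the example usage assumes it is run from the build directory inside an allocation obtained as above.

```shell
# Hypothetical helper (not part of Trilinos): run a command up to N times
# and stop at the first failure.
# Usage: repeat_until_fail <max_runs> <command> [args...]
repeat_until_fail() {
  max_runs=$1
  shift
  run=1
  while [ "$run" -le "$max_runs" ]; do
    if ! "$@"; then
      echo "run $run of $max_runs failed"
      return 1
    fi
    run=$((run + 1))
  done
  echo "all $max_runs runs passed"
}

# Example (the regex matches both Belos_pseudo_pcg_hb_0_MPI_4 and
# Belos_pseudo_pcg_hb_1_MPI_4):
#   repeat_until_fail 50 ctest -R Belos_pseudo_pcg_hb_
```

Given 24 failures over 16 weeks of nightly builds, a few dozen repetitions may still not trigger the failure, so a clean run of this loop does not show that the problem is gone.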