Belos_Tpetra_HybridGMRES_hb_test_* randomly failing in many Trilinos builds
Created by: fryeguy52
CC: @trilinos/belos, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
PR #4229, which may fix this, was merged to 'develop' on 1/22/2019. Next: Watch for further random failures; if there are no new failures by 2/22/2019, then we can close ...
Description
As shown in this query, the tests:
- Belos_Tpetra_HybridGMRES_hb_test_1_MPI_4
- Belos_Tpetra_HybridGMRES_hb_test_0_MPI_4
have failed 11 total times since 2018-12-01 in the following ATDM builds:
- Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt
- Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt
- Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt
- Trilinos-atdm-mutrino-intel-opt-openmp-HSW
- Trilinos-atdm-sems-rhel6-gnu-debug-openmp
- Trilinos-atdm-sems-rhel6-gnu-opt-openmp
- Trilinos-atdm-sems-rhel6-gnu-opt-serial
This query shows that the Belos_Tpetra_HybridGMRES_hb_test_* tests have also been failing in other Trilinos builds during that same time period.
Here is some typical output from a failure:
Belos Version 1.3d - 9/17/2008
Dimension of matrix: 1806
Number of right-hand sides: 1
Block size used by solver: 1
Max number of Gmres iterations: 1805
Relative residual tolerance: 1e-05
Failed.......OR Combination ->
OK...........Number of Iterations = 800 < 1805
Unconverged..(2-Norm Res Vec) / (2-Norm Prec Res0)
residual [ 0 ] = 0.0224497 > 1e-05
========================================================================================================================
TimeMonitor results over 4 processors
Timer Name                                MinOverProcs      MeanOverProcs     MaxOverProcs      MeanOverCallCounts
------------------------------------------------------------------------------------------------------------------------
Belos: BlockGmresSolMgr total solve time  0.5308 (1)        0.5308 (1)        0.5308 (1)        0.5308 (1)
Belos: DGKS[2]: Ortho (Inner Product)     0.03627 (1370)    0.03643 (1370)    0.03654 (1370)    2.659e-05 (1370)
Belos: DGKS[2]: Ortho (Norm)              0.01371 (2416)    0.01547 (2416)    0.01742 (2416)    6.402e-06 (2416)
Belos: DGKS[2]: Ortho (Update)            0.02398 (1370)    0.02443 (1370)    0.02485 (1370)    1.783e-05 (1370)
Belos: DGKS[2]: Orthogonalization         0.08255 (816)     0.08426 (816)     0.08634 (816)     0.0001033 (816)
Belos: GmresPolyOp creation time          0.001283 (1)      0.001308 (1)      0.001326 (1)      0.001308 (1)
Belos: Hybrid Gmres: Operation Op*x       0.03555 (815)     0.03593 (815)     0.03648 (815)     4.408e-05 (815)
Belos: Hybrid Gmres: Operation Prec*x     0.3885 (816)      0.3908 (816)      0.3932 (816)      0.0004789 (816)
Belos: Operation Op*x                     0.382 (8986)      0.3853 (8986)     0.3886 (8986)     4.288e-05 (8986)
Belos: Operation Prec*x                   0 (0)             0 (0)             0 (0)             0 (0)
========================================================================================================================
---------- Actual Residuals (normalized) ----------
Problem 0 : 0.0224497
End Result: TEST FAILED
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[58035,1],2]
Exit code: 1
--------------------------------------------------------------------------
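When comparing several failing runs, it can help to pull just the unconverged residual lines out of saved test output. The following is a minimal sketch that assumes the log uses the same "residual [ k ] = value > tolerance" line format as the output above; the function name unconverged_residuals and the log file argument are hypothetical.

```shell
# Hypothetical helper: list unconverged residuals found in a saved test log.
# Assumes lines of the form "residual [ 0 ] = 0.0224497 > 1e-05".
unconverged_residuals() {
  # awk splits on whitespace, so the fields are:
  #   $1=residual  $2=[  $3=rhs index  $4=]  $5==  $6=value  $7=>  $8=tolerance
  awk '$1 == "residual" && $5 == "=" && $7 == ">" {
         print "rhs=" $3, "residual=" $6, "tol=" $8
       }' "$1"
}
```

For the failure shown above, calling the function on a file containing that output would report right-hand side 0 with residual 0.0224497 against a tolerance of 1e-05.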
Current Status on CDash
- Current status and recent history of failures of test Belos_Tpetra_HybridGMRES_hb_test_* on CDash
- Recent history of test Belos_Tpetra_HybridGMRES_hb_test_1_MPI_4 in build
Steps to Reproduce
One should be able to reproduce a build where this random failure has a chance of occurring with a sems rhel6 environment as described in:
More specifically, the commands for a sems rhel6 environment are provided at:
The exact commands to reproduce a build where this random failure has a chance of occurring should be:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-sems-rhel6-gnu-opt-openmp
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -j8
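Because the failure is random, a single ctest run may well pass. One way to estimate how often the tests fail is to rerun only the affected tests in a loop. The following is a minimal sketch, not part of the ATDM scripts: the helper name count_failures, the run count, and the test-name regex in the usage comment are illustrative assumptions.

```shell
# Hypothetical helper: run a command N times and print how many runs failed.
# Usage: count_failures <runs> <command> [args...]
# Note: runs, failures, and i are plain (global) shell variables.
count_failures() {
  runs="$1"; shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard the command's output; only its exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
}

# Example, run from the build directory (regex and run count are illustrative):
#   count_failures 50 ctest -R '^Belos_Tpetra_HybridGMRES_hb_test_'
```

If the goal is only to catch one failing run rather than count them, ctest's own --repeat-until-fail <n> option combined with -R should serve the same purpose without a wrapper.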