Trilinos issue #4159

Closed
Created Jan 09, 2019 by James Willenbring (@jmwille, Maintainer)

Belos_Tpetra_HybridGMRES_hb_test_* randomly failing in many trilinos builds

Created by: fryeguy52

CC: @trilinos/belos, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52

Next Action Status

PR #4229, which may fix this, was merged to 'develop' on 1/22/2019. Next: Watch for further random failures; if there are no new failures by 2/22/2019, we can close ...

Description

As shown in this query, the tests:

  • Belos_Tpetra_HybridGMRES_hb_test_1_MPI_4
  • Belos_Tpetra_HybridGMRES_hb_test_0_MPI_4

have failed a total of 11 times since 2018-12-01 in the following ATDM builds:

  • Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt
  • Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt
  • Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt
  • Trilinos-atdm-mutrino-intel-opt-openmp-HSW
  • Trilinos-atdm-sems-rhel6-gnu-debug-openmp
  • Trilinos-atdm-sems-rhel6-gnu-opt-openmp
  • Trilinos-atdm-sems-rhel6-gnu-opt-serial

This query shows that the Belos_Tpetra_HybridGMRES_hb_test_* tests have also been failing in other Trilinos builds during the same time period.

Here is some typical output from a failure:

Belos Version 1.3d - 9/17/2008



Dimension of matrix: 1806
Number of right-hand sides: 1
Block size used by solver: 1
Max number of Gmres iterations: 1805
Relative residual tolerance: 1e-05

Failed.......OR Combination -> 
  OK...........Number of Iterations = 800 < 1805
  Unconverged..(2-Norm Res Vec) / (2-Norm Prec Res0)
               residual [ 0 ] = 0.0224497 > 1e-05

========================================================================================================================

                                         TimeMonitor results over 4 processors

Timer Name                                  MinOverProcs      MeanOverProcs     MaxOverProcs      MeanOverCallCounts    
------------------------------------------------------------------------------------------------------------------------
Belos: BlockGmresSolMgr total solve time    0.5308 (1)        0.5308 (1)        0.5308 (1)        0.5308 (1)            
Belos: DGKS[2]: Ortho (Inner Product)       0.03627 (1370)    0.03643 (1370)    0.03654 (1370)    2.659e-05 (1370)      
Belos: DGKS[2]: Ortho (Norm)                0.01371 (2416)    0.01547 (2416)    0.01742 (2416)    6.402e-06 (2416)      
Belos: DGKS[2]: Ortho (Update)              0.02398 (1370)    0.02443 (1370)    0.02485 (1370)    1.783e-05 (1370)      
Belos: DGKS[2]: Orthogonalization           0.08255 (816)     0.08426 (816)     0.08634 (816)     0.0001033 (816)       
Belos: GmresPolyOp creation time            0.001283 (1)      0.001308 (1)      0.001326 (1)      0.001308 (1)          
Belos: Hybrid Gmres: Operation Op*x         0.03555 (815)     0.03593 (815)     0.03648 (815)     4.408e-05 (815)       
Belos: Hybrid Gmres: Operation Prec*x       0.3885 (816)      0.3908 (816)      0.3932 (816)      0.0004789 (816)       
Belos: Operation Op*x                       0.382 (8986)      0.3853 (8986)     0.3886 (8986)     4.288e-05 (8986)      
Belos: Operation Prec*x                     0 (0)             0 (0)             0 (0)             0 (0)                 
========================================================================================================================
---------- Actual Residuals (normalized) ----------

Problem 0 : 	0.0224497

End Result: TEST FAILED
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58035,1],2]
  Exit code:    1
--------------------------------------------------------------------------
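
When triaging several of these logs, it can help to pull out just the residual and the tolerance it was compared against. The following is a minimal sketch, assuming the log lines have the same layout as the output above; the function name `extract_residual` and the file name in the usage note are hypothetical, not part of Belos or Trilinos.

```shell
# Hypothetical helper: print "<residual> <tolerance>" for each residual
# line found in a Belos test log read from stdin.
extract_residual() {
  awk '/residual \[ *[0-9]+ *\] *=/ {
    # Fields look like: residual [ 0 ] = 0.0224497 > 1e-05
    for (i = 1; i <= NF; i++) if ($i == "=") { print $(i + 1), $(i + 3); next }
  }'
}
```

For example, `extract_residual < test_output.log` would print one residual/tolerance pair per matching line of a saved log.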

Current Status on CDash

  • Current status and recent history of failures of test Belos_Tpetra_HybridGMRES_hb_test_* on CDash
  • Recent history of test Belos_Tpetra_HybridGMRES_hb_test_1_MPI_4 in build

Steps to Reproduce

One should be able to reproduce a build where this random failure has a chance of occurring with a sems rhel6 environment as described in:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

More specifically, the commands for a sems rhel6 environment are provided at:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#sems-rhel6-environment

The exact commands to reproduce a build where this random failure has a chance of occurring should be:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-sems-rhel6-gnu-opt-openmp
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
 $TRILINOS_DIR
$ make NP=16
$ ctest -j8
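
Because the failure is random, a single `ctest` run may well pass. One way to raise the chance of seeing it is to rerun only the affected tests in a loop. The sketch below is an assumption about how one might do that, not a procedure taken from the ATDM README; `repeat_until_fail` is a hypothetical helper, and the run count of 50 is arbitrary. Recent CTest versions also offer a built-in `--repeat-until-fail <n>` option that serves a similar purpose.

```shell
# Hypothetical helper: run a command up to <max_runs> times and stop at the
# first failure, reporting which run failed.
repeat_until_fail() {
  max_runs="$1"
  shift
  run=1
  while [ "$run" -le "$max_runs" ]; do
    if ! "$@"; then
      echo "failed on run $run of $max_runs"
      return 1
    fi
    run=$((run + 1))
  done
  echo "passed all $max_runs runs"
  return 0
}

# Example (from the build directory, after the build above completes):
# repeat_until_fail 50 ctest -R Belos_Tpetra_HybridGMRES_hb_test --output-on-failure
```

The `-R` option restricts CTest to tests whose names match the given regular expression, so only the two HybridGMRES hb tests are rerun.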