Anasazi_Epetra_BKS_norestart_test_MPI_4 failing in seveal ATDM builds.
Created by: fryeguy52
CC: @trilinos/anasazi, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
Triggered by the PR #3951 merged to 'develop' on 10/28/2018 that worked around Intel 18.0.2 MKL GEEV defect. Next: Try updated Intel MKL 18.0.5 on 'mutrino' (with local revert of #3951) and see all of these failures go away (@fryeguy52) ...
Description
As shown in this query the test:
- Anasazi_Epetra_BKS_norestart_test_MPI_4
is failing in the builds:
- Trilinos-atdm-mutrino-intel-opt-openmp-HSW (since ???)
- Trilinos-atdm-mutrino-intel-opt-openmp-KNL (since ???)
- Trilinos-atdm-cee-rhel6-intel-17.0.1-intelmpi-5.1.2-serial-static-opt (since 11/30/2018)
- Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt (11/29/2018 & 12/1/2018)
- Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt (on 12/2/2018)
- Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt (on 12/10/2018)
Looks like some of these failures are random like shown for the build Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt and the build Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt.
The errors look like here for example:
Number of iterations performed in BlockKrylovSchur_test.exe: 30
Direct residual norms computed in BlockKrylovSchur_test.exe
Eigenvalue Residual
----------------------------------------
1.199112e+05 1.296543e-07
1.196455e+05 1.185550e-07
1.192047e+05 4.530562e-04
1.185918e+05 1.497329e-04
1.178109e+05 4.552932e-04
End Result: TEST FAILED
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[25128,1],1]
Exit code: 255
--------------------------------------------------------------------------
...
Current Status on CDash
The current status of these tests/builds for the current testing day can be found here
Steps to Reproduce
One should be able to reproduce this failure on a machine with a cee rhel6 environment as described in:
More specifically, the commands given for a machine with a cee rhel6 environment are provided at:
The exact commands to reproduce this issue should be:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-cee-rhel6-intel-17.0.1-intelmpi-5.1.2-serial-static-opt
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Anasazi=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -j16