Anasazi_Epetra_GeneralizedDavidson_nh_test_MPI_4 in many ATDM builds
Created by: fryeguy52
CC: @trilinos/anasazi, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
The merge of PR #4031 to 'develop' on 12/13/2018 seems to have resulted in the test Anasazi_Epetra_GeneralizedDavidson_nh_test_MPI_4 passing in all ATDM Trilinos builds. It passed in all 41 ATDM Trilinos builds on 2018-12-19, as shown in this query (and there were no missing builds for testing day 2018-12-19, so these should be complete test results).
Description
As shown in this query, the test Anasazi_Epetra_GeneralizedDavidson_nh_test_MPI_4 has failed in many ATDM builds since 11/24/2018. The builds where it has failed in that time are:
- Trilinos-atdm-sems-rhel6-intel-opt-openmp
- Trilinos-atdm-mutrino-intel-opt-openmp-KNL
- Trilinos-atdm-mutrino-intel-opt-openmp-HSW
- Trilinos-atdm-chama-intel-opt-openmp
- Trilinos-atdm-chama-intel-debug-openmp
- Trilinos-atdm-cee-rhel6-intel-17.0.1-intelmpi-5.1.2-serial-static-opt
- Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt
- Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt
- Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt
The test has been failing every day since 11/29/2018 in the builds:
- Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt
- Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt
- Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt
The test output looks like this in these cases:
Building Map
Setting up info for filling matrix
Creating matrix
Filling matrix
Calling FillComplete on matrix
Setting Anasazi parameters
Creating initial vector for solver
Creating eigenproblem
Creating eigensolver (GeneralizedDavidsonSolMgr)
Solving eigenproblem
[ceerws1113:51638] *** An error occurred in MPI_Allreduce
[ceerws1113:51638] *** reported by process [999489537,2]
[ceerws1113:51638] *** on communicator MPI_COMM_WORLD
[ceerws1113:51638] *** MPI_ERR_IN_STATUS: error code in status
[ceerws1113:51638] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ceerws1113:51638] *** and potentially your MPI job)
[ceerws1113:51629] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[ceerws1113:51629] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Current Status on CDash
The current status of this test on all ATDM builds can be found here
History for the last week on ATDM builds can be seen here
Steps to Reproduce on CEE RHEL6
One should be able to reproduce this failure on a machine with a cee rhel6 environment because it has been failing there every day. The process is described in:
More specifically, the commands given for a machine with a cee rhel6 environment are provided at:
The exact commands to reproduce this issue should be:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-op
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Anasazi=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -j16
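Since only one test is in question, it may be quicker to run just that test rather than the full Anasazi suite. The following is a minimal sketch, assuming it is run from the same build directory after the build completes; the helper name `run_failing_test` is hypothetical and not part of the Trilinos or ATDM scripts. It relies only on the standard `ctest` options `-R` (select tests by regex) and `--output-on-failure`.

```shell
# Name of the failing test, as reported on CDash.
TEST_NAME=Anasazi_Epetra_GeneralizedDavidson_nh_test_MPI_4

# Hypothetical helper: run only this test and print its output if it fails.
# Extra arguments (for example -VV) are passed through to ctest.
run_failing_test() {
  ctest -R "^${TEST_NAME}\$" --output-on-failure "$@"
}
```

Calling `run_failing_test` from the build directory should then show the MPI_Allreduce error directly in the terminal if the failure reproduces.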