MueLu_UnitTests[Blocked][Epetra|Tpetra]_MPI_4 failing randomly on several ATDM builds
Created by: fryeguy52
CC: @trilinos/MueLu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe
Next Action Status
PR #4046, merged to 'develop' on 12/18/2018, may fix these random failures. Next: Watch over the coming days and weeks to see if there are any more failures ...
Description
As shown in the links below, the tests:
- MueLu_UnitTestsBlockedEpetra_MPI_4
- MueLu_UnitTestsEpetra_MPI_4
- MueLu_UnitTestsEpetra_MPI_1
- MueLu_UnitTestsTpetra_MPI_1
are failing randomly across several builds. They have failed several times in the last month. The builds where we have seen failures are:
- Trilinos-atdm-cee-rhel6-gnu-4.9.3-opt-serial
- Trilinos-atdm-cee-rhel6-gnu-opt-serial
- Trilinos-atdm-cee-rhel6-intel-opt-serial
- Trilinos-atdm-hansen-shiller-gnu-opt-openmp
- Trilinos-atdm-hansen-shiller-gnu-opt-serial
- Trilinos-atdm-hansen-shiller-intel-debug-openmp
- Trilinos-atdm-hansen-shiller-intel-debug-serial
- Trilinos-atdm-mutrino-intel-opt-openmp-HSW
- Trilinos-atdm-mutrino-intel-opt-openmp-KNL
- Trilinos-atdm-sems-rhel6-gnu-debug-openmp
- Trilinos-atdm-sems-rhel6-intel-opt-openmp
- Trilinos-atdm-serrano-intel-opt-openmp
- Trilinos-atdm-waterman-gnu-opt-openmp
- Trilinos-atdm-waterman-gnu-release-debug-openmp
- Trilinos-atdm-white-ride-cuda-9.2-opt
- Trilinos-atdm-white-ride-gnu-opt-openmp
In each case, something similar to the following appears in the 'openmp' builds:
...
p=0: *** Caught standard std::exception of type 'Xpetra::Exceptions::RuntimeError' :
EpetraExt::MatrixMarketFileToCrsMatrix return value of -1
[FAILED] (0.0902 sec) Hierarchy_double_int_int_Kokkos_Compat_KokkosOpenMPWrapperNode_Write_UnitTest
Location: /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-intel-debug-openmp/SRC_AND_BUILD/Trilinos/packages/muelu/test/unit_tests/Hierarchy.cpp:889
...
The following tests FAILED:
116. Hierarchy_double_int_int_Kokkos_Compat_KokkosOpenMPWrapperNode_Write_UnitTest ...
...
and the 'serial' builds show:
...
p=0: *** Caught standard std::exception of type 'Xpetra::Exceptions::RuntimeError' :
EpetraExt::MatrixMarketFileToCrsMatrix return value of -1
[FAILED] (0.00618 sec) Hierarchy_double_int_int_Kokkos_Compat_KokkosSerialWrapperNode_Write_UnitTest
Location: /jenkins/slave/workspace/Trilinos-atdm-sems-rhel6-gnu-debug-serial/SRC_AND_BUILD/Trilinos/packages/muelu/test/unit_tests/Hierarchy.cpp:889
...
The following tests FAILED:
116. Hierarchy_double_int_int_Kokkos_Compat_KokkosSerialWrapperNode_Write_UnitTest ...
...
In every case only one unit test fails, test 116. It is called Hierarchy_double_int_int_Kokkos_Compat_KokkosSerialWrapperNode_Write_UnitTest
in the 'serial' builds and Hierarchy_double_int_int_Kokkos_Compat_KokkosOpenMPWrapperNode_Write_UnitTest
in the 'openmp' builds.
The first failure showed up on 2018-10-21.
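When checking saved test output from many builds by hand, a small filter can pull out the names of the failing unit tests. The following is only a sketch: the function name `failed_unit_tests` is made up for illustration, and it assumes the failure lines have the `[FAILED] (<time> sec) <name>` shape seen in the excerpts above.

```shell
#!/bin/sh
# failed_unit_tests (hypothetical helper): read test output on stdin and
# print the unique names of the unit tests reported as failed.
# Assumes each failure line looks like:
#   [FAILED] (0.0902 sec) <unit test name>
failed_unit_tests() {
  # Keep lines whose first field is "[FAILED]" and print the last field,
  # which is the unit test name; sort -u drops repeated names.
  awk '$1 == "[FAILED]" { print $NF }' | sort -u
}
```

For example, `failed_unit_tests < saved_output.log` (with `saved_output.log` standing in for any captured test log) should list only the one `..._Write_UnitTest` name if the pattern described above holds for that build.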
Current Status on CDash
To see failures for these tests in the last month click here.
Steps to Reproduce
This may be very difficult to reproduce because the failure occurs infrequently on any single build, even though it shows up nearly every other day somewhere across all the builds. Instructions for reproducing ATDM builds can be found at:
More specifically, the commands given for ride or white are provided at:
The exact commands to reproduce one build where this has failed on white or ride are:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-gnu-opt-openmp
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
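Because the failure is intermittent, a single test run will often pass. One way to chase it is to rerun the tests in a loop until the first failure. The helper below is a minimal sketch, not part of the ATDM scripts; the name `repeat_until_fail` and the run counts are assumptions chosen for illustration.

```shell
#!/bin/sh
# repeat_until_fail N CMD [ARGS...] (hypothetical helper)
# Run CMD up to N times. Stop at the first failing run and return its
# exit status; return 0 if all N runs pass.
repeat_until_fail() {
  max_runs=$1
  shift
  run=1
  while [ "$run" -le "$max_runs" ]; do
    if "$@"; then
      run=$((run + 1))
    else
      rc=$?   # exit status of the failing command
      echo "run $run of $max_runs failed with exit status $rc" >&2
      return "$rc"
    fi
  done
  echo "all $max_runs runs passed" >&2
  return 0
}
```

For example, `repeat_until_fail 20 ctest -R MueLu_UnitTests --output-on-failure` would rerun just the MueLu unit test executables up to 20 times; on white or ride the call would presumably still need to be wrapped in the `bsub` command shown above. ctest also has a built-in `--repeat-until-fail <n>` option that serves a similar purpose, if the CMake version in the loaded environment supports it.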