Test failures in ATDM config gnu debug builds on Power8/9 machines
Created by: fryeguy52
CC: @trilinos/tpetra, @trilinos/belos , @srajama1 (Trilinos Linear Solvers Product Lead) @kddevin (Trilinos Data Services Product Lead) @bartlettroscoe
## Next Action Status
PR #3420, merged on 9/10/2018, fixed all but one test in the gnu-debug-openmp
builds on 'white'/'ride' as of 9/11/2018; the remaining test,
TpetraCore_gemm_MPI_1, is timing out. Next: make TpetraCore_gemm_MPI_1
run faster in that build, or disable it?
## Description
As shown in this query (white/ride) and this query (waterman), the following tests:
- Anasazi_Tpetra_BlockDavidson_Lap_test_MPI_4
- Anasazi_Tpetra_BlockKrylovSchur_Lap_test_MPI_4
- Anasazi_Tpetra_IRTR_Lap_test_MPI_4
- Anasazi_Tpetra_MVOPTester_MPI_4
- Anasazi_Tpetra_TraceMin_largest_standard_test_MPI_4
- Anasazi_Tpetra_TraceMin_smallest_proj_test_MPI_4
- Anasazi_Tpetra_TraceMin_smallest_schur_test_MPI_4
- Anasazi_Tpetra_TraceMinDavidson_largest_standard_test_MPI_4
- Belos_Issue_3235_MPI_2
- Belos_SolverFactory_MPI_4
- Belos_Tpetra_BlockGMRES_hb_test_MPI_4
- Belos_Tpetra_MultipleSolves_MPI_4
- Belos_Tpetra_MVOPTester_complex_test_MPI_4
- Ifpack2_AdditiveSchwarz_RILUK_MPI_4
- Ifpack2_Cheby_belos_MPI_1
- Ifpack2_GS_belos_MPI_1
- Ifpack2_ILUT_5w_2_MPI_1
- Ifpack2_ILUT_5w_no_diag_MPI_1
- Ifpack2_ILUT_belos_MPI_1
- Ifpack2_ILUT_hb_belos_MPI_2
- Ifpack2_ILUT_hb_belos_MPI_4
- Ifpack2_Jac_sm_belos_MPI_1
- Ifpack2_Jacobi_belos_constGraph_MPI_4
- Ifpack2_Jacobi_belos_MPI_1
- Ifpack2_Jacobi_hb_belos_MPI_1
- Ifpack2_Jacobi_hb_belos_MPI_2
- Ifpack2_RILUK_hb_belos_MPI_2
- Ifpack2_RILUK_hb_belos_MPI_4
- Ifpack2_SGS_belos_MPI_1
- Ifpack2_small_gmres_belos_MPI_1
- MueLu_Maxwell3D-Tpetra-Stratimikos_MPI_4
- MueLu_Stratimikos_MPI_4
- MueLu_Stratimikos2_MPI_4
- NOX_Tpetra_1DFEM_MPI_4
- NOX_Tpetra_Heq_MPI_4
- NOX_Tpetra_MultiVectorOpsTests_MPI_4
- PanzerAdaptersSTK_CurlLaplacianExample
- PanzerAdaptersSTK_main_driver_energy-ss
- PanzerAdaptersSTK_main_driver_energy-ss-blocked-tp
- PanzerAdaptersSTK_MixedPoissonExample
- PanzerAdaptersSTK_projection_MPI_2
- PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_1
- PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4
- PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_1
- PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_4
- Teko_DiagonallyScaledPreconditioner_MPI_1
- Teko_testdriver_tpetra_MPI_1
- Teko_testdriver_tpetra_MPI_2
- ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_MPI_4
- ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1
- TpetraCore_gemm_MPI_1
- TpetraCore_MultiVector_UnitTests_MPI_4
are failing in the builds:
- Trilinos-atdm-waterman-gnu-debug-openmp
- Trilinos-atdm-white-ride-gnu-debug-openmp (on both white and ride)
Many of the failing tests have the following output in common:
```
 ** On entry to DGEMM  parameter number  8 had an illegal value
 ** On entry to DGEMM  parameter number  8 had an illegal value
 ** On entry to DGEMM  parameter number  8 had an illegal value
 ** On entry to DGEMM  parameter number  8 had an illegal value
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 0 on
node waterman2 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).

You can avoid this message by specifying -quiet on the mpiexec command line.
--------------------------------------------------------------------------
```
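For context on the XERBLA message above: in the reference BLAS calling convention, the eighth argument to `DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, ...)` is `LDA`, the leading dimension of `A`, which must be at least the number of rows of `A` as stored (`M` when `TRANSA = 'N'`, else `K`). A minimal sketch of that validity check (a hypothetical helper for illustration, not Trilinos or BLAS code):

```python
def dgemm_lda_ok(transa: str, m: int, k: int, lda: int) -> bool:
    """Check the constraint that reference BLAS enforces on DGEMM's
    parameter 8 (LDA); violating it triggers XERBLA with INFO = 8.

    When TRANSA = 'N', A is stored as an M-row matrix, so LDA >= max(1, M);
    otherwise A is stored as a K-row matrix, so LDA >= max(1, K).
    """
    nrowa = m if transa.upper() == "N" else k
    return lda >= max(1, nrowa)


# Example: a 4x3 A (TRANSA='N') passed with LDA=2 is illegal.
print(dgemm_lda_ok("N", 4, 3, 4))  # True
print(dgemm_lda_ok("N", 4, 3, 2))  # False
```

An illegal `LDA` reaching DGEMM usually means the caller computed a stride from a zero-sized or inconsistently-shaped local block, which is consistent with this showing up only in certain debug/MPI configurations.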
## Steps to Reproduce (white/ride)
One should be able to reproduce this failure on the machines ride/white as described in:
- https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

More specifically, the commands given for the system ride/white are provided at:
- https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite

The exact commands to reproduce this issue should be:
```shell
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-openmp
$ cmake \
    -GNinja \
    -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
    -DTrilinos_ENABLE_TESTS=ON \
    -DTrilinos_ENABLE_<PACKAGE_NAME>=ON \
    $TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
```

(where `<PACKAGE_NAME>` is some package you want to build and run tests for.)
## Steps to Reproduce (waterman)
One should be able to reproduce this failure on the machine waterman as described in:
- https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

More specifically, the commands given for the system waterman are provided at:
- https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#waterman

The exact commands to reproduce this issue should be:
```shell
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-openmp
$ cmake \
    -GNinja \
    -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
    -DTrilinos_ENABLE_TESTS=ON \
    -DTrilinos_ENABLE_<PACKAGE_NAME>=ON \
    $TRILINOS_DIR
$ make NP=20
$ bsub -x -Is -n 20 ctest -j20
```

(where `<PACKAGE_NAME>` is some package you want to build and run tests for.)