Trilinos issue #3417

Test failures in ATDM config gnu debug builds on Power8/9 machines

Created by: fryeguy52

CC: @trilinos/tpetra, @trilinos/belos, @srajama1 (Trilinos Linear Solvers Product Lead), @kddevin (Trilinos Data Services Product Lead), @bartlettroscoe

Next Action Status

PR #3420, merged on 9/10/2018, fixed all but one of the failing tests in the gnu-debug-openmp builds on 'white'/'ride' as of 9/11/2018. The remaining test, TpetraCore_gemm_MPI_1, is timing out. Next: Make TpetraCore_gemm_MPI_1 run faster in that build, or disable it?

Description

As shown in this query (white/ride) and this query (waterman), the tests:

  • Anasazi_Tpetra_BlockDavidson_Lap_test_MPI_4
  • Anasazi_Tpetra_BlockKrylovSchur_Lap_test_MPI_4
  • Anasazi_Tpetra_IRTR_Lap_test_MPI_4
  • Anasazi_Tpetra_MVOPTester_MPI_4
  • Anasazi_Tpetra_TraceMin_largest_standard_test_MPI_4
  • Anasazi_Tpetra_TraceMin_smallest_proj_test_MPI_4
  • Anasazi_Tpetra_TraceMin_smallest_schur_test_MPI_4
  • Anasazi_Tpetra_TraceMinDavidson_largest_standard_test_MPI_4
  • Belos_Issue_3235_MPI_2
  • Belos_SolverFactory_MPI_4
  • Belos_Tpetra_BlockGMRES_hb_test_MPI_4
  • Belos_Tpetra_MultipleSolves_MPI_4
  • Belos_Tpetra_MVOPTester_complex_test_MPI_4
  • Ifpack2_AdditiveSchwarz_RILUK_MPI_4
  • Ifpack2_Cheby_belos_MPI_1
  • Ifpack2_GS_belos_MPI_1
  • Ifpack2_ILUT_5w_2_MPI_1
  • Ifpack2_ILUT_5w_no_diag_MPI_1
  • Ifpack2_ILUT_belos_MPI_1
  • Ifpack2_ILUT_hb_belos_MPI_2
  • Ifpack2_ILUT_hb_belos_MPI_4
  • Ifpack2_Jac_sm_belos_MPI_1
  • Ifpack2_Jacobi_belos_constGraph_MPI_4
  • Ifpack2_Jacobi_belos_MPI_1
  • Ifpack2_Jacobi_hb_belos_MPI_1
  • Ifpack2_Jacobi_hb_belos_MPI_2
  • Ifpack2_RILUK_hb_belos_MPI_2
  • Ifpack2_RILUK_hb_belos_MPI_4
  • Ifpack2_SGS_belos_MPI_1
  • Ifpack2_small_gmres_belos_MPI_1
  • MueLu_Maxwell3D-Tpetra-Stratimikos_MPI_4
  • MueLu_Stratimikos_MPI_4
  • MueLu_Stratimikos2_MPI_4
  • NOX_Tpetra_1DFEM_MPI_4
  • NOX_Tpetra_Heq_MPI_4
  • NOX_Tpetra_MultiVectorOpsTests_MPI_4
  • PanzerAdaptersSTK_CurlLaplacianExample
  • PanzerAdaptersSTK_main_driver_energy-ss
  • PanzerAdaptersSTK_main_driver_energy-ss-blocked-tp
  • PanzerAdaptersSTK_MixedPoissonExample
  • PanzerAdaptersSTK_projection_MPI_2
  • PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_1
  • PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4
  • PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_1
  • PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_4
  • Teko_DiagonallyScaledPreconditioner_MPI_1
  • Teko_testdriver_tpetra_MPI_1
  • Teko_testdriver_tpetra_MPI_2
  • ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_MPI_4
  • ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1
  • TpetraCore_gemm_MPI_1
  • TpetraCore_MultiVector_UnitTests_MPI_4

are failing in the builds:

  • Trilinos-atdm-waterman-gnu-debug-openmp
  • Trilinos-atdm-white-ride-gnu-debug-openmp (on both white and ride)

Many of the tests have the following output in common:

** On entry to DGEMM parameter number  8 had an illegal value
 ** On entry to DGEMM parameter number  8 had an illegal value
 ** On entry to DGEMM parameter number  8 had an illegal value
 ** On entry to DGEMM parameter number  8 had an illegal value
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 0 on
node waterman2 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).

You can avoid this message by specifying -quiet on the mpiexec command line.
--------------------------------------------------------------------------
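
For readers decoding the message above: in the standard DGEMM signature `DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)`, parameter number 8 is `LDA`, the leading dimension of `A`. The following is a minimal Python sketch of the argument checks as they appear in the reference BLAS. The function name `dgemm_first_illegal_param` is hypothetical and exists only for illustration. Other BLAS implementations may order or word their checks differently, and nothing here is taken from the Trilinos sources or identifies which call triggers the failure.

```python
def dgemm_first_illegal_param(transa, transb, m, n, k, lda, ldb, ldc):
    """Return the 1-based position of the first illegal DGEMM argument, or 0.

    Illustrative only: follows the order of checks in the reference BLAS.
    """
    ta, tb = transa.upper(), transb.upper()
    nrowa = m if ta == "N" else k  # rows of A as stored in memory
    nrowb = k if tb == "N" else n  # rows of B as stored in memory
    if ta not in ("N", "T", "C"):
        return 1
    if tb not in ("N", "T", "C"):
        return 2
    if m < 0:
        return 3
    if n < 0:
        return 4
    if k < 0:
        return 5
    if lda < max(1, nrowa):
        return 8   # the value reported in the output above
    if ldb < max(1, nrowb):
        return 10
    if ldc < max(1, m):
        return 13
    return 0


# Hypothetical inputs: A has 3 rows but is passed with lda = 0.
print(dgemm_first_illegal_param("N", "N", 3, 4, 2, lda=0, ldb=2, ldc=3))
# Hypothetical inputs: an empty A (m = 0) still requires lda >= 1.
print(dgemm_first_illegal_param("N", "N", 0, 4, 2, lda=0, ldb=2, ldc=1))
```

Both calls report 8. The second case shows that a leading dimension of 0 is rejected even when the matrix has no rows, because the check is `lda >= max(1, nrowa)`.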

Steps to Reproduce (white/ride)

One should be able to reproduce this failure on the machine ride/white as described in:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

More specifically, the commands given for the system ride/white are provided at:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-openmp

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_<PACKAGE_NAME>=ON \
  $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j16

(where <PACKAGE_NAME> is some package you want to build and run tests for.)
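
To re-run only specific failing tests rather than a whole package's suite, one option is ctest's `-R <regex>` filter. The sketch below builds such a command line from a list of test names. The helper `ctest_rerun_command` is hypothetical and not part of Trilinos or the ATDM scripts, and it assumes the names contain only letters, digits, `_` and `-` (true of the tests listed in this issue), so no regex escaping is applied.

```python
import shlex


def ctest_rerun_command(test_names, jobs=16):
    """Build a ctest command line that selects exactly the named tests."""
    # Anchor the alternation so that, for example, 'X_MPI_1' does not
    # also select a test named 'X_MPI_16'.
    pattern = "^(" + "|".join(test_names) + ")$"
    # Quote the pattern for the shell, since it contains '(', '|' and '$'.
    return "ctest -j{} -R {}".format(jobs, shlex.quote(pattern))


print(ctest_rerun_command(["TpetraCore_gemm_MPI_1", "Belos_Issue_3235_MPI_2"]))
```

On the machines above, the resulting `ctest` invocation would take the place of `ctest -j16` after the same `bsub` prefix.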

Steps to Reproduce (waterman)

One should be able to reproduce this failure on the machine waterman as described in:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

More specifically, the commands given for the system waterman are provided at:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#waterman

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-openmp

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_<PACKAGE_NAME>=ON \
  $TRILINOS_DIR

$ make NP=20

$ bsub -x -Is -n 20 ctest -j20

(where <PACKAGE_NAME> is some package you want to build and run tests for.)
