Trilinos issue #3897

Created Nov 19, 2018 by James Willenbring (@jmwille, Maintainer)

MueLu_UnitTests[Blocked][Epetra|Tpetra]_MPI_4 failing randomly on several ATDM builds

Created by: fryeguy52

CC: @trilinos/MueLu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe

Next Action Status

PR #4046, merged to 'develop' on 12/18/2018, may fix these random failures. Next: Watch over the coming days and weeks to see whether any more failures occur ...

Description

As shown in the links below, the tests:

  • MueLu_UnitTestsBlockedEpetra_MPI_4
  • MueLu_UnitTestsEpetra_MPI_4
  • MueLu_UnitTestsEpetra_MPI_1
  • MueLu_UnitTestsTpetra_MPI_1

are randomly failing across several builds. They have failed several times in the last month on different builds. The builds where we have seen failures are:

  • Trilinos-atdm-cee-rhel6-gnu-4.9.3-opt-serial
  • Trilinos-atdm-cee-rhel6-gnu-opt-serial
  • Trilinos-atdm-cee-rhel6-intel-opt-serial
  • Trilinos-atdm-hansen-shiller-gnu-opt-openmp
  • Trilinos-atdm-hansen-shiller-gnu-opt-serial
  • Trilinos-atdm-hansen-shiller-intel-debug-openmp
  • Trilinos-atdm-hansen-shiller-intel-debug-serial
  • Trilinos-atdm-mutrino-intel-opt-openmp-HSW
  • Trilinos-atdm-mutrino-intel-opt-openmp-KNL
  • Trilinos-atdm-sems-rhel6-gnu-debug-openmp
  • Trilinos-atdm-sems-rhel6-intel-opt-openmp
  • Trilinos-atdm-serrano-intel-opt-openmp
  • Trilinos-atdm-waterman-gnu-opt-openmp
  • Trilinos-atdm-waterman-gnu-release-debug-openmp
  • Trilinos-atdm-white-ride-cuda-9.2-opt
  • Trilinos-atdm-white-ride-gnu-opt-openmp

In each case, output similar to the following appears in the 'openmp' builds:

...

p=0: *** Caught standard std::exception of type 'Xpetra::Exceptions::RuntimeError' :

  EpetraExt::MatrixMarketFileToCrsMatrix return value of -1
 [FAILED]  (0.0902 sec) Hierarchy_double_int_int_Kokkos_Compat_KokkosOpenMPWrapperNode_Write_UnitTest
 Location: /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-intel-debug-openmp/SRC_AND_BUILD/Trilinos/packages/muelu/test/unit_tests/Hierarchy.cpp:889
...

The following tests FAILED:
    116. Hierarchy_double_int_int_Kokkos_Compat_KokkosOpenMPWrapperNode_Write_UnitTest ... 

...

and the 'serial' builds show:

...

p=0: *** Caught standard std::exception of type 'Xpetra::Exceptions::RuntimeError' :
 
  EpetraExt::MatrixMarketFileToCrsMatrix return value of -1
 [FAILED]  (0.00618 sec) Hierarchy_double_int_int_Kokkos_Compat_KokkosSerialWrapperNode_Write_UnitTest
 Location: /jenkins/slave/workspace/Trilinos-atdm-sems-rhel6-gnu-debug-serial/SRC_AND_BUILD/Trilinos/packages/muelu/test/unit_tests/Hierarchy.cpp:889

...

The following tests FAILED:
    116. Hierarchy_double_int_int_Kokkos_Compat_KokkosSerialWrapperNode_Write_UnitTest ... 

...

Only one unit test fails: test 116, which is called Hierarchy_double_int_int_Kokkos_Compat_KokkosSerialWrapperNode_Write_UnitTest in the 'serial' builds and Hierarchy_double_int_int_Kokkos_Compat_KokkosOpenMPWrapperNode_Write_UnitTest in the 'openmp' builds.
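
Because every failing log carries the same `[FAILED]` line, a saved test log can be scanned for the failing unit test names. The following is a minimal sketch, assuming the test output has been saved to a local file. The function name `failed_unit_tests` is hypothetical, and the pattern relies only on the `[FAILED]  (<time> sec) <test name>` line format seen in the excerpts above.

```shell
#!/bin/sh
# Print the unique names of unit tests reported as [FAILED] in a test log
# read from stdin. Assumes the line format seen in the excerpts above:
#   [FAILED]  (<time> sec) <test name>
# The ctest summary line "The following tests FAILED:" is not matched,
# because it lacks the square brackets.
failed_unit_tests() {
  grep -F '[FAILED]' | awk '{ print $NF }' | sort -u
}
```

For example, `failed_unit_tests < MueLu_UnitTests.log` (a hypothetical file name) would list each failing unit test once.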

The first failure showed up on 2018-10-21.

Current Status on CDash

To see failures for these tests in the last month click here.

Steps to Reproduce

This may be very difficult to reproduce: the failure is infrequent on any single build, but it occurs nearly every other day when counted across all the builds. Instructions for reproducing ATDM builds can be found at:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

More specifically, the commands for ride or white are provided at:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite

The exact commands to reproduce one build where this has failed on white or ride are:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-gnu-opt-openmp
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
 $TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
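
Because the failure is random, a single ctest run will usually pass. One way to hunt for it is to rerun the tests until the first failure. The following is a minimal sketch. The helper `repeat_until_fail` is hypothetical and is not part of Trilinos or the ATDM scripts, and the run count is an arbitrary choice.

```shell
#!/bin/sh
# Run a command up to max_runs times, stopping at the first failure.
# Returns 0 if every run passed, 1 as soon as one run fails.
repeat_until_fail() {
  max_runs=$1
  shift
  i=1
  while [ "$i" -le "$max_runs" ]; do
    if ! "$@"; then
      echo "run $i of $max_runs failed: $*" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "all $max_runs runs passed: $*"
  return 0
}
```

From the build directory, something like `repeat_until_fail 50 ctest -R MueLu_UnitTests -j16` would rerun only the tests whose names match `MueLu_UnitTests`. On white or ride this would presumably need to run inside the `bsub` allocation shown above. Recent CMake versions also provide a `ctest --repeat-until-fail <n>` option that serves a similar purpose.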