Trilinos issue #2471

Closed
Created Mar 28, 2018 by James Willenbring (@jmwille, Maintainer)

New failing tests in ATDM debug builds of Trilinos due to KOKKOS_ENABLE_DEBUG=ON being set

Created by: bartlettroscoe

CC: @trilinos/kokkos, @trilinos/kokkos-kernels, @trilinos/amesos2, @trilinos/anasazi, @trilinos/panzer

Next Action Status

PR #2476 fixed two of the tests on 3/30/2018, and PR #2494 disabled a single unit test on 4/3/2018 that is not appropriate to run on GPUs.

Description

As shown in the query:

  • https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-03-28&limit=0&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=63&value2=-debug&field3=status&compare3=62&value3=passed

several tests are timing out today and failing in the ATDM -debug builds of Trilinos:

  • Amesos2_KLU2_UnitTests_MPI_2
  • Anasazi_Epetra_ModalSolversTester_MPI_4
  • KokkosCore_UnitTest_Cuda_MPI_1
  • KokkosKernels_sparse_cuda_MPI_1
  • PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4

The failing tests, and the platforms on which they fail, as returned by the above query, are shown in the table below:

Table of failing tests

| Site | Build Name | Test Name | Status | Time (s) | Details |
|---|---|---|---|---|---|
| hansen | Trilinos-atdm-hansen-shiller-cuda-debug | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.09 | Completed (Timeout) |
| hansen | Trilinos-atdm-hansen-shiller-gnu-debug-openmp | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 8.51 | Completed (Failed) |
| hansen | Trilinos-atdm-hansen-shiller-gnu-debug-serial | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.51 | Completed (Timeout) |
| hansen | Trilinos-atdm-hansen-shiller-intel-debug-openmp | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 2.39 | Completed (Failed) |
| hansen | Trilinos-atdm-hansen-shiller-intel-debug-serial | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.1 | Completed (Timeout) |
| ride | Trilinos-atdm-white-ride-cuda-debug | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.05 | Completed (Timeout) |
| white | Trilinos-atdm-white-ride-cuda-debug | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.04 | Completed (Timeout) |
| white | Trilinos-atdm-white-ride-gnu-debug-openmp | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 1.31 | Completed (Failed) |
| ride | Trilinos-atdm-white-ride-cuda-debug | Anasazi_Epetra_ModalSolversTester_MPI_4 | Failed | 0.84 | Completed (Failed) |
| ride | Trilinos-atdm-white-ride-cuda-debug | Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 | Failed | 0.71 | Completed (Failed) |
| hansen | Trilinos-atdm-hansen-shiller-cuda-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Failed | 103.19 | Completed (Failed) |
| ride | Trilinos-atdm-white-ride-cuda-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Failed | 213.56 | Completed (Failed) |
| white | Trilinos-atdm-white-ride-cuda-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Failed | 213.74 | Completed (Failed) |
| hansen | Trilinos-atdm-hansen-shiller-cuda-debug | KokkosKernels_sparse_cuda_MPI_1 | Failed | 16.49 | Completed (Failed) |
| ride | Trilinos-atdm-white-ride-cuda-debug | KokkosKernels_sparse_cuda_MPI_1 | Failed | 2.43 | Completed (Failed) |
| white | Trilinos-atdm-white-ride-cuda-debug | KokkosKernels_sparse_cuda_MPI_1 | Failed | 2.39 | Completed (Failed) |
| hansen | Trilinos-atdm-hansen-shiller-cuda-debug | PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 | Failed | 21.95 | Completed (Failed) |
| ride | Trilinos-atdm-white-ride-cuda-debug | PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 | Failed | 24.31 | Completed (Failed) |
| white | Trilinos-atdm-white-ride-cuda-debug | PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 | Failed | 24.9 | Completed (Failed) |

Except for the failing Anasazi tests in the build Trilinos-atdm-white-ride-cuda-debug (for which I will write another GitHub issue), all of these tests (even the timeouts) seem to be failing because the debug-mode checks enabled by KOKKOS_ENABLE_DEBUG=ON (see #2439 (closed)) are failing and throwing exceptions. For example, the failing test Amesos2_KLU2_UnitTests_MPI_2 shows:

5. KLU2_double_int_int_NonContgGID_UnitTest ... 
 
 p=0: *** Caught standard std::exception of type 'std::runtime_error' :
 
  View bounds error of view MV::DualView ( -1 < 6 , 0 < 1 )
  Traceback functionality not available
  
 [FAILED]  (0.0885 sec) KLU2_double_int_int_NonContgGID_UnitTest
 Location: /home/rabartl/WHITE/ATDM_Driver/Trilinos-atdm-white-ride-cuda-debug/SRC_AND_BUILD/Trilinos/packages/amesos2/test/solvers/KLU2_UnitTests.cpp:383

This exception causes a hang and a timeout in some cases, and a quick failure and abort in others. (So much for assuming that one MPI process throwing an exception will bring down an MPI job in all cases.)
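To make the kind of check behind the "View bounds error" message concrete, here is a minimal sketch of a bounds-checked 2D accessor. This is not Kokkos code: `CheckedView2D` and `tryAccess` are hypothetical names invented for illustration, and the sketch only mimics the observable behavior (an out-of-range index throws `std::runtime_error` with an "index < extent" message) rather than how Kokkos implements it.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a 2D view with debug-mode bounds checks.
// NOT the Kokkos implementation; it only imitates the reported behavior.
class CheckedView2D {
public:
  CheckedView2D(std::string label, long rows, long cols)
      : label_(std::move(label)), rows_(rows), cols_(cols),
        data_(static_cast<std::size_t>(rows * cols), 0.0) {}

  // Checks both indices against the extents before touching memory.
  double& operator()(long i, long j) {
    if (i < 0 || i >= rows_ || j < 0 || j >= cols_) {
      std::ostringstream msg;
      msg << "View bounds error of view " << label_
          << " ( " << i << " < " << rows_
          << " , " << j << " < " << cols_ << " )";
      throw std::runtime_error(msg.str());
    }
    return data_[static_cast<std::size_t>(i * cols_ + j)];
  }

private:
  std::string label_;
  long rows_;
  long cols_;
  std::vector<double> data_;
};

// Writes to view(i, j); returns the exception message on a bounds error,
// or an empty string if the access was valid.
std::string tryAccess(CheckedView2D& view, long i, long j) {
  try {
    view(i, j) = 1.0;
    return "";
  } catch (const std::runtime_error& e) {
    return e.what();
  }
}
```

With a 6 x 1 view labeled `MV::DualView`, calling `tryAccess(view, -1, 0)` returns a message of the same shape as the one in the test output. Without such a check, the same access would read or write out of bounds silently, which is presumably why these failures only surfaced once the debug checks were turned on.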

Many of these builds have been promoted to the "ATDM" CDash group/track and therefore triggered CDash error emails today. Therefore, this must get fixed quickly if possible (or we will need to demote these builds again).

Steps to Reproduce

One can log onto white (SON) or ride (SRN) and then reproduce the build and tests as described at:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite

I just reproduced many of these failures on 'white' using:

$ ssh white

$ cd ~/rilinos.base/BUILD/WHITE/CHECKIN/

$ bsub -x -I -q rhel7F -n 16 \
  ./checkin-test-atdm.sh cuda-debug --enable-packages=Kokkos,KokkosKernels,Amesos2,Panzer --local-do-all

...

FAILED (NOT READY TO PUSH): Trilinos: white22

Wed Mar 28 11:03:05 MDT 2018

Enabled Packages: Kokkos, KokkosKernels, Amesos2, Panzer

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) cuda-debug => FAILED: passed=189,notpassed=5 => Not ready to push! (120.13 min)


REQUESTED ACTIONS: FAILED

This showed the test results:

  97% tests passed, 5 tests failed out of 194
  
  Subproject Time Summary:
  Amesos2          = 1232.97 sec*proc (8 tests)
  Kokkos           = 954.94 sec*proc (26 tests)
  KokkosKernels    = 870.46 sec*proc (8 tests)
  Panzer           = 7490.79 sec*proc (152 tests)
  
  Total Test time (real) = 1518.22 sec
  
  The following tests FAILED:
  	  2 - KokkosCore_UnitTest_Cuda_MPI_1 (Failed)
  	 28 - KokkosKernels_sparse_cuda_MPI_1 (Failed)
  	 35 - Amesos2_KLU2_UnitTests_MPI_2 (Timeout)
  	174 - PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 (Timeout)
  	192 - PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 (Failed)
  Errors while running CTest
  
  Total time for cuda-debug = 120.13 min

The timeout of the test PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 was also seen in #2446. I am not sure why that test timed out when run locally but not in the driver jobs. Otherwise, this one build reproduced all of the failing tests shown on CDash except for the test Anasazi_Epetra_ModalSolversTester_MPI_4 (which does not appear to be related to KOKKOS_ENABLE_DEBUG=ON).
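When comparing a local run against the CDash results, it can help to pull the failed-test list out of the CTest summary mechanically. The following is a minimal sketch, assuming the summary has the layout shown above (a "The following tests FAILED:" line followed by "number - name (reason)" lines); `FailedTest` and `parseFailedTests` are hypothetical names, not part of CTest or the checkin-test scripts.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// One entry from the "The following tests FAILED:" list.
struct FailedTest {
  int number;          // CTest test number, e.g. 35
  std::string name;    // e.g. "Amesos2_KLU2_UnitTests_MPI_2"
  std::string reason;  // e.g. "Failed" or "Timeout"
};

// Extracts the failed tests from CTest summary text.
// Assumes lines of the form "  35 - Some_Test_Name (Timeout)".
std::vector<FailedTest> parseFailedTests(const std::string& ctestOutput) {
  static const std::regex pattern(R"(^\s*(\d+)\s+-\s+(\S+)\s+\(([^)]+)\)\s*$)");
  std::vector<FailedTest> failed;
  std::istringstream lines(ctestOutput);
  std::string line;
  bool inFailedSection = false;
  while (std::getline(lines, line)) {
    if (line.find("The following tests FAILED:") != std::string::npos) {
      inFailedSection = true;
      continue;
    }
    if (!inFailedSection) {
      continue;
    }
    std::smatch match;
    if (std::regex_match(line, match, pattern)) {
      failed.push_back({std::stoi(match[1].str()), match[2].str(), match[3].str()});
    } else {
      break;  // The first non-matching line ends the list.
    }
  }
  return failed;
}
```

Feeding the summary above through `parseFailedTests` would yield five entries, which can then be compared by name against the tests listed on CDash for the same build.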

Related Issues

  • Related to: #2439 (closed), #2464 (closed)