New failing tests in ATDM debug builds of Trilinos due to KOKKOS_ENABLE_DEBUG=ON being set
Created by: bartlettroscoe
CC: @trilinos/kokkos, @trilinos/kokkos-kernels, @trilinos/amesos2 , @trilinos/anasazi, @trilinos/panzer
Next Action Status
PR #2476 fixed two of the tests on 3/30/2018, and PR #2494 disabled a single unit test on 4/3/2018 that is not appropriate to run on GPUs.
Description
As shown in the query:
several tests are timing out today and failing in the ATDM -debug
builds of Trilinos:
Amesos2_KLU2_UnitTests_MPI_2
Anasazi_Epetra_ModalSolversTester_MPI_4
KokkosCore_UnitTest_Cuda_MPI_1
KokkosKernels_sparse_cuda_MPI_1
PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4
The failing tests from the above query, and the platforms on which they fail, are listed in the table below:
Site | Build Name | Test Name | Status | Time | Details |
---|---|---|---|---|---|
hansen | Trilinos-atdm-hansen-shiller-cuda-debug | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.09 | Completed (Timeout) |
hansen | Trilinos-atdm-hansen-shiller-gnu-debug-openmp | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 8.51 | Completed (Failed) |
hansen | Trilinos-atdm-hansen-shiller-gnu-debug-serial | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.51 | Completed (Timeout) |
hansen | Trilinos-atdm-hansen-shiller-intel-debug-openmp | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 2.39 | Completed (Failed) |
hansen | Trilinos-atdm-hansen-shiller-intel-debug-serial | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.1 | Completed (Timeout) |
ride | Trilinos-atdm-white-ride-cuda-debug | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.05 | Completed (Timeout) |
white | Trilinos-atdm-white-ride-cuda-debug | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 600.04 | Completed (Timeout) |
white | Trilinos-atdm-white-ride-gnu-debug-openmp | Amesos2_KLU2_UnitTests_MPI_2 | Failed | 1.31 | Completed (Failed) |
ride | Trilinos-atdm-white-ride-cuda-debug | Anasazi_Epetra_ModalSolversTester_MPI_4 | Failed | 0.84 | Completed (Failed) |
ride | Trilinos-atdm-white-ride-cuda-debug | Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 | Failed | 0.71 | Completed (Failed) |
hansen | Trilinos-atdm-hansen-shiller-cuda-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Failed | 103.19 | Completed (Failed) |
ride | Trilinos-atdm-white-ride-cuda-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Failed | 213.56 | Completed (Failed) |
white | Trilinos-atdm-white-ride-cuda-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Failed | 213.74 | Completed (Failed) |
hansen | Trilinos-atdm-hansen-shiller-cuda-debug | KokkosKernels_sparse_cuda_MPI_1 | Failed | 16.49 | Completed (Failed) |
ride | Trilinos-atdm-white-ride-cuda-debug | KokkosKernels_sparse_cuda_MPI_1 | Failed | 2.43 | Completed (Failed) |
white | Trilinos-atdm-white-ride-cuda-debug | KokkosKernels_sparse_cuda_MPI_1 | Failed | 2.39 | Completed (Failed) |
hansen | Trilinos-atdm-hansen-shiller-cuda-debug | PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 | Failed | 21.95 | Completed (Failed) |
ride | Trilinos-atdm-white-ride-cuda-debug | PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 | Failed | 24.31 | Completed (Failed) |
white | Trilinos-atdm-white-ride-cuda-debug | PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 | Failed | 24.9 | Completed (Failed) |
Except for the failing Anasazi tests in the build Trilinos-atdm-white-ride-cuda-debug (for which I will write another GitHub issue), all of these tests (even the timeouts) seem to be failing because debug-mode checks enabled by KOKKOS_ENABLE_DEBUG=ON (see #2439 (closed)) are failing and throwing exceptions. For example, the failing test Amesos2_KLU2_UnitTests_MPI_2 shows:
5. KLU2_double_int_int_NonContgGID_UnitTest ...
p=0: *** Caught standard std::exception of type 'std::runtime_error' :
View bounds error of view MV::DualView ( -1 < 6 , 0 < 1 )
Traceback functionality not available
[FAILED] (0.0885 sec) KLU2_double_int_int_NonContgGID_UnitTest
Location: /home/rabartl/WHITE/ATDM_Driver/Trilinos-atdm-white-ride-cuda-debug/SRC_AND_BUILD/Trilinos/packages/amesos2/test/solvers/KLU2_UnitTests.cpp:383
This exception causes a hang and a timeout in some cases, and a quick failure and abort in others. (So much for assuming that one MPI process throwing an exception will bring down an MPI job in all cases.)
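For readers unfamiliar with this kind of check, the following is a minimal sketch of a debug-only bounds check that throws a std::runtime_error on an out-of-range index. It is not the Kokkos implementation: the class `CheckedView2D`, the runtime `debug_checks` flag, and the helper `try_access` are hypothetical names made up for illustration, and the message text only imitates the format seen in the output above.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a labeled 2-D view; NOT the Kokkos::View API.
class CheckedView2D {
public:
  CheckedView2D(std::string label, long rows, long cols, bool debug_checks)
      : label_(std::move(label)), rows_(rows), cols_(cols),
        debug_checks_(debug_checks),
        data_(static_cast<std::size_t>(rows * cols), 0.0) {}

  // Element access. The index check runs only when debug_checks_ is true
  // (assumed here to play the role of a debug-build option).
  double& operator()(long i, long j) {
    if (debug_checks_ && (i < 0 || i >= rows_ || j < 0 || j >= cols_)) {
      std::ostringstream msg;
      msg << "View bounds error of view " << label_ << " ( " << i << " < "
          << rows_ << " , " << j << " < " << cols_ << " )";
      throw std::runtime_error(msg.str());
    }
    // Without the check, a bad index reaches this line unchecked.
    return data_[static_cast<std::size_t>(i * cols_ + j)];
  }

private:
  std::string label_;
  long rows_;
  long cols_;
  bool debug_checks_;
  std::vector<double> data_;
};

// Writes to view(i, j) and returns "ok", or the exception message if the
// bounds check fired.
std::string try_access(CheckedView2D& view, long i, long j) {
  try {
    view(i, j) = 1.0;
    return "ok";
  } catch (const std::runtime_error& e) {
    return e.what();
  }
}
```

With a 6 x 1 view labeled "MV::DualView" and checks enabled, `try_access(view, -1, 0)` returns a message of the same shape as the one in the test output. With checks disabled, the same bad index would go through unchecked, which is one plausible reason an existing indexing error only shows up once debug checks are turned on.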
Many of these builds have been promoted to the "ATDM" CDash group/track and therefore triggered CDash error emails today. Therefore, this must get fixed quickly if possible (or we will need to demote these builds again).
Steps to Reproduce
One can log onto white (SON) or ride (SRN) and then reproduce the build and tests as described at:
I just reproduced many of these failures on 'white' using:
$ ssh white
$ cd ~/Trilinos.base/BUILD/WHITE/CHECKIN/
$ bsub -x -I -q rhel7F -n 16 \
./checkin-test-atdm.sh cuda-debug --enable-packages=Kokkos,KokkosKernels,Amesos2,Panzer --local-do-all
...
FAILED (NOT READY TO PUSH): Trilinos: white22
Wed Mar 28 11:03:05 MDT 2018
Enabled Packages: Kokkos, KokkosKernels, Amesos2, Panzer
Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) cuda-debug => FAILED: passed=189,notpassed=5 => Not ready to push! (120.13 min)
REQUESTED ACTIONS: FAILED
This showed the test results:
97% tests passed, 5 tests failed out of 194
Subproject Time Summary:
Amesos2 = 1232.97 sec*proc (8 tests)
Kokkos = 954.94 sec*proc (26 tests)
KokkosKernels = 870.46 sec*proc (8 tests)
Panzer = 7490.79 sec*proc (152 tests)
Total Test time (real) = 1518.22 sec
The following tests FAILED:
2 - KokkosCore_UnitTest_Cuda_MPI_1 (Failed)
28 - KokkosKernels_sparse_cuda_MPI_1 (Failed)
35 - Amesos2_KLU2_UnitTests_MPI_2 (Timeout)
174 - PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 (Timeout)
192 - PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4 (Failed)
Errors while running CTest
Total time for cuda-debug = 120.13 min
The timeout of the test PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 was also seen in #2446. I am not sure why that test timed out when run locally but not in the driver jobs. Otherwise, this one build reproduced all of the failing tests shown on CDash except for the test Anasazi_Epetra_ModalSolversTester_MPI_4 (which does not look to be related to KOKKOS_ENABLE_DEBUG=ON).
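The comparison between the CDash failures and the local failures can be cross-checked mechanically. The sketch below uses only the test names that appear in the table and in the local CTest output above; the helper `only_in` and the two set names are made up for illustration.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Names in `a` that are absent from `b`. std::set iterates in sorted order,
// which std::set_difference requires.
std::vector<std::string> only_in(const std::set<std::string>& a,
                                 const std::set<std::string>& b) {
  std::vector<std::string> out;
  std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                      std::back_inserter(out));
  return out;
}

// Distinct failing test names from the CDash table.
const std::set<std::string> cdash_failures = {
    "Amesos2_KLU2_UnitTests_MPI_2",
    "Anasazi_Epetra_ModalSolversTester_MPI_4",
    "Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4",
    "KokkosCore_UnitTest_Cuda_MPI_1",
    "KokkosKernels_sparse_cuda_MPI_1",
    "PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4"};

// Failing test names from the local cuda-debug run on 'white'.
const std::set<std::string> local_failures = {
    "KokkosCore_UnitTest_Cuda_MPI_1",
    "KokkosKernels_sparse_cuda_MPI_1",
    "Amesos2_KLU2_UnitTests_MPI_2",
    "PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1",
    "PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4"};
```

Here `only_in(cdash_failures, local_failures)` yields the two Anasazi tests from the table, and `only_in(local_failures, cdash_failures)` yields only the PanzerAdaptersSTK timeout. Note that Anasazi was not among the packages enabled in the local run (Kokkos, KokkosKernels, Amesos2, Panzer), so its tests were not run locally at all.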
Related Issues
- Related to: #2439 (closed), #2464 (closed)