New Tpetra and Zoltan2 test failures in all ATDM Trilinos CUDA builds starting 12/21/2018
Created by: bartlettroscoe
CC: @trilinos/tpetra, @trilinos/zoltan2, @kddevin (Trilinos Data Services Product Lead), @bartlettroscoe, @fryeguy52, @tjfulle
Next Action Status
PR #4124 that reverts PR #4070 was merged to 'develop' on 12/21/2018 and addressed the issue in testing on 12/22/2018.
Description
As shown in this query, the tests:
- TpetraCore_CrsGraph_insertGlobalIndicesFiltered_MPI_2
- TpetraCore_CrsGraph_StaticImportExport_MPI_4
- TpetraCore_CrsGraph_UnitTests0_MPI_4
- TpetraCore_CrsGraph_UnpackIntoStaticGraph_MPI_4
- TpetraCore_Issue601_MPI_4
- Zoltan2_TpetraRowGraphInput_MPI_4
- Zoltan2_XpetraCrsGraphInput_MPI_4
- Zoltan2_XpetraTraits_MPI_4
are newly failing in all of the ATDM Trilinos CUDA builds:
- Trilinos-atdm-waterman-cuda-9.2-debug
- Trilinos-atdm-waterman-cuda-9.2-opt
- Trilinos-atdm-waterman-cuda-9.2-release-debug
- Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
- Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
- Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug
Looking at the test output for the failing tests in the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug,
most of the failing tests show errors like the one from TpetraCore_CrsGraph_insertGlobalIndicesFiltered_MPI_2:
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available
[white22:69850] *** Process received signal ***
[white22:69850] Signal: Aborted (6)
[white22:69850] Signal code: (-6)
while TpetraCore_CrsGraph_UnpackIntoStaticGraph_MPI_4 instead shows:
block: [0,0,0], thread: [0,0,0] Assertion `Kokkos::View ERROR: attempt to access inaccessible memory space` failed.
:0: : block: [0,0,0], thread: [0,1,0] Assertion `Kokkos::View ERROR: attempt to access inaccessible memory space` failed.
:0: : block: [0,0,0], thread: [0,2,0] Assertion `Kokkos::View ERROR: attempt to access inaccessible memory space` failed.
:0: : block: [0,0,0], thread: [0,3,0] Assertion `Kokkos::View ERROR: attempt to access inaccessible memory space` failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaMemcpy( dst , src , n , cudaMemcpyDefault ) error( cudaErrorAssert): device-side assert triggered /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:89
Traceback functionality not available
[white22:75371] *** Process received signal ***
[white22:75371] Signal: Aborted (6)
[white22:75371] Signal code: (-6)
Nearly all of the failing tests, with the exception of TpetraCore_CrsGraph_UnpackIntoStaticGraph_MPI_4
shown above, show:
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered ...
The new commits that were pulled on the day these failures started are shown, for example, here. From looking over that set of commits, it seems very likely that the commit 9ec2271b from @tjfulle, merged in from PR #4070 on 12/20/2018 (yesterday), is the cause.
Current Status on CDash
The current status of these CUDA builds can be found at:
- ATDM Trilinos CUDA Builds for current testing day
- ATDM Trilinos CUDA Builds non-passing tests for current testing day
Steps to Reproduce
One should be able to reproduce this failure on any of the machines with a CUDA build as described in:
More specifically, the commands given for the system 'white' or 'ride' are provided at:
The exact commands to reproduce this issue should be:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-gnu-7.2.0-release-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Tpetra=ON -DTrilinos_ENABLE_Zoltan2=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
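
To rerun only the eight tests listed in the Description rather than the full Tpetra and Zoltan2 suites, the test names can be joined into a regular expression for `ctest -R`. The following is a minimal sketch, not part of the documented ATDM workflow. The variable names `failing_tests` and `regex` are illustrative, and the script only prints the resulting command so that it can be checked before it is run.

```shell
# Failing test names copied from the list in the Description.
failing_tests="TpetraCore_CrsGraph_insertGlobalIndicesFiltered_MPI_2
TpetraCore_CrsGraph_StaticImportExport_MPI_4
TpetraCore_CrsGraph_UnitTests0_MPI_4
TpetraCore_CrsGraph_UnpackIntoStaticGraph_MPI_4
TpetraCore_Issue601_MPI_4
Zoltan2_TpetraRowGraphInput_MPI_4
Zoltan2_XpetraCrsGraphInput_MPI_4
Zoltan2_XpetraTraits_MPI_4"

# Join the names with '|' and anchor the pattern so that only exact
# test names match (ctest -R takes a regular expression).
regex="^($(printf '%s' "$failing_tests" | tr '\n' '|'))\$"

# Print the command instead of running it (assumption: it is run from
# the build directory after the configure and build steps above).
echo "ctest -j16 -R '$regex'"
```

On 'white' or 'ride' the printed `ctest` command would presumably be wrapped in the same `bsub` invocation as the full test run above.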