ROL tests failing in targeted CUDA PR build Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt
Created by: bartlettroscoe
CC: @trilinos/rol , @rppawlo (Trilinos Nonlinear Solvers Product Area Lead)
Next Action Status
Description
The ROL package has 66 failing tests in the build Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt
on 'white' and 'ride', as shown here. The failing tests are:
- ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4
- ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4
- ROL_example_PDE-OPT_0ld_poisson_example_01_MPI_4
- ROL_example_PDE-OPT_0ld_stefan-boltzmann_example_03_MPI_4
- ROL_example_PDE-OPT_navier-stokes_example_01_MPI_4
- ROL_example_PDE-OPT_navier-stokes_example_02_MPI_4
- ROL_example_PDE-OPT_nonlinear-elliptic_example_01_MPI_4
- ROL_example_PDE-OPT_nonlinear-elliptic_example_02_MPI_4
- ROL_example_PDE-OPT_obstacle_example_01_MPI_4
- ROL_example_PDE-OPT_stefan-boltzmann_example_01_MPI_4
- ROL_example_PDE-OPT_stefan-boltzmann_example_03_MPI_4
- ROL_example_PDE-OPT_topo-opt_poisson_example_01_MPI_4
- ROL_example_tempus_example_parabolic_modeleval_MPI_1
- ROL_example_tempus_example_parabolic_thyravec_MPI_1
- ROL_test_elementwise_TpetraMultiVector_MPI_4
The first failing test, ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4,
has the detailed output shown here, which includes:
Total number of processors: 4
Number of nodes = 1089
Number of cells = 1024
Number of edges = 2112
Cell offsets across processors: {0, 256, 512, 768}
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
what(): cudaGetLastError() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
[white27:11203] *** Process received signal ***
[white27:11204] *** Process received signal ***
[white27:11203] Signal: Aborted (6)
[white27:11203] Signal code: (-6)
[white27:11204] Signal: Aborted (6)
[white27:11204] Signal code: (-6)
[white27:11203] [ 0] [white27:11204] [ 0] [0x3fff90070478]
[white27:11203] [ 1] [0x3fffa3f00478]
...
A spot check of the output of several of the other failing tests shows errors like the one above in each of them.
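To check more systematically how widespread this error is, one could count the lines in the CTest log that carry the same signature. The following is only a minimal sketch: the helper name `count_illegal_address` is hypothetical (it is not part of Trilinos), and it assumes the tests were run with ctest from the build directory so that the log sits at CTest's default location, Testing/Temporary/LastTest.log.

```shell
#!/bin/sh
# Hypothetical helper (not part of Trilinos): count the lines in a log file
# that contain the CUDA illegal-address error signature quoted above.
count_illegal_address() {
  # grep -c prints the number of matching lines (0 when there are none);
  # "|| true" keeps a zero count from being reported as a failure.
  grep -c 'cudaErrorIllegalAddress' "$1" || true
}

# Example usage from the build directory (assumes CTest's default log path):
#   count_illegal_address Testing/Temporary/LastTest.log
```

Because several MPI ranks can print the same message, the count is of log lines, not of failing tests, so it only gives a rough indication of how common the error is.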
This is an important build because we are targeting it on 'white' and 'ride' as a Trilinos PR testing build (see #2464 (closed)). Also, SPARC uses ROL, and as part of https://software-sandbox.sandia.gov/jira/browse/TRIL-212 we are about to update the ATDM Trilinos configuration to test ROL on many platforms (including CUDA builds), so it is critical to get these tests cleaned up for ATDM.
Steps to reproduce
One should be able to reproduce these test failures on either 'white' or 'ride' by cloning the Trilinos git repo, checking out the 'develop' branch, creating a build directory, and then running:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-release-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_ROL=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16