Created Oct 02, 2018 by James Willenbring (@jmwille, Maintainer)

ROL tests failing in targeted CUDA PR build Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt

Created by: bartlettroscoe

CC: @trilinos/rol , @rppawlo (Trilinos Nonlinear Solvers Product Area Lead)

Next Action Status

Description

The ROL package has 66 failing tests in the build Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt on 'white' and 'ride', as shown here, which lists the failing tests:

  • ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4
  • ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4
  • ROL_example_PDE-OPT_0ld_poisson_example_01_MPI_4
  • ROL_example_PDE-OPT_0ld_stefan-boltzmann_example_03_MPI_4
  • ROL_example_PDE-OPT_navier-stokes_example_01_MPI_4
  • ROL_example_PDE-OPT_navier-stokes_example_02_MPI_4
  • ROL_example_PDE-OPT_nonlinear-elliptic_example_01_MPI_4
  • ROL_example_PDE-OPT_nonlinear-elliptic_example_02_MPI_4
  • ROL_example_PDE-OPT_obstacle_example_01_MPI_4
  • ROL_example_PDE-OPT_stefan-boltzmann_example_01_MPI_4
  • ROL_example_PDE-OPT_stefan-boltzmann_example_03_MPI_4
  • ROL_example_PDE-OPT_topo-opt_poisson_example_01_MPI_4
  • ROL_example_tempus_example_parabolic_modeleval_MPI_1
  • ROL_example_tempus_example_parabolic_thyravec_MPI_1
  • ROL_test_elementwise_TpetraMultiVector_MPI_4

The detailed output of the first failing test, ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4, shown here, contains:

Total number of processors: 4
Number of nodes = 1089
Number of cells = 1024
Number of edges = 2112
Cell offsets across processors: {0, 256, 512, 768}
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaGetLastError() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
  what():  cudaGetLastError() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available


[white27:11203] *** Process received signal ***
[white27:11204] *** Process received signal ***
[white27:11203] Signal: Aborted (6)
[white27:11203] Signal code:  (-6)
[white27:11204] Signal: Aborted (6)
[white27:11204] Signal code:  (-6)
[white27:11203] [ 0] [white27:11204] [ 0] [0x3fff90070478]
[white27:11203] [ 1] [0x3fffa3f00478]
...

Spot-checking the output of several of the other tests shows errors like those above.
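The spot check above could be made systematic. The following is a minimal sketch, not part of the original report, that lists every test in a CTest log whose output contains the cudaErrorIllegalAddress message. It assumes the tests were run with ctest from the build directory, so that the log is at Testing/Temporary/LastTest.log, and that each test section in that log starts with a line of the form "N/M Testing: <name>". The function name list_illegal_address_tests is hypothetical.

```shell
# Hypothetical helper: print the name of every test in a CTest log whose
# output contains the CUDA illegal-address error.
# $1: path to the log (assumed: Testing/Temporary/LastTest.log in the build dir)
list_illegal_address_tests() {
  awk '
    # Assumed section header, e.g. "12/66 Testing: ROL_test_..."
    /^[0-9]+\/[0-9]+ Testing: / { name = $3; next }
    # Report each affected test once, even if several ranks print the error
    /cudaErrorIllegalAddress/ && name != "" && !(name in seen) {
      seen[name] = 1
      print name
    }
  ' "$1"
}
```

For example, `list_illegal_address_tests Testing/Temporary/LastTest.log | wc -l` would give the number of tests that hit this error, which can then be compared against the 66 reported failures.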

This build is important because we are targeting it on 'white' and 'ride' as a Trilinos PR testing build (see #2464 (closed)). Also, SPARC uses ROL, and as part of https://software-sandbox.sandia.gov/jira/browse/TRIL-212 we are about to update the ATDM Trilinos configuration to test ROL on many platforms (including CUDA builds), so it is critical to get these tests cleaned up for ATDM.

Steps to reproduce

One should be able to reproduce these test failures on either 'white' or 'ride' by cloning the Trilinos git repo, checking out the 'develop' branch, creating a build directory, and then doing:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-release-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_ROL=ON \
  $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j16
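Once the build exists, it may be quicker to rerun a single failing test than the whole suite. The sketch below is an assumption layered on the steps above, not something from the original report: it reuses the same bsub options and adds the standard ctest flags -R (select tests by regular expression) and -VV (extra verbose output). The function name rerun_rol_test is hypothetical.

```shell
# Hypothetical helper: rerun exactly one named test with full output.
# The bsub options are copied from the reproduction steps; -R and -VV are
# standard ctest options. Run this from the build directory.
# $1: exact test name, anchored so that it does not match longer names
rerun_rol_test() {
  bsub -x -Is -q rhel7F -n 16 ctest -R "^$1\$" -VV
}
```

For example, `rerun_rol_test ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4` would rerun only the first failing test listed above.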