
Fix ROL CUDA build failure (#3072)

Created by: bartlettroscoe

CC: @trilinos/rol, @dridzal (ROL package lead)

Description

Fixes the ROL CUDA build failure described in #3072 (closed). The fix was trivial (not sure why other compilers did not catch this or at least provide a warning).

I also included a commit to add debug print info for nvcc_wrapper (see kokkos/nvcc_wrapper#19 and kokkos/nvcc_wrapper#20).

Motivation and Context

ROL was not building for a CUDA build (see #3072 (closed)). We would like an auto PR CUDA build that includes all Primary Tested (PT) packages, and ROL is a PT package (see #2464 (closed)). Also, SPARC uses ROL, and adding support for SPARC means testing ROL on all of the platforms where SPARC uses it. CUDA is an important build on many of those platforms.

How Has This Been Tested?

I tested this on 'white' with:

$ cd ~/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/

$ source ~/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh cuda-debug
Hostname 'white11' matches known ATDM host 'white' and system 'ride'
ATDM_CONFIG_TRILNOS_DIR = /home/rabartl/Trilinos.base/Trilinos
Setting default compiler and build options for JOB_NAME='cuda-debug'
Using white/ride compiler stack CUDA to build DEBUG code with Kokkos node type CUDA

$ time cmake \
   -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_ROL=ON \
   ~/Trilinos.base/Trilinos \
   &> configure.out

real    1m43.759s
user    0m58.268s
sys     0m17.081s

$ time make NP=16 &> make.out

real    54m28.573s
user    696m12.668s
sys     80m53.877s

$ time bsub -x -Is -q rhel7F -n 16 ctest -j16 --timeout 600 &> ctest.out

real    14m51.969s
user    0m0.032s
sys     0m0.035s

The build passed and the test results were:

90% tests passed, 16 tests failed out of 156

Subproject Time Summary:
ROL    = 11219.28 sec*proc (156 tests)

Total Test time (real) = 890.82 sec

The following tests FAILED:
	 32 - ROL_test_elementwise_TpetraMultiVector_MPI_4 (Failed)
	130 - ROL_example_PDE-OPT_0ld_poisson_example_01_MPI_4 (Failed)
	131 - ROL_example_PDE-OPT_0ld_stefan-boltzmann_example_03_MPI_4 (Failed)
	134 - ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4 (Failed)
	135 - ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4 (Timeout)
	136 - ROL_example_PDE-OPT_0ld_stoch-adv-diff_example_01_MPI_4 (Timeout)
	137 - ROL_example_PDE-OPT_poisson_example_01_MPI_4 (Failed)
	139 - ROL_example_PDE-OPT_stefan-boltzmann_example_01_MPI_4 (Failed)
	141 - ROL_example_PDE-OPT_stefan-boltzmann_example_03_MPI_4 (Failed)
	142 - ROL_example_PDE-OPT_adv-diff-react_example_02_MPI_4 (Failed)
	143 - ROL_example_PDE-OPT_navier-stokes_example_01_MPI_4 (Timeout)
	144 - ROL_example_PDE-OPT_navier-stokes_example_02_MPI_4 (Failed)
	145 - ROL_example_PDE-OPT_obstacle_example_01_MPI_4 (Failed)
	150 - ROL_example_PDE-OPT_nonlinear-elliptic_example_01_MPI_4 (Failed)
	151 - ROL_example_PDE-OPT_nonlinear-elliptic_example_02_MPI_4 (Failed)
	152 - ROL_example_PDE-OPT_topo-opt_poisson_example_01_MPI_4 (Failed)
Errors while running CTest

Those are the same 16 tests already shown failing in the build Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once (for example, as shown here). (I will create a new GitHub issue for those failing tests once this PR is merged.)
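
A comparison like this can also be scripted rather than checked by eye. The following is only a minimal sketch and is not what was run for this PR. It assumes two saved ctest logs in the format shown above; the function names and the file name baseline-ctest.out are hypothetical.

```shell
# Print the sorted names of the tests listed as failing in a ctest log.
# The lines of interest look like: "<tab> 32 - Some_test_MPI_4 (Failed)".
# Any parenthesized status (Failed, Timeout, ...) is accepted.
extract_failed_tests() {
  sed -n -E 's/^[[:space:]]*[0-9]+ - ([^[:space:]]+) \([^)]*\)[[:space:]]*$/\1/p' "$1" \
    | LC_ALL=C sort
}

# Print the tests that fail in the current log ($2) but not in the baseline log ($1).
# Empty output means the current run adds no failures beyond the baseline.
new_failures() {
  baseline_list=$(mktemp)
  current_list=$(mktemp)
  extract_failed_tests "$1" > "$baseline_list"
  extract_failed_tests "$2" > "$current_list"
  # comm -13 keeps only the lines unique to the second sorted file.
  LC_ALL=C comm -13 "$baseline_list" "$current_list"
  rm -f "$baseline_list" "$current_list"
}
```

With both logs in the current directory, `new_failures baseline-ctest.out ctest.out` would print nothing if the two sets of failures match or if the current set is a subset of the baseline.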

Checklist

  • My commit messages mention the appropriate GitHub issue numbers.
