Use ctest -j4 for CUDA builds by default for sems-rhel7 (ATDV-144, #4599, #4790, #4801)

Created by: bartlettroscoe

Description

Changing the ctest parallel test level from the default of 10 to 4 on 'ascicgpu15' seems to have fixed all of the test failures in the build:

  • Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug

and therefore, I suspect, also in the build:

  • Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug

since the 'static' build had more errors than the 'shared' build. These builds support the ATDM APP Gemma (see TRIL-255).

Therefore, this change would seem to resolve the issues #4599, #4790 (closed), and #4801 (closed).

See ATDV-144.
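
For reference, here is a minimal sketch of how a driver script might pick the lower parallel level for CUDA builds. The function name `ctest_parallel_level` and the build-name pattern match are hypothetical and are not taken from the Trilinos ATDM scripts; only `ctest -j<N>` itself is standard CTest usage. The values 10 and 4 are the ones quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Trilinos ATDM scripts): choose the
# ctest parallel level from the build name.
ctest_parallel_level() {
  case "$1" in
    *cuda*) echo 4 ;;   # CUDA builds: fewer concurrent tests sharing the GPU
    *)      echo 10 ;;  # the default level quoted in this description
  esac
}

build_name="sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug"
level="$(ctest_parallel_level "$build_name")"

# Only print the command so the sketch also works without CTest installed.
echo "Would run: ctest -j${level}"
```

The command is only echoed here, not executed; in a real driver the chosen level would be passed to the actual ctest invocation.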

How Has This Been Tested?

I tested this on 'ascicgpu15', which should be identical to 'ascicgpu14', the machine that runs the builds:

  • Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug
  • Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug

The test run on 'ascicgpu15' was:

$ cd /scratch/rabartl/Trilinos.base/BUILDS/ATDM/SEMS-RHEL7/CTEST_S/

$ ./ctest-s-local-test-driver.sh \
  sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug

***
*** ./ctest-s-local-test-driver.sh  sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug
***

ATDM_TRILINOS_DIR = '/scratch/rabartl/Trilinos.base/Trilinos'

Load some env to get python, cmake, etc ...

Hostname 'ascicgpu15' matches known ATDM host 'sems-rhel7' and system 'sems-rhel7'
Setting compiler and build options for buld name 'default'
Using SEMS RHEL7 compiler stack GNU-7.2.0 to build DEBUG code with Kokkos node type SERIAL

Running builds:
    sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug


Running Jenkins driver Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug.sh ...

Creating directory: Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug
Creating directory: SRC_AND_BUILD

real    172m44.096s
user    1302m14.099s
sys     316m47.719s

That submitted to:

This resulted in all 1982 tests passing across 24 labels! (MueLu is disabled due to #4599.) So shoot, that appears to have fixed all of the runtime problems. It looks like we were overloading the GPU on 'ascicgpu14' (and hopefully 'ascicgpu15' is identical to 'ascicgpu14', which runs the Jenkins jobs at night). Also, the wall-clock test time only went up from 1h5m to 1h15m (about 15%). This is a slam dunk.
