Created by: bartlettroscoe
Description
Changing the ctest parallel test level from the default of 10 to 4 on 'ascicgpu15' appears to have fixed all of the test failures in the build:
Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug
and therefore, I suspect, also in the build:
Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug
as the 'static' build has more errors than the 'shared' build. These builds support the ATDM APP Gemma (see TRIL-255).
Therefore, this change would seem to resolve issues #4599, #4790 (closed), and #4801 (closed).
See ATDV-144.
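As a rough illustration of the idea (not the actual change in this PR), the sketch below caps the ctest parallel level on GPU hosts. The helper name `pick_ctest_parallel_level` and the `ascicgpu*` hostname pattern are hypothetical; only the values 10 and 4 come from the description above. It prints the command rather than running it, and relies only on the standard `ctest -j <N>` option.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the Trilinos ATDM scripts:
# choose a ctest parallel level based on the hostname.
pick_ctest_parallel_level() {
  case "$1" in
    ascicgpu*) echo 4 ;;   # assumed pattern: GPU hosts get fewer concurrent tests
    *)         echo 10 ;;  # the previous default from the description
  esac
}

level="$(pick_ctest_parallel_level "$(uname -n)")"

# ctest takes the parallel level via -j <N>
# (the CTEST_PARALLEL_LEVEL environment variable is an alternative).
echo "would run: ctest -j ${level}"
```

Fewer concurrent tests means fewer processes sharing the GPU at once, which is the suspected cause of the failures.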
How Has This Been Tested?
I tested this on 'ascicgpu15', which should be identical to 'ascicgpu14', the machine that runs the builds:
Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug
Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug
I tested this on 'ascicgpu15' with:
$ cd /scratch/rabartl/Trilinos.base/BUILDS/ATDM/SEMS-RHEL7/CTEST_S/
$ ./ctest-s-local-test-driver.sh \
sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug
***
*** ./ctest-s-local-test-driver.sh sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug
***
ATDM_TRILINOS_DIR = '/scratch/rabartl/Trilinos.base/Trilinos'
Load some env to get python, cmake, etc ...
Hostname 'ascicgpu15' matches known ATDM host 'sems-rhel7' and system 'sems-rhel7'
Setting compiler and build options for buld name 'default'
Using SEMS RHEL7 compiler stack GNU-7.2.0 to build DEBUG code with Kokkos node type SERIAL
Running builds:
sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug
Running Jenkins driver Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug.sh ...
Creating directory: Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug
Creating directory: SRC_AND_BUILD
real 172m44.096s
user 1302m14.099s
sys 316m47.719s
That submitted to:
This resulted in all 1982 tests passing across 24 labels! (MueLu is disabled due to #4599.) So shoot, that appears to have fixed all of the runtime problems. It looks like we were overloading the GPU on 'ascicgpu14' (and hopefully 'ascicgpu15' is identical to 'ascicgpu14', which runs the Jenkins jobs at night). Also, the wall-clock test time only went up from 1h5m to 1h15m. This is a slam dunk.
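To put the wall-clock cost in perspective, here is a quick back-of-the-envelope calculation using only the two times quoted above (1h5m and 1h15m). The variable names are just for illustration, and the percentage is truncated by integer division.

```shell
#!/bin/sh
# Test wall-clock times quoted above, converted to minutes.
before_min=$((1 * 60 + 5))    # 1h5m at parallel level 10
after_min=$((1 * 60 + 15))    # 1h15m at parallel level 4

delta_min=$((after_min - before_min))
pct=$((delta_min * 100 / before_min))   # integer division, rounds down

echo "test wall-clock: +${delta_min} min (~${pct}% longer)"
```

So dropping the parallel level by more than half costs roughly ten extra minutes of test time.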