Use ninja job pool limits to fix cuda+rdc+static builds on 'ride' and 'waterman' (#4502)
Created by: bartlettroscoe
This PR should fix the cuda+rdc+static builds on 'waterman' and 'ride' (#4502 (closed)). I built from complete scratch on 'ride' and 'waterman'; both builds completed with no build errors at all, and the only test failures were those that we already knew about for the Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt build on 'ride' and 'white'.
This required an update to TriBITS to take advantage of the CMake Ninja job pool levels. See:
After this merge, we will be able to promote the builds:
- Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug
- Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug
to the "ATDM" CDash group since they should be 100% clean (at least they are for this Trilinos version). Later, we could perhaps set up a cuda+rdc+shared PR build to protect RDC before merging to 'develop'.
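For readers unfamiliar with the mechanism, the Ninja job pool limits this PR relies on can be sketched in plain CMake roughly as follows. This is a minimal illustration, not the TriBITS implementation: the pool names `compile_pool` and `link_pool` and the limits 16 and 2 are hypothetical, and the TriBITS-level option names are not shown in this description. `JOB_POOLS`, `CMAKE_JOB_POOL_COMPILE`, and `CMAKE_JOB_POOL_LINK` are standard CMake features that only take effect with the Ninja generator.

```cmake
# Sketch only: pool names and limits below are hypothetical examples.

# Declare two Ninja job pools with independent concurrency limits.
set_property(GLOBAL PROPERTY JOB_POOLS compile_pool=16 link_pool=2)

# Use these pools as the default for all targets created after this point.
# These variables initialize the JOB_POOL_COMPILE and JOB_POOL_LINK
# target properties, so they must be set before the targets are added.
set(CMAKE_JOB_POOL_COMPILE compile_pool)
set(CMAKE_JOB_POOL_LINK link_pool)
```

With pools like these, an overall `ninja -j 32` can still keep many compile jobs running while only a small number of link jobs run at once, which is the usual motivation when link steps need far more memory than compiles.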
Details of the test runs, including build and test results, are given below.
Build and test results summary
Testing out on 'ride':
$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CHECKIN/
$ bsub -x -Is -q rhel7F -n 16 \
./checkin-test-atdm.sh \
cuda-9.2-gnu-7.2.0-rdc-release-debug-pt \
--enable-all-packages=on --local-do-all --wipe-clean
That actually passed, showing:
passed: Trilinos/cuda-9.2-gnu-7.2.0-rdc-release-debug-pt: passed=2042,notpassed=0
Thu Mar 28 22:50:21 MDT 2019
Enabled Packages:
Enabled all Packages
Hostname: ride12
Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
Build Dir: /home/rabartl/Trilinos.base/BUILDS/RIDE/CHECKIN/cuda-9.2-gnu-7.2.0-rdc-release-debug-pt
CMake Cache Varibles: -GNinja -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=NIGHTLY -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=600.0 -GNinja -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=OFF
Make Options: -j 32
CTest Options: -j 8
Pull: Not Performed
Configure: Passed (3.70 min)
Build: Passed (221.32 min)
Test: Passed (45.78 min)
100% tests passed, 0 tests failed out of 2042
Subproject Time Summary:
Amesos2 = 67.44 sec*proc (8 tests)
Anasazi = 381.67 sec*proc (74 tests)
Belos = 529.80 sec*proc (100 tests)
Ifpack2 = 494.89 sec*proc (45 tests)
Intrepid2 = 601.38 sec*proc (267 tests)
Kokkos = 185.34 sec*proc (27 tests)
KokkosKernels = 194.02 sec*proc (8 tests)
MueLu = 3177.65 sec*proc (108 tests)
NOX = 359.50 sec*proc (105 tests)
Panzer = 10880.13 sec*proc (165 tests)
Phalanx = 37.60 sec*proc (27 tests)
Piro = 28.83 sec*proc (12 tests)
Rythmos = 77.59 sec*proc (83 tests)
SEACAS = 22.15 sec*proc (23 tests)
STK = 5.68 sec*proc (4 tests)
Sacado = 176.82 sec*proc (300 tests)
Stratimikos = 42.22 sec*proc (39 tests)
Teko = 588.52 sec*proc (18 tests)
Tempus = 415.07 sec*proc (80 tests)
Teuchos = 156.92 sec*proc (137 tests)
Thyra = 113.77 sec*proc (82 tests)
Tpetra = 1754.58 sec*proc (201 tests)
Xpetra = 358.29 sec*proc (18 tests)
Zoltan2 = 959.38 sec*proc (111 tests)
Total Test time (real) = 2746.69 sec
Total time for cuda-9.2-gnu-7.2.0-rdc-release-debug-pt = 270.80 min
It is good news that this passed. It means the ATDM Trilinos builds targeting the ATDM APPs pass with RDC after this change!
Unfortunately, that is not all of the PT packages :-( Something went wrong with the new 'pt' build-name keyword support. I see the error. The var name was wrong in:
IF (NOT ATDM_CONFIG_PT_PACKAGES)
INCLUDE("${CMAKE_CURRENT_LIST_DIR}/ATDMDisables.cmake")
ENDIF()
I fixed that and am now trying again on 'ride':
$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CHECKIN/
$ bsub -x -Is -q rhel7F -n 16 \
./checkin-test-atdm.sh \
cuda-9.2-gnu-7.2.0-rdc-release-debug-pt \
--enable-all-packages=on --local-do-all --wipe-clean
This time it is not disabling the packages ATDM is not using:
Final set of non-enabled packages: Pliris Claps Trios Komplex TriKota Moertel PyTrilinos NewPackage 8
This gave the result:
FAILED: Trilinos/cuda-9.2-gnu-7.2.0-rdc-release-debug-pt: passed=2970,notpassed=20
Fri Mar 29 12:38:25 MDT 2019
Enabled Packages:
Enabled all Packages
Hostname: ride12
Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
Build Dir: /home/rabartl/Trilinos.base/BUILDS/RIDE/CHECKIN/cuda-9.2-gnu-7.2.0-rdc-release-debug-pt
CMake Cache Varibles: -GNinja -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=NIGHTLY -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=600.0 -GNinja -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=OFF
Make Options: -j 32
CTest Options: -j 8
Pull: Not Performed
Configure: Passed (6.43 min)
Build: Passed (353.43 min)
Test: FAILED (86.80 min)
99% tests passed, 20 tests failed out of 2990
Subproject Time Summary:
Amesos = 32.54 sec*proc (13 tests)
Amesos2 = 62.88 sec*proc (8 tests)
Anasazi = 380.36 sec*proc (74 tests)
AztecOO = 28.74 sec*proc (17 tests)
Belos = 534.52 sec*proc (100 tests)
Domi = 356.08 sec*proc (125 tests)
Epetra = 82.97 sec*proc (63 tests)
EpetraExt = 29.09 sec*proc (10 tests)
FEI = 54.33 sec*proc (43 tests)
Galeri = 22.28 sec*proc (9 tests)
GlobiPack = 3.53 sec*proc (6 tests)
Ifpack = 109.05 sec*proc (48 tests)
Ifpack2 = 490.22 sec*proc (45 tests)
Intrepid = 392.89 sec*proc (144 tests)
Intrepid2 = 543.77 sec*proc (267 tests)
Isorropia = 14.28 sec*proc (6 tests)
Kokkos = 175.29 sec*proc (27 tests)
KokkosKernels = 184.60 sec*proc (8 tests)
ML = 85.52 sec*proc (34 tests)
MiniTensor = 4.31 sec*proc (2 tests)
MueLu = 3237.20 sec*proc (108 tests)
NOX = 337.59 sec*proc (106 tests)
OptiPack = 8.73 sec*proc (5 tests)
Panzer = 17893.59 sec*proc (165 tests)
Phalanx = 38.91 sec*proc (27 tests)
Pike = 4.07 sec*proc (7 tests)
Piro = 111.38 sec*proc (13 tests)
ROL = 5758.31 sec*proc (180 tests)
RTOp = 20.73 sec*proc (24 tests)
Rythmos = 75.11 sec*proc (83 tests)
SEACAS = 20.46 sec*proc (23 tests)
STK = 145.06 sec*proc (17 tests)
Sacado = 170.38 sec*proc (300 tests)
Shards = 1.89 sec*proc (4 tests)
ShyLU_DD = 449.88 sec*proc (37 tests)
ShyLU_Node = 3.90 sec*proc (5 tests)
Stokhos = 1184.27 sec*proc (84 tests)
Stratimikos = 40.50 sec*proc (39 tests)
Teko = 598.76 sec*proc (18 tests)
Tempus = 436.99 sec*proc (80 tests)
Teuchos = 159.24 sec*proc (137 tests)
Thyra = 106.49 sec*proc (82 tests)
Tpetra = 1664.15 sec*proc (201 tests)
TrilinosCouplings = 246.86 sec*proc (26 tests)
TrilinosFrameworkTests = 6.10 sec*proc (4 tests)
Triutils = 4.40 sec*proc (2 tests)
Xpetra = 317.21 sec*proc (18 tests)
Zoltan = 3178.22 sec*proc (35 tests)
Zoltan2 = 943.44 sec*proc (111 tests)
Total Test time (real) = 5207.77 sec
The following tests FAILED:
573 - Zoltan_ch_ewgt_zoltan_parallel (Failed)
574 - Zoltan_ch_grid20x19_zoltan_parallel (Failed)
578 - Zoltan_ch_nograph_zoltan_parallel (Failed)
581 - Zoltan_ch_simple_zoltan_parallel (Failed)
2629 - ROL_test_elementwise_TpetraMultiVector_MPI_4 (Failed)
2750 - ROL_example_PDE-OPT_0ld_poisson_example_01_MPI_4 (Failed)
2751 - ROL_example_PDE-OPT_0ld_stefan-boltzmann_example_03_MPI_4 (Failed)
2754 - ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4 (Failed)
2755 - ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4 (Failed)
2759 - ROL_example_PDE-OPT_stefan-boltzmann_example_01_MPI_4 (Failed)
2761 - ROL_example_PDE-OPT_stefan-boltzmann_example_03_MPI_4 (Failed)
2763 - ROL_example_PDE-OPT_navier-stokes_example_01_MPI_4 (Timeout)
2764 - ROL_example_PDE-OPT_navier-stokes_example_02_MPI_4 (Failed)
2765 - ROL_example_PDE-OPT_obstacle_example_01_MPI_4 (Failed)
2770 - ROL_example_PDE-OPT_nonlinear-elliptic_example_01_MPI_4 (Failed)
2771 - ROL_example_PDE-OPT_nonlinear-elliptic_example_02_MPI_4 (Failed)
2772 - ROL_example_PDE-OPT_topo-opt_poisson_example_01_MPI_4 (Failed)
2930 - PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Tri-Order-1 (Timeout)
2968 - TrilinosCouplings_Example_Maxwell_MueLu_MPI_1 (Failed)
2969 - TrilinosCouplings_Example_Maxwell_MueLu_MPI_4 (Failed)
Errors while running CTest
We already know about those failing tests in issues #4042, #3749, #3542 (closed).
What is interesting is that the STK test in issue #3544 (closed) is not failing.
I made the same changes on 'waterman' and am now testing there as well:
$ cd /home/rabartl/Trilinos.base/BUILDS/WATERMAN/CHECKIN/
$ bsub -x -Is -n 20 \
./checkin-test-atdm.sh \
cuda-9.2-rdc-release-debug \
--enable-all-packages=on --local-do-all --wipe-clean
That fully passed with the result:
passed: Trilinos/cuda-9.2-rdc-release-debug: passed=2030,notpassed=0
Fri Mar 29 09:27:06 MDT 2019
Enabled Packages:
Enabled all Packages
Hostname: waterman1
Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
Build Dir: /home/rabartl/Trilinos.base/BUILDS/WATERMAN/CHECKIN/cuda-9.2-rdc-release-debug
CMake Cache Varibles: -GNinja -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=NIGHTLY -DTrilinos_ALLOW_NO_PACKAG
Make Options: -j 32
CTest Options: -j 8
Pull: Not Performed
Configure: Passed (4.60 min)
Build: Passed (190.97 min)
Test: Passed (55.86 min)
100% tests passed, 0 tests failed out of 2030
Subproject Time Summary:
Amesos2 = 126.73 sec*proc (8 tests)
Anasazi = 711.36 sec*proc (74 tests)
Belos = 867.57 sec*proc (100 tests)
Ifpack2 = 960.22 sec*proc (45 tests)
Intrepid2 = 1095.80 sec*proc (267 tests)
Kokkos = 269.76 sec*proc (27 tests)
KokkosKernels = 171.75 sec*proc (8 tests)
MueLu = 6511.60 sec*proc (110 tests)
NOX = 451.95 sec*proc (105 tests)
Panzer = 6610.07 sec*proc (154 tests)
Phalanx = 78.25 sec*proc (27 tests)
Piro = 42.59 sec*proc (12 tests)
Rythmos = 90.85 sec*proc (83 tests)
SEACAS = 22.77 sec*proc (20 tests)
STK = 8.57 sec*proc (4 tests)
Sacado = 322.10 sec*proc (300 tests)
Stratimikos = 50.45 sec*proc (39 tests)
Teko = 814.35 sec*proc (18 tests)
Tempus = 632.85 sec*proc (80 tests)
Teuchos = 187.10 sec*proc (137 tests)
Thyra = 147.04 sec*proc (82 tests)
Tpetra = 3683.06 sec*proc (201 tests)
Xpetra = 700.12 sec*proc (18 tests)
Zoltan2 = 1951.14 sec*proc (111 tests)
Total Test time (real) = 3351.41 sec
Total time for cuda-9.2-rdc-release-debug = 251.43 min
So we can go ahead and promote that build once this merges!