
Use ninja job pool limits to fix cuda+rdc+static builds on 'ride' and 'waterman' (#4502)

Created by: bartlettroscoe

This PR should fix the cuda+rdc+static builds on 'waterman' and 'ride' (#4502 (closed)). I built completely from scratch on 'ride' and 'waterman'. Both builds completed with no build errors, and the only test failures were those we already knew about for the Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt build on 'ride' and 'white'.

This required an update to TriBITS to take advantage of the CMake Ninja job pool levels. See:

After this merge, we will be able to promote the builds:

  • Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug

to the "ATDM" CDash group since they should be 100% clean (at least they are for this Trilinos version). Later, we could perhaps set up a cuda+rdc+shared PR build to protect RDC before merging to 'develop'.
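
For readers unfamiliar with the Ninja job pools mentioned above: this PR text does not show the TriBITS change itself, so the following is only a sketch of the underlying CMake mechanism, assuming the goal is to cap concurrent link steps separately from compile steps. The pool names `compile_pool` and `link_pool`, the helper `job_pool_args`, and the limits 32 and 4 are illustrative assumptions, not values taken from this PR. `CMAKE_JOB_POOLS`, `CMAKE_JOB_POOL_COMPILE`, and `CMAKE_JOB_POOL_LINK` are standard CMake variables (the first requires CMake 3.11 or later).

```shell
# Print the CMake arguments that define two Ninja job pools and route
# compile and link steps into them. One argument per output line.
#   $1 = max concurrent compile jobs (hypothetical value)
#   $2 = max concurrent link jobs    (hypothetical value)
job_pool_args() {
  compile_jobs=$1
  link_jobs=$2
  printf '%s\n' \
    "-GNinja" \
    "-DCMAKE_JOB_POOLS=compile_pool=${compile_jobs};link_pool=${link_jobs}" \
    "-DCMAKE_JOB_POOL_COMPILE=compile_pool" \
    "-DCMAKE_JOB_POOL_LINK=link_pool"
}

# Show the arguments for 32 compile jobs and 4 link jobs.
job_pool_args 32 4
```

With pools like these, `ninja -j 32` can still run up to 32 compile steps at once while never running more than 4 link steps concurrently.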

The builds and tests that were run, along with their results, are detailed below.

Build and test results summary

Testing out on 'ride':

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CHECKIN/

$ bsub -x -Is -q rhel7F -n 16 \
  ./ \
  cuda-9.2-gnu-7.2.0-rdc-release-debug-pt \
  --enable-all-packages=on --local-do-all --wipe-clean

That actually passed showing:

  passed: Trilinos/cuda-9.2-gnu-7.2.0-rdc-release-debug-pt: passed=2042,notpassed=0
  Thu Mar 28 22:50:21 MDT 2019
  Enabled Packages: 
  Enabled all Packages
  Hostname: ride12
  Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
  Build Dir: /home/rabartl/Trilinos.base/BUILDS/RIDE/CHECKIN/cuda-9.2-gnu-7.2.0-rdc-release-debug-pt
  Make Options: -j 32
  CTest Options: -j 8
  Pull: Not Performed
  Configure: Passed (3.70 min)
  Build: Passed (221.32 min)
  Test: Passed (45.78 min)
  100% tests passed, 0 tests failed out of 2042
  Subproject Time Summary:
  Amesos2          =  67.44 sec*proc (8 tests)
  Anasazi          = 381.67 sec*proc (74 tests)
  Belos            = 529.80 sec*proc (100 tests)
  Ifpack2          = 494.89 sec*proc (45 tests)
  Intrepid2        = 601.38 sec*proc (267 tests)
  Kokkos           = 185.34 sec*proc (27 tests)
  KokkosKernels    = 194.02 sec*proc (8 tests)
  MueLu            = 3177.65 sec*proc (108 tests)
  NOX              = 359.50 sec*proc (105 tests)
  Panzer           = 10880.13 sec*proc (165 tests)
  Phalanx          =  37.60 sec*proc (27 tests)
  Piro             =  28.83 sec*proc (12 tests)
  Rythmos          =  77.59 sec*proc (83 tests)
  SEACAS           =  22.15 sec*proc (23 tests)
  STK              =   5.68 sec*proc (4 tests)
  Sacado           = 176.82 sec*proc (300 tests)
  Stratimikos      =  42.22 sec*proc (39 tests)
  Teko             = 588.52 sec*proc (18 tests)
  Tempus           = 415.07 sec*proc (80 tests)
  Teuchos          = 156.92 sec*proc (137 tests)
  Thyra            = 113.77 sec*proc (82 tests)
  Tpetra           = 1754.58 sec*proc (201 tests)
  Xpetra           = 358.29 sec*proc (18 tests)
  Zoltan2          = 959.38 sec*proc (111 tests)
  Total Test time (real) = 2746.69 sec
  Total time for cuda-9.2-gnu-7.2.0-rdc-release-debug-pt = 270.80 min

It is good news that this passed. It means the ATDM Trilinos builds targeting the ATDM APPs pass with RDC after this change!

Unfortunately, that is not all of the PT packages :-( Something went wrong with the new 'pt' build-name keyword support. I see the error. The var name was wrong in:


I fixed that and am now trying again on 'ride':

$ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CHECKIN/

$ bsub -x -Is -q rhel7F -n 16 \
  ./ \
  cuda-9.2-gnu-7.2.0-rdc-release-debug-pt \
  --enable-all-packages=on --local-do-all --wipe-clean

This time it is not disabling the packages ATDM is not using:

Final set of non-enabled packages:  Pliris Claps Trios Komplex TriKota Moertel PyTrilinos NewPackage 8

This gave the result:

  FAILED: Trilinos/cuda-9.2-gnu-7.2.0-rdc-release-debug-pt: passed=2970,notpassed=20
  Fri Mar 29 12:38:25 MDT 2019
  Enabled Packages: 
  Enabled all Packages
  Hostname: ride12
  Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
  Build Dir: /home/rabartl/Trilinos.base/BUILDS/RIDE/CHECKIN/cuda-9.2-gnu-7.2.0-rdc-release-debug-pt
  Make Options: -j 32
  CTest Options: -j 8
  Pull: Not Performed
  Configure: Passed (6.43 min)
  Build: Passed (353.43 min)
  Test: FAILED (86.80 min)
  99% tests passed, 20 tests failed out of 2990
  Subproject Time Summary:
  Amesos                    =  32.54 sec*proc (13 tests)
  Amesos2                   =  62.88 sec*proc (8 tests)
  Anasazi                   = 380.36 sec*proc (74 tests)
  AztecOO                   =  28.74 sec*proc (17 tests)
  Belos                     = 534.52 sec*proc (100 tests)
  Domi                      = 356.08 sec*proc (125 tests)
  Epetra                    =  82.97 sec*proc (63 tests)
  EpetraExt                 =  29.09 sec*proc (10 tests)
  FEI                       =  54.33 sec*proc (43 tests)
  Galeri                    =  22.28 sec*proc (9 tests)
  GlobiPack                 =   3.53 sec*proc (6 tests)
  Ifpack                    = 109.05 sec*proc (48 tests)
  Ifpack2                   = 490.22 sec*proc (45 tests)
  Intrepid                  = 392.89 sec*proc (144 tests)
  Intrepid2                 = 543.77 sec*proc (267 tests)
  Isorropia                 =  14.28 sec*proc (6 tests)
  Kokkos                    = 175.29 sec*proc (27 tests)
  KokkosKernels             = 184.60 sec*proc (8 tests)
  ML                        =  85.52 sec*proc (34 tests)
  MiniTensor                =   4.31 sec*proc (2 tests)
  MueLu                     = 3237.20 sec*proc (108 tests)
  NOX                       = 337.59 sec*proc (106 tests)
  OptiPack                  =   8.73 sec*proc (5 tests)
  Panzer                    = 17893.59 sec*proc (165 tests)
  Phalanx                   =  38.91 sec*proc (27 tests)
  Pike                      =   4.07 sec*proc (7 tests)
  Piro                      = 111.38 sec*proc (13 tests)
  ROL                       = 5758.31 sec*proc (180 tests)
  RTOp                      =  20.73 sec*proc (24 tests)
  Rythmos                   =  75.11 sec*proc (83 tests)
  SEACAS                    =  20.46 sec*proc (23 tests)
  STK                       = 145.06 sec*proc (17 tests)
  Sacado                    = 170.38 sec*proc (300 tests)
  Shards                    =   1.89 sec*proc (4 tests)
  ShyLU_DD                  = 449.88 sec*proc (37 tests)
  ShyLU_Node                =   3.90 sec*proc (5 tests)
  Stokhos                   = 1184.27 sec*proc (84 tests)
  Stratimikos               =  40.50 sec*proc (39 tests)
  Teko                      = 598.76 sec*proc (18 tests)
  Tempus                    = 436.99 sec*proc (80 tests)
  Teuchos                   = 159.24 sec*proc (137 tests)
  Thyra                     = 106.49 sec*proc (82 tests)
  Tpetra                    = 1664.15 sec*proc (201 tests)
  TrilinosCouplings         = 246.86 sec*proc (26 tests)
  TrilinosFrameworkTests    =   6.10 sec*proc (4 tests)
  Triutils                  =   4.40 sec*proc (2 tests)
  Xpetra                    = 317.21 sec*proc (18 tests)
  Zoltan                    = 3178.22 sec*proc (35 tests)
  Zoltan2                   = 943.44 sec*proc (111 tests)
  Total Test time (real) = 5207.77 sec
  The following tests FAILED:
  	573 - Zoltan_ch_ewgt_zoltan_parallel (Failed)
  	574 - Zoltan_ch_grid20x19_zoltan_parallel (Failed)
  	578 - Zoltan_ch_nograph_zoltan_parallel (Failed)
  	581 - Zoltan_ch_simple_zoltan_parallel (Failed)
  	2629 - ROL_test_elementwise_TpetraMultiVector_MPI_4 (Failed)
  	2750 - ROL_example_PDE-OPT_0ld_poisson_example_01_MPI_4 (Failed)
  	2751 - ROL_example_PDE-OPT_0ld_stefan-boltzmann_example_03_MPI_4 (Failed)
  	2754 - ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4 (Failed)
  	2755 - ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4 (Failed)
  	2759 - ROL_example_PDE-OPT_stefan-boltzmann_example_01_MPI_4 (Failed)
  	2761 - ROL_example_PDE-OPT_stefan-boltzmann_example_03_MPI_4 (Failed)
  	2763 - ROL_example_PDE-OPT_navier-stokes_example_01_MPI_4 (Timeout)
  	2764 - ROL_example_PDE-OPT_navier-stokes_example_02_MPI_4 (Failed)
  	2765 - ROL_example_PDE-OPT_obstacle_example_01_MPI_4 (Failed)
  	2770 - ROL_example_PDE-OPT_nonlinear-elliptic_example_01_MPI_4 (Failed)
  	2771 - ROL_example_PDE-OPT_nonlinear-elliptic_example_02_MPI_4 (Failed)
  	2772 - ROL_example_PDE-OPT_topo-opt_poisson_example_01_MPI_4 (Failed)
  	2930 - PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Tri-Order-1 (Timeout)
  	2968 - TrilinosCouplings_Example_Maxwell_MueLu_MPI_1 (Failed)
  	2969 - TrilinosCouplings_Example_Maxwell_MueLu_MPI_4 (Failed)
  Errors while running CTest

We already know about those failing tests in issues #4042, #3749, #3542 (closed).

What is interesting is that the STK test in issue #3544 (closed) is not failing.
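
To match a failure list like the one above against known issues, it can help to group the failing tests by package. The sketch below is a rough heuristic, assuming the text before the first underscore of a test name identifies the package (this reports a subpackage prefix such as `PanzerAdaptersSTK` rather than `Panzer`). The function name `count_failures_by_prefix` is made up for illustration, and only four of the failing lines are reproduced here.

```shell
# Count failing tests per name prefix (text before the first underscore).
# Reads lines of the form "<id> - <test_name> (<status>)" on stdin.
count_failures_by_prefix() {
  awk '$2 == "-" { split($3, parts, "_"); count[parts[1]]++ }
       END { for (p in count) print p, count[p] }' | sort
}

count_failures_by_prefix <<'EOF'
  573 - Zoltan_ch_ewgt_zoltan_parallel (Failed)
  2629 - ROL_test_elementwise_TpetraMultiVector_MPI_4 (Failed)
  2763 - ROL_example_PDE-OPT_navier-stokes_example_01_MPI_4 (Timeout)
  2930 - PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Tri-Order-1 (Timeout)
EOF
```

Applied to the full list of 20 failures, the same grouping gives 13 ROL, 4 Zoltan, 2 TrilinosCouplings, and 1 PanzerAdaptersSTK test.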

I made the same changes on 'waterman' and am now testing there as well:

$ cd /home/rabartl/Trilinos.base/BUILDS/WATERMAN/CHECKIN/

$ bsub -x -Is -n 20 \
  ./ \
  cuda-9.2-rdc-release-debug \
  --enable-all-packages=on --local-do-all --wipe-clean

That fully passed with the result:

  passed: Trilinos/cuda-9.2-rdc-release-debug: passed=2030,notpassed=0
  Fri Mar 29 09:27:06 MDT 2019
  Enabled Packages: 
  Enabled all Packages
  Hostname: waterman1
  Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
  Build Dir: /home/rabartl/Trilinos.base/BUILDS/WATERMAN/CHECKIN/cuda-9.2-rdc-release-debug
  CMake Cache Varibles: -GNinja -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=NIGHTLY -DTrilinos_ALLOW_NO_PACKAG
  Make Options: -j 32
  CTest Options: -j 8
  Pull: Not Performed
  Configure: Passed (4.60 min)
  Build: Passed (190.97 min)
  Test: Passed (55.86 min)
  100% tests passed, 0 tests failed out of 2030
  Subproject Time Summary:
  Amesos2          = 126.73 sec*proc (8 tests)
  Anasazi          = 711.36 sec*proc (74 tests)
  Belos            = 867.57 sec*proc (100 tests)
  Ifpack2          = 960.22 sec*proc (45 tests)
  Intrepid2        = 1095.80 sec*proc (267 tests)
  Kokkos           = 269.76 sec*proc (27 tests)
  KokkosKernels    = 171.75 sec*proc (8 tests)
  MueLu            = 6511.60 sec*proc (110 tests)
  NOX              = 451.95 sec*proc (105 tests)
  Panzer           = 6610.07 sec*proc (154 tests)
  Phalanx          =  78.25 sec*proc (27 tests)
  Piro             =  42.59 sec*proc (12 tests)
  Rythmos          =  90.85 sec*proc (83 tests)
  SEACAS           =  22.77 sec*proc (20 tests)
  STK              =   8.57 sec*proc (4 tests)
  Sacado           = 322.10 sec*proc (300 tests)
  Stratimikos      =  50.45 sec*proc (39 tests)
  Teko             = 814.35 sec*proc (18 tests)
  Tempus           = 632.85 sec*proc (80 tests)
  Teuchos          = 187.10 sec*proc (137 tests)
  Thyra            = 147.04 sec*proc (82 tests)
  Tpetra           = 3683.06 sec*proc (201 tests)
  Xpetra           = 700.12 sec*proc (18 tests)
  Zoltan2          = 1951.14 sec*proc (111 tests)
  Total Test time (real) = 3351.41 sec
  Total time for cuda-9.2-rdc-release-debug = 251.43 min

So we can go ahead and promote that build once this merges!
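
As a sanity check on summaries like the ones above, the per-package test counts can be added up and compared with the total that CTest reports. This is a minimal sketch, assuming the "Subproject Time Summary" line format shown in these logs. The function name `sum_subproject_tests` is hypothetical, and only three lines from the first 'ride' run are reproduced.

```shell
# Sum the "(N tests)" counts from CTest subproject summary lines on stdin.
sum_subproject_tests() {
  awk -F'[()]' '/sec\*proc/ { split($2, f, " "); total += f[1] }
                END { print total + 0 }'
}

# Three lines copied from the first run on 'ride'.
sum_subproject_tests <<'EOF'
  Amesos2          =  67.44 sec*proc (8 tests)
  Anasazi          = 381.67 sec*proc (74 tests)
  Belos            = 529.80 sec*proc (100 tests)
EOF
```

Fed the complete summaries, the counts add up to the reported totals: 2042 and 2990 for the two 'ride' runs and 2030 for 'waterman'.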
