Switch to new devpack/20180521/openmpi/2.1.2/gcc/7.2.0/cuda/9.2.88 env on 'white' and 'ride' (#3290)

Created by: bartlettroscoe

CC: @trilinos/seacas, @fryeguy52

Description

This changes 'white' and 'ride' to use a new, consistent GCC 7.2.0 + OpenMPI 2.1.2 + CUDA 9.2 + TPLs env. Before this, on 'white' we were using OpenMPI 2.1.2 to compile and link libraries and executables in Trilinos but were linking against TPLs built with OpenMPI 3.1.0 (see #3290 (closed)). This new set of modules uses OpenMPI 2.1.2 for the TPLs as well to create a 100% consistent env.

@nmhamster noted that he had to reduce the compiler optimization level of HDF5 from -O3 to -O2; otherwise a couple of SEACAS tests fail, as described in #3288 (closed). (So the problem in #3288 (closed) was not the NetCDF configuration but the HDF5 compiler options.)

Motivation and Context

Using a mixture of OpenMPI 2.1.2 and OpenMPI 3.1.0 on 'white' was not good. And we need to use OpenMPI 2.1.2 to avoid problems with MPI handling of CUDA memory impacting Ifpack2 (see #3290 (closed)). Also, we need the TPLs built so that the SEACAS tests pass, as described above and in #3288 (closed).

How Has This Been Tested?

I tested this on 'white' by running:

$ cd ~/Trilinos.base/BUILD/WHITE/CHECKIN/

$ bsub -x -I -q rhel7F -n 16 \
  ./checkin-test-atdm.sh gnu-opt-openmp-Power8 cuda-9.2-opt-Power8-Kepler37 \
  --enable-all-packages=on --local-do-all

That gave the test results ...

gnu-opt-openmp-Power8:

  99% tests passed, 4 tests failed out of 2259
  
  Subproject Time Summary:
  Amesos                    =  34.52 sec*proc (13 tests)
  Amesos2                   =  16.71 sec*proc (8 tests)
  Anasazi                   = 215.21 sec*proc (74 tests)
  AztecOO                   =  32.36 sec*proc (17 tests)
  Belos                     = 247.43 sec*proc (83 tests)
  Domi                      = 199.80 sec*proc (125 tests)
  Epetra                    = 104.40 sec*proc (63 tests)
  EpetraExt                 =  31.64 sec*proc (10 tests)
  Galeri                    =   8.90 sec*proc (9 tests)
  Ifpack                    = 117.91 sec*proc (48 tests)
  Ifpack2                   =  76.19 sec*proc (36 tests)
  Intrepid2                 = 261.37 sec*proc (248 tests)
  Kokkos                    = 251.78 sec*proc (27 tests)
  KokkosKernels             = 433.57 sec*proc (8 tests)
  Komplex                   =   2.34 sec*proc (1 test)
  ML                        =  90.20 sec*proc (34 tests)
  MueLu                     = 422.49 sec*proc (52 tests)
  NOX                       = 218.62 sec*proc (105 tests)
  Panzer                    = 1254.57 sec*proc (159 tests)
  Phalanx                   =  27.17 sec*proc (27 tests)
  Pike                      =   3.71 sec*proc (7 tests)
  Piro                      =  26.25 sec*proc (11 tests)
  Pliris                    =   4.81 sec*proc (2 tests)
  RTOp                      =  27.72 sec*proc (24 tests)
  Rythmos                   =  76.80 sec*proc (83 tests)
  SEACAS                    =  21.90 sec*proc (20 tests)
  STK                       =   1.20 sec*proc (1 test)
  Sacado                    = 183.44 sec*proc (297 tests)
  Shards                    =   1.46 sec*proc (4 tests)
  Stratimikos               =  44.43 sec*proc (39 tests)
  Teko                      =  66.47 sec*proc (19 tests)
  Tempus                    = 514.51 sec*proc (54 tests)
  Teuchos                   = 190.46 sec*proc (135 tests)
  Thyra                     = 112.25 sec*proc (81 tests)
  Tpetra                    = 325.77 sec*proc (174 tests)
  TrilinosCouplings         =   4.45 sec*proc (1 test)
  TrilinosFrameworkTests    =   6.10 sec*proc (4 tests)
  Triutils                  =   4.34 sec*proc (2 tests)
  Xpetra                    =  62.20 sec*proc (18 tests)
  Zoltan                    = 3368.46 sec*proc (35 tests)
  Zoltan2                   = 255.36 sec*proc (101 tests)
  
  Total Test time (real) = 668.82 sec
  
  The following tests FAILED:
  	566 - Zoltan_ch_ewgt_zoltan_parallel (Failed)
  	567 - Zoltan_ch_grid20x19_zoltan_parallel (Failed)
  	571 - Zoltan_ch_nograph_zoltan_parallel (Failed)
  	574 - Zoltan_ch_simple_zoltan_parallel (Failed)
  Errors while running CTest

cuda-9.2-opt-Power8-Kepler37:

  99% tests passed, 8 tests failed out of 2264
  
  Subproject Time Summary:
  Amesos                    =  28.89 sec*proc (13 tests)
  Amesos2                   =  47.96 sec*proc (8 tests)
  Anasazi                   = 356.47 sec*proc (74 tests)
  AztecOO                   =  30.41 sec*proc (17 tests)
  Belos                     = 349.99 sec*proc (81 tests)
  Domi                      = 417.10 sec*proc (125 tests)
  Epetra                    = 120.54 sec*proc (63 tests)
  EpetraExt                 =  30.12 sec*proc (10 tests)
  Galeri                    =  14.84 sec*proc (9 tests)
  Ifpack                    = 109.28 sec*proc (48 tests)
  Ifpack2                   = 480.91 sec*proc (36 tests)
  Intrepid2                 = 912.53 sec*proc (255 tests)
  Kokkos                    = 776.72 sec*proc (27 tests)
  KokkosKernels             = 287.47 sec*proc (8 tests)
  Komplex                   =   2.41 sec*proc (1 test)
  ML                        =  84.72 sec*proc (34 tests)
  MueLu                     = 3228.43 sec*proc (50 tests)
  NOX                       = 439.74 sec*proc (105 tests)
  Panzer                    = 6489.05 sec*proc (156 tests)
  Phalanx                   =  52.69 sec*proc (27 tests)
  Pike                      =  11.69 sec*proc (7 tests)
  Piro                      =  23.42 sec*proc (11 tests)
  Pliris                    =  14.03 sec*proc (2 tests)
  RTOp                      =  27.86 sec*proc (24 tests)
  Rythmos                   = 131.34 sec*proc (83 tests)
  SEACAS                    =  29.45 sec*proc (20 tests)
  STK                       =   1.64 sec*proc (1 test)
  Sacado                    = 230.45 sec*proc (300 tests)
  Shards                    =   1.61 sec*proc (4 tests)
  Stratimikos               =  68.10 sec*proc (39 tests)
  Teko                      = 659.03 sec*proc (19 tests)
  Tempus                    = 883.42 sec*proc (54 tests)
  Teuchos                   = 236.24 sec*proc (135 tests)
  Thyra                     = 185.69 sec*proc (81 tests)
  Tpetra                    = 2127.52 sec*proc (176 tests)
  TrilinosCouplings         =   4.47 sec*proc (1 test)
  TrilinosFrameworkTests    =  10.87 sec*proc (4 tests)
  Triutils                  =   6.44 sec*proc (2 tests)
  Xpetra                    = 235.55 sec*proc (18 tests)
  Zoltan                    = 4126.41 sec*proc (35 tests)
  Zoltan2                   = 634.72 sec*proc (101 tests)
  
  Total Test time (real) = 3047.83 sec
  
  The following tests FAILED:
  	  6 - KokkosCore_UnitTest_Cuda_MPI_1 (Timeout)
  	569 - Zoltan_ch_ewgt_zoltan_parallel (Failed)
  	570 - Zoltan_ch_grid20x19_zoltan_parallel (Failed)
  	574 - Zoltan_ch_nograph_zoltan_parallel (Failed)
  	577 - Zoltan_ch_simple_zoltan_parallel (Failed)
  	1013 - Pliris_vector_random_MPI_3 (Failed)
  	1014 - Pliris_vector_random_MPI_4 (Failed)
  	2237 - PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 (Timeout)
  Errors while running CTest

Note that we don't run Zoltan or Pliris tests as part of the current ATDM Trilinos builds. I just use --enable-all-packages=on as a shortcut. (I will set up the ATDM Trilinos build to automatically disable tests in the non-tested packages so that we can use this shortcut in the future.)
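For reviewers who want to see which failures matter for the ATDM builds, here is a minimal Python sketch (not part of this PR or of the ATDM scripts) that parses the CUDA failure list above and drops tests from the packages that are not tested. The helper names and the `Zoltan_`/`Pliris_` prefix filter are assumptions made for illustration; the trailing underscore keeps Zoltan2 tests from being filtered out.

```python
import re

# Failure list copied from the cuda-9.2-opt-Power8-Kepler37 results above.
CTEST_FAILURES = """\
  6 - KokkosCore_UnitTest_Cuda_MPI_1 (Timeout)
569 - Zoltan_ch_ewgt_zoltan_parallel (Failed)
570 - Zoltan_ch_grid20x19_zoltan_parallel (Failed)
574 - Zoltan_ch_nograph_zoltan_parallel (Failed)
577 - Zoltan_ch_simple_zoltan_parallel (Failed)
1013 - Pliris_vector_random_MPI_3 (Failed)
1014 - Pliris_vector_random_MPI_4 (Failed)
2237 - PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 (Timeout)
"""

# Assumption: test-name prefixes of the packages not tested in ATDM builds.
UNTESTED_PREFIXES = ("Zoltan_", "Pliris_")

# Matches lines of the form "<number> - <test name> (<status>)".
FAILURE_LINE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\((\w+)\)\s*$")


def parse_failures(text):
    """Return (number, name, status) tuples for each failure line in text."""
    failures = []
    for line in text.splitlines():
        match = FAILURE_LINE.match(line)
        if match:
            failures.append((int(match.group(1)), match.group(2), match.group(3)))
    return failures


def relevant_failures(failures, prefixes=UNTESTED_PREFIXES):
    """Drop failures whose test name starts with an untested-package prefix."""
    return [f for f in failures if not f[1].startswith(prefixes)]


if __name__ == "__main__":
    for number, name, status in relevant_failures(parse_failures(CTEST_FAILURES)):
        print(f"{number:5d}  {name}  ({status})")
```

Applied to the list above, this leaves only the two timeouts (KokkosCore and Panzer) discussed next.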

As for the timed-out KokkosCore and Panzer tests, I ran them by themselves and they both passed just fine, as shown with:

$ cd cuda-9.2-opt-Power8-Kepler37

$ . load-env.sh 
Hostname 'white11' matches known ATDM host 'white' and system 'ride'
ATDM_CONFIG_TRILNOS_DIR = /home/rabartl/Trilinos.base/Trilinos
Setting default compiler and build options for ATDM_CONFIG_JOB_NAME='cuda-9.2-opt-Power8-Kepler37'
Using white/ride compiler stack CUDA-9.2 to build RELEASE code with Kokkos node type CUDA and KOKKOS_ARCH=Power8,Kepler37

$ cd packages/panzer/

$ bsub -x -I -q rhel7F -n 16 ctest -R PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1
...
    Start 137: PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1
1/1 Test #137: PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 ...   Passed  129.70 sec

100% tests passed, 0 tests failed out of 1

Label Time Summary:
Panzer    = 129.70 sec*proc (1 test)

Total Test time (real) = 129.86 sec

$ cd ../kokkos

$ bsub -x -I -q rhel7F -n 16 ctest -R KokkosCore_UnitTest_Cuda_MPI_1
...
    Start 2: KokkosCore_UnitTest_Cuda_MPI_1
1/1 Test #2: KokkosCore_UnitTest_Cuda_MPI_1 ...   Passed  117.60 sec

100% tests passed, 0 tests failed out of 1

Label Time Summary:
Kokkos    = 117.60 sec*proc (1 test)

Total Test time (real) = 117.64 sec

We know that these tests run on top of each other and all run on the same GPU so increased runtimes are expected. Hopefully the nightly drivers will not show any timeouts. If they do, we will deal with them.
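If timeouts do show up again, the isolated reruns above could be scripted roughly as follows. This is a sketch under stated assumptions, not something the ATDM drivers do: `ctest_rerun_args` is a hypothetical helper, and it only relies on `ctest -R <regex>`, which is the same option used in the manual reruns above. Test names are restricted to letters, digits, and underscores so that no regex escaping is needed.

```python
import re
import shlex

# Only accept names that need no escaping inside a regex.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")

# The two tests that timed out in the cuda-9.2-opt-Power8-Kepler37 run above.
TIMED_OUT = [
    "KokkosCore_UnitTest_Cuda_MPI_1",
    "PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1",
]


def ctest_rerun_args(test_names, launcher=()):
    """Build an argument list that reruns exactly the named tests.

    `launcher` is an optional command prefix (hypothetical parameter),
    e.g. the batch-system command used to get onto a compute node.
    """
    if not test_names:
        raise ValueError("no test names given")
    for name in test_names:
        if not SAFE_NAME.fullmatch(name):
            raise ValueError(f"unsupported character in test name: {name!r}")
    # Anchor the alternation so that only exact test names match.
    pattern = "^(" + "|".join(test_names) + ")$"
    return [*launcher, "ctest", "-R", pattern]


if __name__ == "__main__":
    # Print a shell-quoted command line; pass the list to subprocess.run to execute.
    print(shlex.join(ctest_rerun_args(TIMED_OUT)))
```

On 'white', the launcher would be the `bsub -x -I -q rhel7F -n 16` prefix used above. Note that the manual reruns were done from the individual package directories; running one combined command from the top of the build directory is an assumption of this sketch.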

Checklist

  • My commit messages mention the appropriate GitHub issue numbers.
  • All new and existing tests passed.
