Stokhos tests failing in Trilinos-atdm-white-ride-cuda-9.2-debug-pt build
Created by: bartlettroscoe
CC: @trilinos/stokhos, @rppawlo (Trilinos Nonlinear Solvers Product Area Lead)
Next Action Status
PR #3741, merged on 10/26/2018, replaced STEQR with PTEQR in Stokhos. On 10/27/2018, all 84 Stokhos tests passed in the Trilinos-atdm-white-ride-cuda-9.2-debug-pt
build on 'ride', including the 66 tests that had previously been failing.
Description
The Stokhos package has 66 failing tests in the Trilinos-atdm-white-ride-cuda-9.2-debug-pt build
on 'white' and 'ride', as shown here. The failing tests are:
- Stokhos_AdaptivityToolsUnitTest_MPI_1
- Stokhos_AlgebraicExpansionUnitTest_MPI_1
- Stokhos_BasisInteractionGraphUnitTest_MPI_1
- Stokhos_division_example_MPI_1
- Stokhos_DivisionOperatorUnitTest_MPI_1
- Stokhos_ExponentialRandomFieldUnitTest_MPI_1
- Stokhos_GramSchmidtUnitTest_MPI_1
- Stokhos_hermite_example_MPI_1
- Stokhos_HermiteBasisUnitTest_MPI_1
- Stokhos_InterlacedMapUnitTest_MPI_2
- Stokhos_InterlacedOpUnitTest_MPI_2
- Stokhos_JacobiBasisUnitTest_MPI_1
- Stokhos_KokkosArrayKernelsUnitTest_Cuda_MPI_1
- Stokhos_KokkosArrayKernelsUnitTest_Serial_MPI_1
- Stokhos_KokkosCrsMatrixUQPCEUnitTest_Cuda_MPI_1
- Stokhos_KokkosCrsMatrixUQPCEUnitTest_Serial_MPI_1
- Stokhos_KokkosViewUQPCEUnitTest_Cuda_MPI_1
- Stokhos_KokkosViewUQPCEUnitTest_Serial_MPI_1
- Stokhos_LanczosUnitTest_MPI_1
- Stokhos_LegendreBasisUnitTest_MPI_1
- Stokhos_LexicographicTreeBasisUnitTest_MPI_1
- Stokhos_Linear2D_Diffusion_CG_AGS_MPI_2
- Stokhos_Linear2D_Diffusion_GMRES_AGS_MPI_2
- Stokhos_Linear2D_Diffusion_GMRES_AJ_MPI_2
- Stokhos_Linear2D_Diffusion_GMRES_FA_MPI_2
- Stokhos_Linear2D_Diffusion_GMRES_GS_MPI_2
- Stokhos_Linear2D_Diffusion_GMRES_KL_MPI_2
- Stokhos_Linear2D_Diffusion_GMRES_KLR_MPI_2
- Stokhos_Linear2D_Diffusion_GMRES_KP_MPI_2
- Stokhos_Linear2D_Diffusion_GMRES_Mean_Based_MPI_2
- Stokhos_Linear2D_Diffusion_GS_MPI_2
- Stokhos_Linear2D_Diffusion_GSLN_MPI_2
- Stokhos_Linear2D_Diffusion_JA_MPI_2
- Stokhos_Linear2D_Diffusion_LN_MPI_2
- Stokhos_Linear2D_Diffusion_PCE_Example_MPI_2
- Stokhos_Linear2D_Diffusion_PCE_Interlaced_Example_MPI_2
- Stokhos_Linear2D_Diffusion_PCE_NOX_Example_MPI_2
- Stokhos_LogNormalUnitTest_MPI_1
- Stokhos_MatrixFreeOperatorUnitTest_MPI_1
- Stokhos_NormalizedHermiteBasisUnitTest_MPI_1
- Stokhos_NormalizedLegendreBasisUnitTest_MPI_1
- Stokhos_nox_example_MPI_1
- Stokhos_ProductBasisUtilsUnitTest_MPI_1
- Stokhos_QuadExpansionUnitTest_MPI_1
- Stokhos_QuadraturePseudoSpectralExpansionUnitTest_MPI_1
- Stokhos_sacado_ensemble_example_MPI_1
- Stokhos_sacado_example_MPI_1
- Stokhos_SacadoETPCEUnitTest_MPI_1
- Stokhos_SacadoPCECommTests_MPI_1
- Stokhos_SacadoPCESerializationTests_MPI_1
- Stokhos_SacadoPCEUnitTest_MPI_1
- Stokhos_SacadoUQPCECommTests_MPI_1
- Stokhos_SacadoUQPCESerializationTests_MPI_1
- Stokhos_SacadoUQPCEUnitTest_MPI_1
- Stokhos_SmolyakBasisUnitTest_MPI_1
- Stokhos_SmolyakPseudoSpectralExpansionUnitTest_MPI_1
- Stokhos_Sparse3TensorUnitTest_MPI_1
- Stokhos_SparseGridQuadratureUnitTest_MPI_1
- Stokhos_StieltjesUnitTest_MPI_1
- Stokhos_TensorProductBasisUnitTest_MPI_1
- Stokhos_TensorProductPseudoSpectralExpansionUnitTest_MPI_1
- Stokhos_TensorProductPseudoSpectralOperatorUnitTest_MPI_1
- Stokhos_TotalOrderBasisUnitTest_MPI_1
- Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4
- Stokhos_TpetraCrsMatrixUQPCEUnitTest_Serial_MPI_4
- Stokhos_uq_handbook_nonlinear_sg_example_MPI_1
The detailed output for the first failing test, Stokhos_AdaptivityToolsUnitTest_MPI_1
(shown here), shows:
Sorting tests by group name then by the order they were added ... (time = 7.44e-06)
Running unit tests ...
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node white27 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
A spot check of the output of several of the other tests shows segfaults like the one above in every case I looked at.
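Rather than spot-checking, one could grep saved test output for the mpiexec segfault message quoted above. The following is only a sketch: it assumes each test's output has been saved to its own `.log` file in a single directory, and the function name `list_segfault_logs` and that directory layout are hypothetical, not something Trilinos or CTest provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of Trilinos): list the log files in a
# directory that contain the mpiexec segfault message quoted above.
# Assumption: each test's output was saved as <log_dir>/<test_name>.log.
list_segfault_logs() {
  log_dir="$1"
  # -l prints only the names of files that match;
  # -F treats the pattern as a fixed string rather than a regex.
  grep -l -F "exited on signal 11 (Segmentation fault)" "$log_dir"/*.log 2>/dev/null
}
```

For example, `list_segfault_logs ./stokhos_logs | wc -l` would count the matching logs, where `./stokhos_logs` is a placeholder directory name.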
This is an important build because we are targeting it on 'white' and 'ride' as a Trilinos PR testing build (see #2464 (closed)).
Steps to reproduce
One should be able to reproduce these test failures on either 'white' or 'ride' by cloning the Trilinos git repo, checking out the 'develop' branch, creating a build directory, and then running:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stokhos=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
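To look at a single failure instead of the whole suite, the same `bsub` invocation can be narrowed to one test. This is a sketch under assumptions: the helper name `rerun_stokhos_test` is hypothetical, the `bsub` options are copied unchanged from the command above, and it is meant to be run from the build directory after the build has finished. The `ctest` options `-R` (select tests by regex) and `--output-on-failure` are standard.

```shell
#!/bin/sh
# Hypothetical helper: rerun one test by exact name on a compute node.
# Assumption: the current directory is the configured and built build dir,
# and the bsub options from the steps above are still appropriate.
rerun_stokhos_test() {
  test_name="$1"
  # ^...$ anchors the regex so only the exactly named test is selected;
  # --output-on-failure prints the test's output if it fails.
  bsub -x -Is -q rhel7F -n 16 ctest -R "^${test_name}\$" --output-on-failure
}
```

For example: `rerun_stokhos_test Stokhos_AdaptivityToolsUnitTest_MPI_1`.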