STK tests failing in targeted CUDA PR builds Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt
Created by: bartlettroscoe
CC: @trilinos/stk , @kddevin (Trilinos Data Services Product Area Lead)
Next Action Status
Description
The STK package has 3 failing tests in the build Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt
on 'white' and 'ride', as shown here:
- STKUnit_tests_stk_mesh_unit_tests_MPI_4
- STKUnit_tests_stk_ngp_unit_tests_MPI_4
- STKUnit_tests_stk_tools_unit_tests_MPI_4
The output for the test STKUnit_tests_stk_mesh_unit_tests_MPI_4,
shown here, shows the failure:
*** Starting test BulkData.testFieldComm from UnitTestBulkData.cpp:2177
[white27:4867] *** An error occurred in MPI_Neighbor_alltoallv
[white27:4867] *** reported by process [2079326209,2]
[white27:4867] *** on communicator MPI COMMUNICATOR 3 CREATE FROM 0
[white27:4867] *** MPI_ERR_ARG: invalid argument of some other kind
[white27:4867] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[white27:4867] *** and potentially your MPI job)
[white27:04859] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[white27:04859] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
The output for the test STKUnit_tests_stk_ngp_unit_tests_MPI_4,
shown here, shows:
*** Starting test NGP_Kokkos.bucket0 from KokkosBulkDataBucketCentroidCalculation.cpp:230
[ OK ] NGP_Kokkos.bucket0 (7002379918061956201 ms)
*** Starting test NGP_Kokkos.calculate_centroid_field_with_gather_on_device_bucket from KokkosBulkDataBucketCentroidCalculation.cpp:235
[white27:4914] *** An error occurred in MPI_Neighbor_alltoallv
[white27:4914] *** reported by process [2048983041,2]
[white27:4914] *** on communicator MPI COMMUNICATOR 3 CREATE FROM 0
[white27:4914] *** MPI_ERR_ARG: invalid argument of some other kind
[white27:4914] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[white27:4914] *** and potentially your MPI job)
[white27:04906] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[white27:04906] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Finally, the output for the test STKUnit_tests_stk_tools_unit_tests_MPI_4,
shown here, shows:
*** Starting test CloningMesh.cloningBulkDataNoAura_selector from UnitTestBulkDataClone.cpp:418
[ OK ] CloningMesh.cloningBulkDataNoAura_selector (0 ms)
*** Starting test CloningMesh.cloningBulkDataWithAura_selector from UnitTestBulkDataClone.cpp:427
[white27:5053] *** An error occurred in MPI_Neighbor_alltoallv
[white27:5053] *** reported by process [2059272193,2]
[white27:5053] *** on communicator MPI COMMUNICATOR 3 CREATE FROM 0
[white27:5053] *** MPI_ERR_ARG: invalid argument of some other kind
[white27:5053] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[white27:5053] *** and potentially your MPI job)
[white27:05045] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[white27:05045] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Therefore, it seems all of these tests are failing due to errors in the function MPI_Neighbor_alltoallv().
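As a quick cross-check of that observation, the failing MPI routine can be pulled out of the test output mechanically. The sketch below works on lines copied from the three outputs quoted above; the variable names are only illustrative, and in practice one would probably feed in the saved test logs instead of a here-document.

```shell
# Error lines copied from the three test outputs quoted in this issue.
log_excerpt=$(cat <<'EOF'
[white27:4867] *** An error occurred in MPI_Neighbor_alltoallv
[white27:4914] *** An error occurred in MPI_Neighbor_alltoallv
[white27:5053] *** An error occurred in MPI_Neighbor_alltoallv
EOF
)

# Extract the MPI routine named on each error line and list the distinct ones.
failing_routines=$(printf '%s\n' "$log_excerpt" \
  | sed -n 's/.*An error occurred in \(MPI_[A-Za-z_]*\).*/\1/p' \
  | sort -u)

printf '%s\n' "$failing_routines"
```

A single line of output here means every quoted failure comes from the same MPI call.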
This is an important build because we are targeting this build on 'white' and 'ride' as a Trilinos PR testing build (see #2464 (closed) ). Also, EMPIRE and SPARC use STK. However, SPARC and EMPIRE do not enable the STKUnit_tests
package that has these tests. Therefore, from the ATDM perspective, we could disable these tests for ATDM. But since one would assume these tests check the correct functioning of STK, it would be good to get these fixed.
Steps to reproduce
One should be able to reproduce these test failures on either 'white' or 'ride' by cloning the Trilinos git repo, checking out the 'develop' branch, creating a build directory, and then doing:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-release-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_STK=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
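To iterate on just the three failing tests instead of the full suite, something like the following may help. It is a sketch, not part of the Trilinos ATDM scripts: the helper name `rerun_failing_stk_tests` and the variable `FAILING_STK_TESTS_REGEX` are hypothetical, and it assumes the same queue and `bsub` options as above and that the job inherits the submitting shell's environment. It uses the standard `ctest -R` regex filter, and it sets the MCA parameter from the log hint through Open MPI's `OMPI_MCA_` environment-variable prefix.

```shell
# Regex matching only the three failing tests named in this issue.
FAILING_STK_TESTS_REGEX='^STKUnit_tests_stk_(mesh|ngp|tools)_unit_tests_MPI_4$'

# Hypothetical helper (not part of Trilinos): rerun just those tests.
rerun_failing_stk_tests() {
  # Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables.
  # Setting this one to 0 turns off aggregation of help/error messages,
  # as suggested in the log output quoted in this issue.
  export OMPI_MCA_orte_base_help_aggregate=0
  bsub -x -Is -q rhel7F -n 16 \
    ctest -R "$FAILING_STK_TESTS_REGEX" --output-on-failure
}
```

After loading the environment and building as above, calling `rerun_failing_stk_tests` from the build directory should run only these three tests and print their full output on failure.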