Trilinos issue #3544
Created Oct 02, 2018 by James Willenbring (@jmwille, Maintainer)

STK tests failing in targeted CUDA PR builds Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt

Created by: bartlettroscoe

CC: @trilinos/stk , @kddevin (Trilinos Data Services Product Area Lead)

Next Action Status

Description

The STK package has 3 failing tests in the build Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt on 'white' and 'ride', as shown here. The failing tests are:

  • STKUnit_tests_stk_mesh_unit_tests_MPI_4
  • STKUnit_tests_stk_ngp_unit_tests_MPI_4
  • STKUnit_tests_stk_tools_unit_tests_MPI_4

The output for the test STKUnit_tests_stk_mesh_unit_tests_MPI_4 (shown here) shows the failure:

*** Starting test BulkData.testFieldComm from UnitTestBulkData.cpp:2177
[white27:4867] *** An error occurred in MPI_Neighbor_alltoallv
[white27:4867] *** reported by process [2079326209,2]
[white27:4867] *** on communicator MPI COMMUNICATOR 3 CREATE FROM 0
[white27:4867] *** MPI_ERR_ARG: invalid argument of some other kind
[white27:4867] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[white27:4867] ***    and potentially your MPI job)
[white27:04859] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[white27:04859] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The output for the test STKUnit_tests_stk_ngp_unit_tests_MPI_4 (shown here) shows:

*** Starting test NGP_Kokkos.bucket0 from KokkosBulkDataBucketCentroidCalculation.cpp:230
[       OK ] NGP_Kokkos.bucket0 (7002379918061956201 ms)
*** Starting test NGP_Kokkos.calculate_centroid_field_with_gather_on_device_bucket from KokkosBulkDataBucketCentroidCalculation.cpp:235
[white27:4914] *** An error occurred in MPI_Neighbor_alltoallv
[white27:4914] *** reported by process [2048983041,2]
[white27:4914] *** on communicator MPI COMMUNICATOR 3 CREATE FROM 0
[white27:4914] *** MPI_ERR_ARG: invalid argument of some other kind
[white27:4914] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[white27:4914] ***    and potentially your MPI job)
[white27:04906] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[white27:04906] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Finally, the output for the test STKUnit_tests_stk_tools_unit_tests_MPI_4 shown here shows:

*** Starting test CloningMesh.cloningBulkDataNoAura_selector from UnitTestBulkDataClone.cpp:418
[       OK ] CloningMesh.cloningBulkDataNoAura_selector (0 ms)
*** Starting test CloningMesh.cloningBulkDataWithAura_selector from UnitTestBulkDataClone.cpp:427
[white27:5053] *** An error occurred in MPI_Neighbor_alltoallv
[white27:5053] *** reported by process [2059272193,2]
[white27:5053] *** on communicator MPI COMMUNICATOR 3 CREATE FROM 0
[white27:5053] *** MPI_ERR_ARG: invalid argument of some other kind
[white27:5053] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[white27:5053] ***    and potentially your MPI job)
[white27:05045] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[white27:05045] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Therefore, it seems all of these tests are failing due to errors in the function MPI_Neighbor_alltoallv().
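
To check that claim across many test logs without reading each one by hand, something like the following could be used. This is only a rough sketch: the function name `summarize_mpi_failures` and the regular expressions are my own, written against the Open MPI abort messages quoted above, and they may need adjusting for other MPI implementations or output formats. The sample log is an excerpt of the STKUnit_tests_stk_mesh_unit_tests_MPI_4 output from this issue.

```python
import re

# Patterns are assumptions based on the Open MPI abort messages quoted in this issue.
TEST_RE = re.compile(r"^\*\*\* Starting test (\S+) from (\S+):(\d+)")
FUNC_RE = re.compile(r"\*\*\* An error occurred in (\w+)")
CLASS_RE = re.compile(r"\*\*\* (MPI_ERR_\w+):")


def summarize_mpi_failures(log_text):
    """Return (test_name, mpi_function, error_class) for each MPI abort in the log."""
    failures = []
    current_test = None  # most recent "Starting test" line seen so far
    pending = None       # latest abort record still waiting for its error class
    for line in log_text.splitlines():
        match = TEST_RE.search(line)
        if match:
            current_test = match.group(1)
            continue
        match = FUNC_RE.search(line)
        if match:
            pending = [current_test, match.group(1), None]
            failures.append(pending)
            continue
        match = CLASS_RE.search(line)
        if match and pending is not None and pending[2] is None:
            pending[2] = match.group(1)
    return [tuple(record) for record in failures]


# Excerpt of the stk_mesh test output quoted in this issue.
SAMPLE_LOG = """\
*** Starting test BulkData.testFieldComm from UnitTestBulkData.cpp:2177
[white27:4867] *** An error occurred in MPI_Neighbor_alltoallv
[white27:4867] *** reported by process [2079326209,2]
[white27:4867] *** on communicator MPI COMMUNICATOR 3 CREATE FROM 0
[white27:4867] *** MPI_ERR_ARG: invalid argument of some other kind
[white27:4867] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[white27:4867] ***    and potentially your MPI job)
"""

failures = summarize_mpi_failures(SAMPLE_LOG)
for test_name, mpi_function, error_class in failures:
    print(f"{test_name}: {mpi_function} -> {error_class}")
```

Running the same function over the other two logs should report MPI_Neighbor_alltoallv with MPI_ERR_ARG in each case if the diagnosis above is right.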

This is an important build because we are targeting this build on 'white' and 'ride' as a Trilinos PR testing build (see #2464 (closed)). Also, EMPIRE and SPARC use STK. However, SPARC and EMPIRE do not enable the STKUnit_tests package that has these tests. Therefore, from the ATDM perspective, we could disable these tests for ATDM. But since one would assume these tests check the correct functioning of STK, it would be good to get them fixed.

Steps to reproduce

One should be able to reproduce these test failures on either 'white' or 'ride' by cloning the Trilinos git repo, checking out the 'develop' branch, creating a build directory, and then doing:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-release-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_STK=ON \
  $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j16