
Closed
Created Mar 16, 2018 by James Willenbring@jmwilleOwner

Address failures on ride/white about BYSOCKET:OVERSUBSCRIBE failures

Created by: bartlettroscoe

CC: @fryeguy52, @nmhamster

Next Action Status

Changed the mpiexec options from "--map-by socket:PE=8 --oversubscribe" to "--map-by socket:PE=4". All of these errors went away in the 'ride' builds on 3/17/2018.

Description:

There are several tests on 'ride' and 'white' that are failing in our automated ATDM builds being posted to:

  • https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&date=2018-03-16&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=63&value1=-atdm-&field2=buildname&compare2=63&value2=white-ride

like the test TeuchosComm_Comm_test_MPI_4 run on ride in the build Trilinos-atdm-white-ride-gnu-opt-openmp shown at:

  • https://testing.sandia.gov/cdash/testDetails.php?test=45490082&build=3441246

that shows the failure:

--------------------------------------------------------------------------
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that has less cpus than
requested ones:

  #cpus-per-proc:  8
  number of cpus:  7
  map-by:          BYSOCKET:OVERSUBSCRIBE

Please specify a mapping level that has more cpus, or else let us
define a default mapping that will allow multiple cpus-per-proc.
--------------------------------------------------------------------------
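Reading the message literally, the run is rejected because the requested cpus-per-proc (8, from `PE=8`) exceeds the number of cpus available in the object being mapped to (7). The following is a minimal Python sketch of that comparison only. It is not Open MPI's actual logic; the helper names `parse_pe` and `pe_fits`, the modifier parsing, and the default of 1 when no `PE` modifier is given are all assumptions made for illustration.

```python
import re


def parse_pe(map_by: str) -> int:
    """Extract the PE=<n> modifier from a map-by value such as 'socket:PE=8'.

    Assumption for this sketch: modifiers follow the object name and are
    separated by ':' or ','. Returns 1 when no PE modifier is present.
    """
    for part in re.split(r"[:,]", map_by)[1:]:
        key, _, value = part.partition("=")
        if key.strip().upper() == "PE":
            return int(value)
    return 1


def pe_fits(map_by: str, cpus_in_object: int) -> bool:
    """Return True if the requested cpus-per-proc fits in the mapping object."""
    return parse_pe(map_by) <= cpus_in_object


# Values taken from the error message above: 8 cpus requested, 7 available.
print(pe_fits("socket:PE=8", 7))  # False
# A smaller request such as PE=4 fits in the same object.
print(pe_fits("socket:PE=4", 7))  # True
```

Under this reading, any `PE` value of 7 or less would satisfy the check on a node that reports 7 cpus for the mapping object.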

I looked at most of the failing Teuchos tests on 'ride' as shown at:

  • https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-03-16&filtercount=4&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=status&compare2=62&value2=Passed&field3=testname&compare3=65&value3=Teuchos&field4=site&compare4=61&value4=ride

and many of them (but not all) show this same error message.

The mpiexec command on this system is run using:

<full-path>/mpiexec \"-np\" \"4\" \"-map-by\" \"socket:PE=8\" \"--oversubscribe\" \"/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-gnu-opt-openmp/SRC_AND_BUILD/BUILD/packages/teuchos/comm/test/Comm/TeuchosComm_Comm_test.exe\

This set of mpiexec options -map-by;socket:PE=8;--oversubscribe was taken from the EMPIRE configuration for Trilinos and it seems to work with the Panzer test suite as shown by that same build yesterday:

  • https://testing-vm.sandia.gov/cdash/viewTest.php?buildid=3369707
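To make the relationship between the semicolon-separated option list and the mpiexec command line shown above concrete, here is a small Python sketch that expands such a list into an argument vector. The function name `build_mpiexec_argv` is hypothetical, and this is only an illustration of the expansion; it is not how the Trilinos CMake/CTest machinery actually builds the command. The `<full-path>` placeholder and the executable name are copied from the issue text.

```python
def build_mpiexec_argv(mpiexec, num_procs, extra_flags, exe):
    """Expand a semicolon-separated flag list into an mpiexec argv list.

    extra_flags is a string such as '-map-by;socket:PE=8;--oversubscribe'.
    Empty entries (for example from a trailing ';') are dropped.
    """
    flags = [flag for flag in extra_flags.split(";") if flag]
    return [mpiexec, "-np", str(num_procs), *flags, exe]


argv = build_mpiexec_argv(
    "<full-path>/mpiexec",  # placeholder path, as written in the issue
    4,
    "-map-by;socket:PE=8;--oversubscribe",
    "TeuchosComm_Comm_test.exe",
)
print(argv)
```

Each list entry becomes one separately quoted argument, which matches the shape of the quoted command in the test output.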

Is this set of options the cause of the problem and, if so, how do we fix it?

Steps to Reproduce

Following the instructions at:

  • https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite

I tried to reproduce these failures on 'ride' using:

$ bsub -x -I -q rhel7F ./checkin-test-atdm.sh gnu-opt-openmp --enable-packages=Teuchos --local-do-all

and I got:

FAILED (NOT READY TO PUSH): Trilinos: ride12

Fri Mar 16 09:38:43 MDT 2018

Enabled Packages: Teuchos

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) gnu-opt-openmp => FAILED: passed=130,notpassed=1 => Not ready to push! (7.11 min)

The only failing test was:

  99% tests passed, 1 tests failed out of 131
  
  Subproject Time Summary:
  Teuchos    = 434.10 sec*proc (131 tests)
  
  Total Test time (real) =  31.92 sec
  
  The following tests FAILED:
         49 - TeuchosCore_TypeConversions_UnitTest_MPI_1 (Failed)
  Errors while running CTest

(which I will create another GitHub issue for).

Therefore, I can't seem to reproduce this error locally, so we may have to guess at how to fix this and let the nightly builds run.
