Address BYSOCKET:OVERSUBSCRIBE test failures on ride/white
Created by: bartlettroscoe
CC: @fryeguy52, @nmhamster
Next Action Status
Changed the mpiexec options from "--map-by socket:PE=8 --oversubscribe" to "--map-by socket:PE=4". All of these errors went away in the 'ride' builds on 3/17/2018.
Description:
There are several tests on 'ride' and 'white' that are failing in our automated ATDM builds being posted to:
like the test TeuchosComm_Comm_test_MPI_4
run on ride
in the build Trilinos-atdm-white-ride-gnu-opt-openmp
shown at:
that shows the failure:
--------------------------------------------------------------------------
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that has less cpus than
requested ones:
#cpus-per-proc: 8
number of cpus: 7
map-by: BYSOCKET:OVERSUBSCRIBE
Please specify a mapping level that has more cpus, or else let us
define a default mapping that will allow multiple cpus-per-proc.
--------------------------------------------------------------------------
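The numbers in this message point to a simple capacity problem: each rank requests 8 cpus (PE=8), but the socket it is mapped to offers only 7. A minimal Python sketch of that comparison follows. It assumes the condition is simply "requested cpus-per-proc must not exceed the cpus in the mapping object"; the function name `pe_fits` is hypothetical and this is not Open MPI's actual code.

```python
def pe_fits(cpus_per_proc: int, cpus_in_object: int) -> bool:
    """Return True if one rank's requested cpus fit in one mapping object.

    Assumption: this mirrors the condition described in the error text,
    not the real Open MPI implementation.
    """
    return cpus_per_proc <= cpus_in_object


# Values taken from the error message: 8 cpus-per-proc requested, 7 cpus available.
print(pe_fits(8, 7))  # False
# A smaller request such as PE=4 would fit in the same 7 cpus.
print(pe_fits(4, 7))  # True
```

Under this reading, any PE value of 7 or less would avoid the error on a socket that reports 7 cpus.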
I looked at most of the failing Teuchos tests on 'ride' as shown at:
and many of them (but not all) show this same error message.
The mpiexec command on this system is run using:
<full-path>/mpiexec \"-np\" \"4\" \"-map-by\" \"socket:PE=8\" \"--oversubscribe\" \"/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-gnu-opt-openmp/SRC_AND_BUILD/BUILD/packages/teuchos/comm/test/Comm/TeuchosComm_Comm_test.exe\
This set of mpiexec options -map-by;socket:PE=8;--oversubscribe
was taken from the EMPIRE configuration for Trilinos and it seems to work with the Panzer test suite as shown by that same build yesterday:
Is this set of options the cause of the problem and, if so, how do we fix it?
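To make it easier to try alternative values, the mapping options could be generated from the PE count instead of being hard-coded. The sketch below is only illustrative: `map_by_args` is a hypothetical helper, not part of the Trilinos configuration, and the semicolon-joined output assumes the options are passed as a semicolon-separated list like the `-map-by;socket:PE=8;--oversubscribe` form quoted above.

```python
def map_by_args(pe: int, oversubscribe: bool = False) -> list:
    """Build mpiexec mapping arguments for a given PE count (hypothetical helper)."""
    args = ["--map-by", f"socket:PE={pe}"]
    if oversubscribe:
        args.append("--oversubscribe")
    return args


# Options in the style currently used (PE=8 with oversubscription).
print(";".join(map_by_args(8, oversubscribe=True)))
# A candidate with fewer cpus per rank and no oversubscription.
print(";".join(map_by_args(4)))
```

Keeping the PE count as a single parameter means a trial value only has to be changed in one place before the nightly builds pick it up.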
Steps to Reproduce
Following the instructions at:
I tried to reproduce these failures on 'ride' using:
$ bsub -x -I -q rhel7F ./checkin-test-atdm.sh gnu-opt-openmp --enable-packages=Teuchos --local-do-all
and I got:
FAILED (NOT READY TO PUSH): Trilinos: ride12
Fri Mar 16 09:38:43 MDT 2018
Enabled Packages: Teuchos
Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) gnu-opt-openmp => FAILED: passed=130,notpassed=1 => Not ready to push! (7.11 min)
The only failing test was:
99% tests passed, 1 tests failed out of 131
Subproject Time Summary:
Teuchos = 434.10 sec*proc (131 tests)
Total Test time (real) = 31.92 sec
The following tests FAILED:
49 - TeuchosCore_TypeConversions_UnitTest_MPI_1 (Failed)
Errors while running CTest
(which I will create another GitHub issue for).
Therefore, I can't seem to reproduce this error locally, so we may have to guess at how to fix this and let the nightly builds run.