Test TeuchosComm_TimeMonitor_UnitTests_MPI_3 randomly failing in ATDM builds of Trilinos
Created by: bartlettroscoe
CC: @trilinos/teuchos
Next Action Status
Commit a4818467 was merged to 'develop' on 3/30/2018, and no timeouts have been seen since then.
Description
As shown in the query from 2018-02-01 to 2018-03-31:
the test TeuchosComm_TimeMonitor_UnitTests_MPI_3
appears to be failing randomly. It failed 17 times in the following four ATDM builds on 'white' and 'ride':
Trilinos-atdm-white-ride-cuda-debug
Trilinos-atdm-white-ride-cuda-opt
Trilinos-atdm-white-ride-gnu-debug-openmp
Trilinos-atdm-white-ride-gnu-opt-openmp
Table of failing tests (sorted by date)
Site | Build Name | Status | Time | Details | Build Time |
---|---|---|---|---|---|
white | cuda-debug | Failed | 10m 20ms | Completed (Timeout) | 2018-03-30T06:20:33 UTC |
ride | gnu-opt-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-30T06:11:48 UTC |
ride | cuda-opt | Failed | 10m 20ms | Completed (Timeout) | 2018-03-27T06:50:29 UTC |
white | gnu-debug-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-27T06:16:00 UTC |
white | cuda-debug | Failed | 10m 20ms | Completed (Timeout) | 2018-03-22T06:17:49 UTC |
ride | gnu-opt-openmp | Failed | 10m 60ms | Completed (Timeout) | 2018-03-20T06:18:19 UTC |
ride | gnu-opt-openmp | Failed | 1s 650ms | Completed (Failed) | 2018-03-16T06:12:14 UTC |
ride | cuda-opt | Failed | 2s 180ms | Completed (Failed) | 2018-03-15T06:50:23 UTC |
white | cuda-debug | Failed | 10m 20ms | Completed (Timeout) | 2018-03-15T06:16:34 UTC |
ride | gnu-debug-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-14T06:58:58 UTC |
ride | cuda-debug | Failed | 10m 20ms | Completed (Timeout) | 2018-03-14T06:32:20 UTC |
white | gnu-debug-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-14T06:16:13 UTC |
ride | gnu-debug-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-11T18:37:16 UTC |
white | cuda-debug | Failed | 10m 20ms | Completed (Timeout) | 2018-03-10T07:17:18 UTC |
white | cuda-opt | Failed | 10m 20ms | Completed (Timeout) | 2018-03-07T08:13:51 UTC |
white | gnu-debug-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-07T05:07:23 UTC |
white | gnu-opt-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-07T05:02:21 UTC |
NOTE: All of the above builds really start with Trilinos-atdm-white-ride-
but that was removed to shorten the "Build Name" column.
Note that all but two of the above tests timed out. The two that failed completed in under 3 sec.
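To double-check that count, here is a minimal Python sketch that tallies the rows of the table above by the "Details" and "Build Name" columns. The rows are copied verbatim from the table; the `tally` helper is a hypothetical name introduced only for this illustration and is not part of Trilinos or CDash.

```python
from collections import Counter

# Rows copied from the table above:
# site | build name | status | time | details | build time
ROWS = """\
white | cuda-debug | Failed | 10m 20ms | Completed (Timeout) | 2018-03-30T06:20:33 UTC |
ride | gnu-opt-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-30T06:11:48 UTC |
ride | cuda-opt | Failed | 10m 20ms | Completed (Timeout) | 2018-03-27T06:50:29 UTC |
white | gnu-debug-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-27T06:16:00 UTC |
white | cuda-debug | Failed | 10m 20ms | Completed (Timeout) | 2018-03-22T06:17:49 UTC |
ride | gnu-opt-openmp | Failed | 10m 60ms | Completed (Timeout) | 2018-03-20T06:18:19 UTC |
ride | gnu-opt-openmp | Failed | 1s 650ms | Completed (Failed) | 2018-03-16T06:12:14 UTC |
ride | cuda-opt | Failed | 2s 180ms | Completed (Failed) | 2018-03-15T06:50:23 UTC |
white | cuda-debug | Failed | 10m 20ms | Completed (Timeout) | 2018-03-15T06:16:34 UTC |
ride | gnu-debug-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-14T06:58:58 UTC |
ride | cuda-debug | Failed | 10m 20ms | Completed (Timeout) | 2018-03-14T06:32:20 UTC |
white | gnu-debug-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-14T06:16:13 UTC |
ride | gnu-debug-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-11T18:37:16 UTC |
white | cuda-debug | Failed | 10m 20ms | Completed (Timeout) | 2018-03-10T07:17:18 UTC |
white | cuda-opt | Failed | 10m 20ms | Completed (Timeout) | 2018-03-07T08:13:51 UTC |
white | gnu-debug-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-07T05:07:23 UTC |
white | gnu-opt-openmp | Failed | 10m 20ms | Completed (Timeout) | 2018-03-07T05:02:21 UTC |
"""


def tally(rows: str, column: int) -> Counter:
    """Count how often each value appears in the given 0-based column."""
    counts = Counter()
    for line in rows.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) > column and cells[column]:
            counts[cells[column]] += 1
    return counts


by_details = tally(ROWS, 4)  # "Completed (Timeout)" vs. "Completed (Failed)"
by_build = tally(ROWS, 1)    # shortened build name
print(by_details)
print(by_build)
```

On these rows the tally gives 15 timeouts and 2 non-timeout failures, spread across all four build names.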
When the test times out, it looks like it passes, for example as shown today at:
which shows:
***
*** Unit test suite ...
***
Sorting tests by group name then by the order they were added ... (time = 3.1e-06)
Running unit tests ...
0. TimeMonitor_FUNC_TIME_MONITOR_UnitTest ... [Passed] (0.00122 sec)
1. TimeMonitor_enableTimer_UnitTest ... [Passed] (0.00648 sec)
2. TimeMonitor_YamlLabelQuoting_UnitTest ... [Passed] (0.00905 sec)
3. TimeMonitor_TimerLabelFiltering_UnitTest ... [Passed] (0.00778 sec)
4. TimeMonitor_FUNC_TIME_MONITOR_tested_UnitTest ... [Passed] (0.00117 sec)
5. TimeMonitor_SUMMARIZE_diffTimerSets_UnitTest ... [Passed] (0.156 sec)
6. TimeMonitor_FILTER_ZERO_TIMERS_sameParallelCallCounts_UnitTest ... [Passed] (0.341 sec)
7. TimeMonitor_FILTER_ZERO_TIMERS_differentParallelCallCounts_UnitTest ... [Passed] (1.28 sec)
8. TimeMonitor_IgnoreMissingTimers_UnitTest ... [Passed] (1 sec)
Total Time: 2.8 sec
Summary: total = 9, run = 9, passed = 9, failed = 0
End Result: TEST PASSED
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
So it looks like one of the other processes is failing for some reason, and that causes the remaining processes to hang.
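For triage, a run that prints "TEST PASSED" and then hangs can be told apart from an ordinary failure with a small check on the captured output. The following is only a sketch under assumptions: the `classify` function and its labels are hypothetical, the `timed_out` flag is assumed to come from whatever harness ran the test, and the match strings are taken from the log quoted above.

```python
import re

# Strings taken from the log output quoted above.
PASSED = "End Result: TEST PASSED"
NONZERO = re.compile(r"returned\s+a non-zero exit code")


def classify(output: str, timed_out: bool) -> str:
    """Label one run from its captured output and the harness timeout flag.

    The labels are hypothetical names used only for this illustration.
    """
    printed_pass = PASSED in output
    nonzero_rank = NONZERO.search(output) is not None
    if timed_out and printed_pass:
        return "hang-after-pass"
    if timed_out:
        return "hang"
    if printed_pass and nonzero_rank:
        return "pass-with-nonzero-rank"
    return "pass" if printed_pass else "fail"


# Excerpt of the output quoted above.
EXCERPT = """\
Summary: total = 9, run = 9, passed = 9, failed = 0
End Result: TEST PASSED
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
"""

print(classify(EXCERPT, timed_out=True))
print(classify(EXCERPT, timed_out=False))
```

The point of the separate labels is that the same text can come from a run that hung and from one that finished, so the timeout flag has to be recorded alongside the output.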
When it did fail, for example on 2018-03-16 shown at:
it showed:
--------------------------------------------------------------------------
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that has less cpus than
requested ones:
#cpus-per-proc: 8
number of cpus: 7
map-by: BYSOCKET:OVERSUBSCRIBE
Please specify a mapping level that has more cpus, or else let us
define a default mapping that will allow multiple cpus-per-proc.
--------------------------------------------------------------------------
I think those failures were addressed by commit 114ca53c (dated "Fri Mar 16 10:17:14 2018"), which was pushed that day. Therefore, I think we can ignore these failures. Since 2018-03-16, there have been only random timeouts, so we will focus on just those.
Looking at the full history for this test for the build Trilinos-atdm-white-ride-cuda-debug on 'white' (which is targeted for a Trilinos auto PR build in #2464 (closed)) in the query:
You can see that, other than timing out at 10 minutes today, the test timed out on three other days: 2018-03-22, 2018-03-15, and 2018-03-10. When the test passes, it does so in less than 6 sec in all cases. When it passes, for example yesterday in the above query, it shows:
***
*** Unit test suite ...
***
Sorting tests by group name then by the order they were added ... (time = 5.96e-06)
Running unit tests ...
0. TimeMonitor_FUNC_TIME_MONITOR_UnitTest ... [Passed] (0.00168 sec)
1. TimeMonitor_enableTimer_UnitTest ... [Passed] (0.00647 sec)
2. TimeMonitor_YamlLabelQuoting_UnitTest ... [Passed] (0.00907 sec)
3. TimeMonitor_TimerLabelFiltering_UnitTest ... [Passed] (0.00782 sec)
4. TimeMonitor_FUNC_TIME_MONITOR_tested_UnitTest ... [Passed] (0.00117 sec)
5. TimeMonitor_SUMMARIZE_diffTimerSets_UnitTest ... [Passed] (0.156 sec)
6. TimeMonitor_FILTER_ZERO_TIMERS_sameParallelCallCounts_UnitTest ... [Passed] (0.34 sec)
7. TimeMonitor_FILTER_ZERO_TIMERS_differentParallelCallCounts_UnitTest ... [Passed] (1.28 sec)
8. TimeMonitor_IgnoreMissingTimers_UnitTest ... [Passed] (1 sec)
Total Time: 2.8 sec
Summary: total = 9, run = 9, passed = 9, failed = 0
End Result: TEST PASSED
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[51006,1],2]
Exit code: 1
--------------------------------------------------------------------------
Wow, that is the same output as when the test times out!
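That comparison was made by eye. A minimal sketch of how it could be checked mechanically is to replace the timing values, which differ from run to run, with a placeholder and then compare the remaining lines. The `normalize` helper and the `(<time>)` placeholder are hypothetical; the excerpts are lines copied from the two logs quoted above. Only the unit-test portion is compared here, since the passing log quoted above also ends with an mpiexec block that does not appear in the timeout log quoted earlier.

```python
import re

# Matches timing fields such as "(0.00122 sec)" or "(time = 3.1e-06)".
_TIMING = re.compile(r"\((?:time = )?[0-9.eE+-]+(?: sec)?\)")


def normalize(log: str) -> list:
    """Return the non-empty lines of a log with timing values masked."""
    return [
        _TIMING.sub("(<time>)", line.strip())
        for line in log.splitlines()
        if line.strip()
    ]


# Lines copied from the timed-out run quoted above.
TIMEOUT_EXCERPT = """\
Sorting tests by group name then by the order they were added ... (time = 3.1e-06)
0. TimeMonitor_FUNC_TIME_MONITOR_UnitTest ... [Passed] (0.00122 sec)
6. TimeMonitor_FILTER_ZERO_TIMERS_sameParallelCallCounts_UnitTest ... [Passed] (0.341 sec)
End Result: TEST PASSED
"""

# The same lines copied from the passing run quoted above.
PASSING_EXCERPT = """\
Sorting tests by group name then by the order they were added ... (time = 5.96e-06)
0. TimeMonitor_FUNC_TIME_MONITOR_UnitTest ... [Passed] (0.00168 sec)
6. TimeMonitor_FILTER_ZERO_TIMERS_sameParallelCallCounts_UnitTest ... [Passed] (0.34 sec)
End Result: TEST PASSED
"""

same = normalize(TIMEOUT_EXCERPT) == normalize(PASSING_EXCERPT)
print("identical after masking timings:", same)
```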
Steps to Reproduce
This should be reproducible by logging onto 'white' or 'ride', running any of the supported builds for these platforms shown above, and then enabling and running the Teuchos tests as documented at:
Related Issues
- Related to: #2464 (closed)