Trilinos issues
https://gitlab.osti.gov/jmwille/Trilinos/-/issues
2019-01-24T23:43:03Z
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2270
New timed-out Amesos2 tests in Trilinos-atdm-sems-gcc-7-2-0 build on 2/20/2018
2019-01-24T23:43:03Z
James Willenbring
*Created by: bartlettroscoe*
**CC:** @trilinos/amesos2
## Next Action Status
The tests were disabled in all `Trilinos_ENABLE_DEBUG=ON` builds on 2/22/2018 (see [below](https://github.com/trilinos/Trilinos/issues/2270#issuecomment-367815146)). Next: Fix the tests so that they pass?
## Description
The tests:
* Amesos2_KLU2_UnitTests_MPI_2
* Amesos2_Superlu_UnitTests_MPI_2
timed out at 10 minutes in the build `Trilinos-atdm-sems-gcc-7-2-0` this morning as shown at:
* https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3396946
Prior to this morning, these tests were taking:
* Amesos2_KLU2_UnitTests_MPI_2: 1.5s
* Amesos2_Superlu_UnitTests_MPI_2: 1.7s
It looks like these tests are hanging due to an exception being thrown?
## Steps to Reproduce
Using the `do-configure` script:
```
#!/bin/bash
cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/sems/atdm/SEMSATDMSettings.cmake,cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake \
-DDART_TESTING_TIMEOUT:STRING=300.0 \
-DTrilinos_ENABLE_TESTS:BOOL=ON \
-DCTEST_BUILD_FLAGS=-j10 \
-DCTEST_PARALLEL_LEVEL=10 \
"$@" \
$TRILINOS_DIR
```
Anyone should be able to reproduce these failures on any SNL COE RHEL6 machine as shown below:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh
$ ./do-configure -DTrilinos_ENABLE_Amesos2=ON
$ make -j16
$ ctest -j16
```
NOTE: A timeout such as `-DDART_TESTING_TIMEOUT:STRING=300.0` is important; otherwise ctest will never finish.
Keep promoted "ATDM" builds of Trilinos clean
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3282
Random test failure for KokkosAlgorithms_UnitTest_MPI_1 in ATDM build?
2018-11-30T11:16:53Z
James Willenbring
*Created by: bartlettroscoe*
CC: @trilinos/kokkos, @fryeguy52, @kddevin (Trilinos Data Services Product Lead)
## Next Action Status
The test `KokkosAlgorithms_UnitTest_MPI_1` is expected to randomly fail occasionally as a compromise between test runtime and not randomly failing more often. Next: Watch for this failing test more and decide how to handle it longer-term (such as treating it as "expected may fail" as part of #2933) ....
## Description
The test `KokkosAlgorithms_UnitTest_MPI_1` looks to have had a random failure in the ATDM Trilinos build `Trilinos-atdm-hansen-shiller-intel-opt-serial` shown [here](https://testing-vm.sandia.gov/cdash/testDetails.php?test=52067584&build=3819525) which shows the output:
```
[ RUN ] serial.Random_XorShift1024
Test Seed:1533901314858176575
Test Scalar=int
-- Testing randomness properties
Pass: 1 1 -2.75867e-05 -4.42521e-05 0.000158617 || 0.000502704
-- Testing 1-D histogram
Density 1D: 7.26597e-05 0.0178458 0.00132233 || 0.051031 2035 2407 || 2159.68 2198.22 || 18.2798 -0.159026
-- Testing 3-D histogram
Density 3D: 7.26597e-05 -0.00698725 -3.92889e-05 || 0.051031 1e+64 -1e+64
Test Scalar=unsigned int
-- Testing randomness properties
Pass: 1 1 1.68802e-05 9.3101e-05 -8.45415e-05 || 0.000502704
-- Testing 1-D histogram
Density 1D: 7.26597e-05 -0.00149644 -0.000557499 || 0.051031 2025 2373 || 2201.52 2198.22 || -7.70686 -0.159026
-- Testing 3-D histogram
Density 3D: 7.26597e-05 -0.00456695 -0.00055943 || 0.051031 1e+64 -1e+64
Test Scalar=int64_t
-- Testing randomness properties
Pass: 1 0 1.89669e-05 0.000837141 -0.000265228 || 0.000502704
-- Testing 1-D histogram
Density 1D: 7.26597e-05 -0.0166374 0.00138102 || 0.051031 2005 2386 || 2235.41 2198.22 || 19.0912 -0.159026
-- Testing 3-D histogram
Density 3D: 7.26597e-05 0.000533505 -0.000342069 || 0.051031 1e+64 -1e+64
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-intel-opt-serial/SRC_AND_BUILD/Trilinos/packages/kokkos/algorithms/unit_tests/TestRandom.hpp:426: Failure
Value of: 1
Expected: test_int64.pass_var
Which is: 0
[ FAILED ] serial.Random_XorShift1024 (1840 ms)
[ RUN ] serial.SortUnsigned
[ OK ] serial.SortUnsigned (2699 ms)
[----------] 3 tests from serial (8593 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test case ran. (8593 ms total)
[ PASSED ] 2 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] serial.Random_XorShift1024
1 FAILED TEST
```
There were no updates to Kokkos from the previous day for this build as shown [here](https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3819519#!#note6). Therefore, one would assume this is a random failure of some type.
## Steps to reproduce
One should be able to produce this build and run this test on either 'hansen' or 'shiller' as described at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, one can follow the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#shillerhansen
and use the build name `intel-opt-serial` and enable the package `Kokkos` as:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-serial
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Kokkos=ON \
$TRILINOS_DIR
$ make NP=16
$ srun ctest -j16
```
But given that this test looks to have randomly failed, it might be hard to reproduce.
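One way to get a feel for how often such a test fails is to run it repeatedly and count the failures. The following is only a sketch: the helper name `count_failures` is made up for illustration and is not part of Trilinos or CTest, and it treats any non-zero exit status as a failure.

```shell
# Hypothetical helper: run a command N times and report how many runs
# exited with a non-zero status. Output of the command is discarded.
count_failures() {
  runs="$1"; shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" > /dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures/$runs runs failed"
}

# Trivial demonstration with a command that always fails:
count_failures 3 false   # prints "3/3 runs failed"
```

From the build directory, something like `count_failures 100 ctest -R KokkosAlgorithms_UnitTest_MPI_1` would then give a rough failure rate, assuming the environment has been loaded as shown above.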
Keep promoted "ATDM" builds of Trilinos clean
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2919
Belos_rcg_hb_MPI_4 timing out in several ATDM Trilinos builds on 'hansen' since 5/29/2018
2018-11-30T11:16:53Z
James Willenbring
*Created by: bartlettroscoe*
CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solves Project Lead)
## Next Action Status
Test was disabled in these builds on 'hansen' in the commit 8850c64 pushed on 6/12/2018 and was shown to be disabled in the builds on CDash on 6/13/2018.
## Description
As shown in [this large query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=19&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=62&value2=Trilinos-atdm-mutrino-intel-debug-openmp&field3=buildname&compare3=62&value3=Trilinos-atdm-mutrino-intel-opt-openmp&field4=buildname&compare4=62&value4=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field5=buildname&compare5=62&value5=Trilinos-atdm-serrano-intel-debug-openmp&field6=buildname&compare6=62&value6=Trilinos-atdm-serrano-intel-opt-openmp&field7=buildname&compare7=62&value7=Trilinos-atdm-chama-intel-opt-openmp&field8=buildname&compare8=62&value8=Trilinos-atdm-chama-intel-debug-openmp-panzer&field9=buildname&compare9=62&value9=Trilinos-atdm-chama-intel-debug-openmp&field10=buildname&compare10=62&value10=Trilinos-atdm-chama-intel-opt-openmp-panzer&field11=site&compare11=62&value11=ride&field12=testname&compare12=61&value12=Belos_rcg_hb_MPI_4&field13=buildstarttime&compare13=84&value13=2018-06-08&field14=buildstarttime&compare14=83&value14=2018-05-10&field15=buildname&compare15=62&value15=Trilinos-atdm-white-ride-cuda-opt&field16=buildname&compare16=62&value16=Trilinos-atdm-white-ride-gnu-opt-openmp&field17=site&compare17=62&value17=serrano&field18=site&compare18=62&value18=shiller&field19=buildname&compare19=62&value19=Trilinos-atdm-white-ride-cuda-debug-all-at-once) the test `Belos_rcg_hb_MPI_4` looks to be consistently timing out in the builds:
* Trilinos-atdm-hansen-shiller-cuda-8.0-debug
* Trilinos-atdm-hansen-shiller-cuda-8.0-opt
* Trilinos-atdm-hansen-shiller-cuda-9.0-debug
* Trilinos-atdm-hansen-shiller-cuda-9.0-opt
* Trilinos-atdm-hansen-shiller-gnu-debug-serial
* Trilinos-atdm-hansen-shiller-gnu-opt-serial
all on 'hansen' starting on 5/29/2018 or 5/30/2018. (Since these builds are pulling directly from the 'develop' branch, they may be testing different versions on the same day, and this is UTC time, so they may be on the same testing day in Mountain time.)
That same query shows that that test has been consistently passing in every other promoted build on every other ATDM Trilinos testing machine.
What that query also shows is that in those same builds that are now timing out, the test was taking around 6 minutes or more to complete before it started timing out at 10 minutes on 5/29/2018 or 5/30/2018, as shown in the last non-timing-out builds:
* Trilinos-atdm-hansen-shiller-cuda-8.0-debug: 6m 26s 280ms
* Trilinos-atdm-hansen-shiller-cuda-8.0-opt: 6m 25s 680ms
* Trilinos-atdm-hansen-shiller-cuda-9.0-debug: 6m 22s 810ms
* Trilinos-atdm-hansen-shiller-cuda-9.0-opt: 6m 22s 440ms
* Trilinos-atdm-hansen-shiller-gnu-debug-serial: 6m 13s 150ms
* Trilinos-atdm-hansen-shiller-gnu-opt-serial: 5m 58s 960ms
But in the other builds that are not showing any timeouts, that test completes very fast (in under 30 seconds in nearly every case). Some of the recent test times shown in that query for the various builds that don't have timeouts now are:
* Trilinos-atdm-hansen-shiller-gnu-debug-openmp: 23s 850ms
* Trilinos-atdm-hansen-shiller-gnu-opt-openmp: 8s 650ms
* Trilinos-atdm-hansen-shiller-intel-debug-openmp: 7s 720ms
* Trilinos-atdm-hansen-shiller-intel-debug-serial: 7s 950ms
* Trilinos-atdm-hansen-shiller-intel-opt-openmp: 6s 150ms
* Trilinos-atdm-hansen-shiller-intel-opt-serial: 5s 910ms
* Trilinos-atdm-rhel6-gnu-debug-openmp: 6s 840ms
* Trilinos-atdm-rhel6-gnu-debug-serial: 5s 340ms
* Trilinos-atdm-rhel6-gnu-opt-openmp: 5s 180ms
* Trilinos-atdm-rhel6-gnu-opt-serial: 4s 250ms
* Trilinos-atdm-rhel6-intel-opt-openmp: 3s 740ms
* Trilinos-atdm-sems-gcc-7-2-0: 5s 290ms
* Trilinos-atdm-white-ride-cuda-debug: 9s 430ms
* Trilinos-atdm-white-ride-gnu-debug-openmp: 9s 90ms
So this seems pretty crazy. How can the same test take over 6 minutes to complete for a CUDA 8.0 and 9.0 optimized build on 'hansen' and only take 9 seconds for a CUDA debug build on 'white'? And this test takes a very long time (and is now timing out) for the `gnu-debug-serial` and `gnu-opt-serial` builds as well on 'hansen' but is fast for the `intel-debug-serial` and `intel-opt-serial` builds on the same machine. How can that be the case?
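To put a rough number on that gap, the CDash time strings listed above can be converted to seconds and divided. This is only a sketch: the helper names `to_seconds` and `slowdown` are made up for illustration, and the two inputs are the `cuda-8.0-opt` time on 'hansen' and the `cuda-debug` time on 'white' taken from the lists above.

```shell
# Hypothetical helper: convert a CDash-style time such as "6m 25s 680ms"
# into seconds. The "ms" suffix is checked first so it is not read as "s".
to_seconds() {
  echo "$1" | awk '{
    t = 0
    for (i = 1; i <= NF; i++) {
      if ($i ~ /ms$/)     t += substr($i, 1, length($i) - 2) / 1000
      else if ($i ~ /m$/) t += substr($i, 1, length($i) - 1) * 60
      else if ($i ~ /s$/) t += substr($i, 1, length($i) - 1)
    }
    printf "%.2f\n", t
  }'
}

# Hypothetical helper: ratio of the first time to the second time.
slowdown() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.1fx\n", a / b }'
}

to_seconds "6m 25s 680ms"            # prints 385.68
slowdown "6m 25s 680ms" "9s 430ms"   # prints 40.9x
```

So the optimized CUDA 8.0 build on 'hansen' was roughly 40 times slower than the CUDA debug build on 'white' even before the timeouts started.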
To try to get more insight about this test we can look at the test output for a case where it takes a long time to run (and is timing out currently) and compare that to the test output for a case that completes very quickly.
First, let's look at the last time this test passed for the `Trilinos-atdm-hansen-shiller-gnu-debug-serial` build on 'hansen', which took 6m 13s 150ms to complete and pass on 2018-05-29T06:41:19 UTC with output shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=47454651&build=3555977
which shows:
```
Passed.......OR Combination ->
OK...........Number of Iterations = 2206 < 4000
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.56537e-07 < 1e-06
residual [ 1 ] = 9.4486e-07 < 1e-06
residual [ 2 ] = 9.24543e-07 < 1e-06
residual [ 3 ] = 9.44363e-07 < 1e-06
residual [ 4 ] = 9.64382e-07 < 1e-06
residual [ 5 ] = 9.14533e-07 < 1e-06
residual [ 6 ] = 9.50517e-07 < 1e-06
residual [ 7 ] = 8.31671e-07 < 1e-06
residual [ 8 ] = 9.59686e-07 < 1e-06
residual [ 9 ] = 9.74218e-07 < 1e-06
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 1.489 (2.114e+04) 1.582 (2.114e+04) 1.668 (2.114e+04) 7.483e-05 (2.114e+04)
Belos: Operation Prec*x 0 (0) 0 (0) 0 (0) 0 (0)
Belos: RCGSolMgr total solve time 365.4 (1) 365.4 (1) 365.4 (1) 365.4 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 1.45 (2.114e+04) 1.542 (2.114e+04) 1.629 (2.114e+04) 7.295e-05 (2.114e+04)
==================================================================================================================================
```
And let's compare this to the test output for the build `Trilinos-atdm-hansen-shiller-intel-debug-serial` on 'hansen' which took 6s 740ms to complete and pass on 2018-05-29T14:52:35 UTC shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=47482010&build=3557186
which shows:
```
Passed.......OR Combination ->
OK...........Number of Iterations = 2131 < 4000
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.5909e-07 < 1e-06
residual [ 1 ] = 9.65321e-07 < 1e-06
residual [ 2 ] = 8.59334e-07 < 1e-06
residual [ 3 ] = 9.55053e-07 < 1e-06
residual [ 4 ] = 9.97094e-07 < 1e-06
residual [ 5 ] = 7.53902e-07 < 1e-06
residual [ 6 ] = 8.46489e-07 < 1e-06
residual [ 7 ] = 9.64082e-07 < 1e-06
residual [ 8 ] = 9.92318e-07 < 1e-06
residual [ 9 ] = 9.92263e-07 < 1e-06
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 2.026 (2.109e+04) 2.179 (2.109e+04) 2.403 (2.109e+04) 0.0001033 (2.109e+04)
Belos: Operation Prec*x 0 (0) 0 (0) 0 (0) 0 (0)
Belos: RCGSolMgr total solve time 5.945 (1) 5.946 (1) 5.946 (1) 5.946 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 1.975 (2.109e+04) 2.116 (2.109e+04) 2.316 (2.109e+04) 0.0001003 (2.109e+04)
==================================================================================================================================
```
The times for the individual operations are not that different, but "Belos: RCGSolMgr total solve time" at 365.4 vs. 5.946 is the real problem. The final results show that the test is doing different computations in these two builds, but the total number of operations is not radically different (e.g., 2.114e+04 vs. 2.109e+04 mat-vecs). So what is going on here to cause the huge increase in wall clock time for a serial Kokkos threading test?
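A quick subtraction shows how much of the solve time the existing timers do not explain. This is a sketch under stated assumptions: it uses the MaxOverProcs column from the two TimeMonitor tables above, it assumes those values are wall-clock seconds, and the helper name `unaccounted` is made up for illustration.

```shell
# Hypothetical helper: solve time not covered by the "Belos: Operation Op*x" timer.
# Arguments: total solve time and Op*x time (MaxOverProcs values, assumed seconds).
unaccounted() {
  awk -v total="$1" -v op="$2" 'BEGIN {
    rest = total - op
    printf "%.1f s unaccounted (%.1f%% of total)\n", rest, 100 * rest / total
  }'
}

unaccounted 365.4 1.668   # gnu-debug-serial:   363.7 s unaccounted (99.5% of total)
unaccounted 5.946 2.403   # intel-debug-serial: 3.5 s unaccounted (59.6% of total)
```

Under those assumptions, nearly all of the GCC build's solve time falls outside the timed mat-vec operations.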
Looking at the new commits pulled in when this started to fail for the build `Trilinos-atdm-hansen-shiller-gnu-opt-serial` on 2018-05-29 14:05:09 shown at:
* https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3560199#!#note0
it is hard to tell what might have caused these tests to start timing out. I would guess that the most likely trigger was:
```
c840658: Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:12:42 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/shiller/environment.sh
```
That will increase the number of tests running on the machine and could result in single tests taking longer to run.
But the fact that the same test takes 6 minutes with GCC but only takes 7 seconds with Intel is a major problem, in my opinion, and that has to be investigated.
Someone is going to need to add some more timers to account for where the time is going.
## Steps to reproduce
One should be able to follow the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
to reproduce this behavior on 'hansen' or 'shiller'. To avoid needing to run on a compute node, one could use the `gnu-debug-serial` build and do:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-serial
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -VV -R Belos_rcg_hb_MPI_4
```
Keep promoted "ATDM" builds of Trilinos clean
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2925
Test Stratimikos_test_aztecoo_thyra_driver_MPI_1 timing out in Trilinos-atdm-hansen-shiller-gnu-debug-serial build since 5/30/2018
2018-11-30T11:16:53Z
James Willenbring
*Created by: bartlettroscoe*
CC: @trilinos/stratimikos, @fryeguy52
## Next Action Status
Test was disabled for these two builds on 'hansen' in commit 73ae19c pushed on 6/12/2018 and this test disappeared in these builds on 6/13/2018.
## Description
As shown in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-11&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=11&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=status&compare2=62&value2=passed&field3=status&compare3=62&value3=notrun&field4=buildname&compare4=62&value4=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field5=site&compare5=62&value5=mutrino&field6=site&compare6=62&value6=serrano&field7=site&compare7=62&value7=chama&field8=site&compare8=62&value8=ride&field9=buildstarttime&compare9=84&value9=2018-06-11&field10=buildstarttime&compare10=83&value10=2018-05-20&field11=testname&compare11=65&value11=Stratimikos), the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` has been timing out in the builds `Trilinos-atdm-hansen-shiller-gnu-debug-serial` and `Trilinos-atdm-hansen-shiller-gnu-opt-serial` since 5/30/2018. (That query also shows this is the only Stratimikos test that has failed in any of the promoted "ATDM" builds since 5/20/2018.)
[This query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-11&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=61&value2=Trilinos-atdm-hansen-shiller-gnu-debug-serial&field3=buildstarttime&compare3=84&value3=2018-06-11&field4=buildstarttime&compare4=83&value4=2018-05-20&field5=testname&compare5=61&value5=Stratimikos_test_aztecoo_thyra_driver_MPI_1) shows that the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` went from passing at under 21s every day to timing out at 10 minutes every day since 5/29/2018 (but it did pass once taking 9m 56s 930ms on 6/8/2018, the only time it did not time-out since 5/29/2018).
What changed from 5/29/2018 to 5/30/2018? Looking at the updates pulled in for the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial` with build stamp `20180530-0400-ATDM` shown at:
* https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3558860#!#note0
it seems like the only commits that could have impacted this were:
```
c9ccf7d: Switch from srun to salloc on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:35:16 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/README.md
c840658: Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:12:42 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/shiller/environment.sh
```
There are no other commits that I could see that could impact this AztecOO test. So it looks like moving to CMake/CTest 3.11.2 and to the all-at-once approach triggered this large increase in runtime for the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` for the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial`. This may have been a result of having more tests running while this Stratimikos test is running.
Looking in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-10&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=61&value2=Stratimikos_test_aztecoo_thyra_driver_MPI_1), we can see that the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` timed out in the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial` yesterday (6/10/2018) but it took upwards of 2.5 to 3.5 minutes to run in the CUDA builds. Otherwise, this test did not take any longer than 22s to run in all of the other ATDM builds of Trilinos. And what is also interesting is that query showed that this test passed in 4s 460ms for the build `Trilinos-atdm-hansen-shiller-intel-debug-serial` also run on 'hansen'. How can the same test pass on an `intel-debug-serial` build in under 5 seconds but then time out at 10 minutes for a `gnu-debug-serial` build on the same hardware with the same MPI implementation and settings?
For that matter, [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-10&filtercombine=and&filtercount=1&showfilters=1&field1=testname&compare1=61&value1=Stratimikos_test_aztecoo_thyra_driver_MPI_1) shows that other than the CUDA builds of Trilinos and the yet-to-be-cleaned-up 'mutrino' build `Trilinos-atdm-mutrino-intel-debug-openmp`, this test did not take any longer than 22s to run in any of the 46 Trilinos builds where this test ran yesterday. On some platforms, this test completed in less than 2s!
This is very strange behavior for a test. There must be some type of machine or system usage issue going on here. But why would it impact a `gnu-debug-serial` build but not an `intel-debug-serial` build on the same machine?
## Steps to reproduce
Following the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#shillerhansen
one can log on to 'hansen' or 'shiller', clone Trilinos and get on to the 'develop' branch, and then do:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-openmp
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stratimikos=ON \
$TRILINOS_DIR
$ make NP=16
$ salloc ctest -j16
```
I did this on 'shiller' but unfortunately all of the Stratimikos tests passed:
```
100% tests passed, 0 tests failed out of 40
Subproject Time Summary:
Stratimikos = 256.50 sec*proc (40 tests)
Total Test time (real) = 20.84 sec
```
Therefore, I was not able to reproduce this behavior on 'shiller', so this must be some type of system issue.
Keep promoted "ATDM" builds of Trilinos clean
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2751
Address timing out (or failing) test PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3 test in the Trilinos-atdm-toss3-intel-debug-openmp-panzer build
2018-11-30T11:16:53Z
James Willenbring
*Created by: bartlettroscoe*
Summary:
**CC:** @trilinos/panzer, @fryeguy52
## Next Action Status
After the commit 652a011 was merged on 5/16/2018, the test `PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3` disappeared from the build `Trilinos-atdm-toss3-intel-debug-openmp-panzer` starting 5/17/2018. PR #3559 merged on 10/8/2018 disables this test on 'waterman' `cuda-9.2` builds and the test was seen removed on CDash on 10/9/2018. Next: Fix this?
## Description
As can be seen in the query:
* https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-toss3-intel-debug-openmp-panzer&field2=testname&compare2=65&value2=PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3&field3=buildstarttime&compare3=84&value3=now
the test `PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3` is the only failing test in the Panzer-only test build `Trilinos-atdm-toss3-intel-debug-openmp-panzer` over the last few days, except when there are system issues that cause many of the tests to fail (see #2699).
In all but one of the builds where this test was the only Panzer test that failed, it timed out at 10 minutes. The one exception was on 2018-05-10, where it failed as shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=46467457&build=3498830
which showed the failure:
```
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:
hostname: ser285
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
```
## Steps to Reproduce
As described at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#chamaserrano
Clone the Trilinos git repo on 'serrano' and then do:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-debug-openmp
$ cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Panzer=ON \
$TRILINOS_DIR
$ make -j16
$ salloc -N1 --time=0:20:00 --account=<YOUR_WCID> ctest -VV -R PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3
```
## Related Issues
* Related to: #2699
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2712
Test Teko_testdriver_tpetra_MPI_1 is failing in new GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build
2018-11-30T11:16:53Z
James Willenbring
*Created by: bartlettroscoe*
**CC:** @trilinos/teko, @fryeguy52
## Next Action Status
PR #2715 was merged on 5/10/2018. The test `Teko_testdriver_tpetra_MPI_1` passed in build `GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP` on 5/11/2018. Next: Find someone to debug and fix the test.
## Description
The test `Teko_testdriver_tpetra_MPI_1` is failing in the new build `GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP` as shown at:
* https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP&field2=testname&compare2=61&value2=Teko_testdriver_tpetra_MPI_1&field3=buildstarttime&compare3=84&value3=now
The most recent run of this test shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=46422819&build=3496203
shows the failure output:
```
...
Teko: LSCPrecFact::buildPO BuildStateTime = 0.04688
Teko: LSCPrecFact::buildPO GetInvTime = 2.86102e-06
Teko: LSCPrecFact::buildPO TotalTime = 0.0475988
"strategy" ... FAILED ( PID = 0 )
Test "LSCStabilized_tpetra" completed ... FAILED (1)
...
Teko: Building LSC strategy "The Cat"
LSC Construction failed: Strategy "The Cat" could not be constructed
Teko: Begin debug MSG
Looked up "NS LSC"
Built Teuchos::RCP<Teko::PreconditionerFactory>{ptr=0x1d14a48,node=0x1e48230,strong_count=1,weak_count=0}
Teko: End debug MSG
LSC Construction failed: Strategy "The Cat" requires a "Strategy Settings" sublist
...
Tests Passed: 90, Tests Failed: 1
(Incidently, you want no failures)
Teko tests failed
```
**NOTE:** This build satisfies all of the requirements for the GCC 4.8.4 build described in #2317 and #2462. Other than the ShyLU_DD failures addressed in #2691, this failing test is the only test blocking this build from being 100% passing.
## Steps to Reproduce
One should be able to reproduce this failing test on any SNL COE RHEL6 machine that mounts the SEMS env. After cloning the Trilinos git repo and checking out the `develop` branch, one should be able to do:
```
$ cd <some-build-dir>/
$ source <trilinos-dir>/cmake/std/GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP_env.sh
$ cmake \
-C <trilinos-dir>/cmake/std/GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Teko=ON \
<trilinos-dir>
$ make -j16
$ ctest -VV -R Teko_testdriver_tpetra_MPI_1
```
## Related Issues
* Blocking Issues: #2462
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2691
Three ShyLU_DDFROSch_test_frosch_XXX tests failing in new GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build
2018-11-30T11:16:53Z
James Willenbring
*Created by: bartlettroscoe*
**CC:** @trilinos/shylu, @trilinos/framework, @srajama1
## Description
As shown at:
* https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&parentid=3490484
* https://testing-vm.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=3490522
the tests:
* `ShyLU_DDFROSch_test_frosch_interfacesets_2D_MPI_4`
* `ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_gdsw_MPI_4`
* `ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_rgdsw_MPI_4`
are failing in the new GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build (as run on the SNL COE RHEL6 machine crf450, which submits to CDash).
This build is getting cleaned up to provide the GCC 4.8.4 auto PR build described in #2317 and #2462.
These tests all fail by throwing the exception shown below:
```
terminate called after throwing an instance of 'Xpetra::Exceptions::RuntimeError'
Xpetra::Exceptions::RuntimeError'
what(): /ascldap/users/rabartl/Trilinos.base/NightlyBuilds/SRC_AND_BUILD/Trilinos/packages/xpetra/src/CrsMatrix/Xpetra_EpetraCrsMatrix.hpp:222:
Throw number = 1
Throw test that evaluated to true: true
Xpetra::EpetraCrsMatrix only available for GO=int or GO=long long with EpetraNode (Serial or OpenMP depending on configuration)
```
This then terminates the test program.
## Steps to reproduce
One should be able to reproduce these failing tests on any SNL COE RHEL6 machine that has the SEMS env. For example, on the CEE machine 'ceerws1113', I reproduced this by updating Trilinos and then doing:
```
$ cd <some-build-dir>/
$ source <trilinos-dir>/cmake/std/GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP_env.sh
$ module list
Currently Loaded Modulefiles:
1) sems-env
2) atdm-env
3) sems-python/2.7.9
4) atdm-cmake/3.11.1
5) sems-git/2.10.1
6) atdm-ninja_fortran/1.7.2
7) sems-gcc/4.8.4
8) sems-openmpi/1.10.1
9) sems-boost/1.63.0/base
10) sems-zlib/1.2.8/base
11) sems-hdf5/1.8.12/parallel
12) sems-netcdf/4.4.1/exo_parallel
13) sems-parmetis/4.0.3/parallel
14) sems-scotch/6.0.3/nopthread_64bit_parallel
15) sems-superlu/4.3/base
$ which cmake
/projects/sems/install/rhel6-x86_64/atdm/binary-install/cmake-3.11.1-Linux-x86_64/bin/cmake
$ rm -r CMake*
$ time cmake \
-C <trilinos-dir>/cmake/std/GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_ShyLU_DD=ON \
<trilinos-dir> \
&> configure.out
real 0m22.379s
user 0m13.932s
sys 0m5.872s
$ time make -j16 &> make.out
real 34m48.506s
user 310m18.610s
sys 19m41.674s
$ time ctest -j16 &> ctest.out
real 0m4.584s
user 0m17.113s
sys 0m4.140s
```
This produced the test results:
```
$ grep -A 100 "tests failed out of" ctest.out
40% tests passed, 3 tests failed out of 5
Label Time Summary:
ShyLU_DD = 14.19 sec (5 tests)
Total Test time (real) = 4.56 sec
The following tests FAILED:
1 - ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_gdsw_MPI_4 (Failed)
2 - ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_rgdsw_MPI_4 (Failed)
5 - ShyLU_DDFROSch_test_frosch_interfacesets_2D_MPI_4 (Failed)
Errors while running CTest
```
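For scripting this kind of check, the failed-test list can be pulled out of `ctest.out` programmatically. The following is a minimal Python sketch, assuming the summary format shown in the excerpt above; `parse_failed_tests` is a hypothetical helper name, not part of CTest or Trilinos.

```python
import re

# Matches lines such as "1 - Some_Test_Name (Failed)" in the CTest summary.
FAILED_LINE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\((.+)\)\s*$")


def parse_failed_tests(ctest_output):
    """Return (number, name, status) tuples from the 'tests FAILED' section."""
    failed = []
    in_section = False
    for line in ctest_output.splitlines():
        if "The following tests FAILED:" in line:
            in_section = True
            continue
        # Skip everything before the section, and blank lines inside it.
        if not in_section or not line.strip():
            continue
        match = FAILED_LINE.match(line)
        if match is None:
            break  # the first non-matching line ends the section
        failed.append((int(match.group(1)), match.group(2), match.group(3)))
    return failed


# Sample text copied from the ctest.out excerpt in this issue.
SAMPLE = """\
40% tests passed, 3 tests failed out of 5
Label Time Summary:
ShyLU_DD = 14.19 sec (5 tests)
Total Test time (real) = 4.56 sec
The following tests FAILED:
1 - ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_gdsw_MPI_4 (Failed)
2 - ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_rgdsw_MPI_4 (Failed)
5 - ShyLU_DDFROSch_test_frosch_interfacesets_2D_MPI_4 (Failed)
Errors while running CTest
"""

for number, name, status in parse_failed_tests(SAMPLE):
    print(number, name, status)
```

The parser stops at the first line after the header that does not look like a test entry, so trailing text such as `Errors while running CTest` is ignored.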
The output from these failing tests seems to show the same throw and terminate:
```
terminate called after throwing an instance of 'Xpetra::Exceptions::RuntimeError'
what(): /scratch/rabartl/Trilinos.base/Trilinos/packages/xpetra/src/CrsMatrix/Xpetra_EpetraCrsMatrix.hpp:222:
Throw number = 1
Throw test that evaluated to true: true
Xpetra::EpetraCrsMatrix only available for GO=int or GO=long long with EpetraNode (Serial or OpenMP depending on configuration)
```
## Related Issues
* Blocking Issues: #2462
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2524
TEUCHOS_UNREACHABLE_RETURN(Teuchos::null) still throws a warning in cuda.
2018-11-30T11:16:52Z
James Willenbring
*Created by: bathmatt*
Even with this call at the end of a function I get warnings in cuda. Not sure on the magic that is needed but ...
```
/home/mbetten/Trilinos/EMPIRE/src/PIC/InjectionBCs.hpp(289): warning: missing return statement at end of non-void function "empire::pic::createInjectionBC(const empire::pic::ElementalParticleData<MESH_TRAITS> &, const Teuchos::RCP<panzer::UniqueGlobalIndexerBase> &, const panzer_stk::STK_Interface &, const std::__cxx11::string &, const Teuchos::ParameterList &) [with MESH_TRAITS=empire::TetSecondOrder]"
```
but this is the end of that function
```
TEUCHOS_UNREACHABLE_RETURN(Teuchos::null);
}
```
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2455
Address timing out test Anasazi_Epetra_BlockDavidson_auxtest_MPI_4 in ATDM builds of Trilinos
2018-11-30T11:16:52Z
James Willenbring
*Created by: bartlettroscoe*
**CC:** @trilinos/anasazi
## Next Action Status
Tests `Anasazi_Epetra_BlockDavidson_auxtest_MPI_4` and `Anasazi_Epetra_LOBPCG_auxtest_MPI_4` were disabled in several builds in commits 8f23641 and c66a268 and did not time out in any builds on 3/27/2018 (see [below](https://github.com/trilinos/Trilinos/issues/2455#issuecomment-376619629)).
## Description
This Story is to address the test `Anasazi_Epetra_BlockDavidson_auxtest_MPI_4`, which times out in several builds, as shown in yesterday's results on CDash at:
* https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-03-25&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=testname&compare1=61&value1=Anasazi_Epetra_BlockDavidson_auxtest_MPI_4&field2=status&compare2=62&value2=passed&field3=status&compare3=62&value3=notrun
This shows the test timing out at 10 minutes on the builds:
* `Trilinos-atdm-hansen-shiller-cuda-debug`
* `Trilinos-atdm-hansen-shiller-cuda-opt`
* `Trilinos-atdm-hansen-shiller-gnu-debug-serial`
* `Trilinos-atdm-hansen-shiller-gnu-opt-serial`
and failing in the builds:
* `Trilinos-atdm-white-ride-cuda-opt`
* `Trilinos-atdm-white-ride-gnu-opt-openmp`
These failures show segfaults and are likely due to the compiler defect reported in #1208; many Anasazi and Belos tests segfault because of this defect, as shown in #2454.
Therefore, this Story will only consider the timing-out tests, not the failing tests in the 'opt' builds on 'white' and 'ride' (since that is being covered in #2454).
Initial cleanup of new ATDM builds of Trilinos
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2446
Address expensive Panzer tests that timeout at 10 minutes in ATDM builds
2018-11-30T11:16:52Z
James Willenbring
*Created by: bartlettroscoe*
**CC:** @trilinos/panzer, @bathmatt, @fryeguy52
## Next Action Status
Pushed commits 245e01d and d852fa3 to 'develop' to address the timeouts, which removed the timing-out tests on 3/25/2018. Addressing memory issues and re-enabling these tests will be done in other follow-on issues.
## Description
This story is to analyze and then address some expensive Panzer tests that routinely time out in the ATDM Trilinos builds, as shown, for example, in the following query, which lists all of the timing-out tests over the last week:
* https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-03-21&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=7&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=65&value2=Panzer&field3=status&compare3=62&value3=passed&field4=status&compare4=62&value4=notrun&field5=buildstarttime&compare5=84&value5=2018-03-23&field6=buildstarttime&compare6=83&value6=2018-03-16&field7=details&compare7=63&value7=timeout
This query shows the following 6 timing out tests:
* `PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-4`
* `PanzerAdaptersSTK_main_driver_energy-ss-loca-eigenvalue`
* `PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-2`
* `PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3`
* `PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1`
* `PanzerAdaptersSTK_PoissonInterfaceExample_2d_MPI_4`
which include the builds:
* `Trilinos-atdm-hansen-shiller-cuda-debug`
* `Trilinos-atdm-hansen-shiller-cuda-opt`
* `Trilinos-atdm-hansen-shiller-intel-debug-serial`
* `Trilinos-atdm-white-ride-cuda-debug`
* `Trilinos-atdm-white-ride-cuda-opt`
* `Trilinos-atdm-white-ride-gnu-debug-openmp`
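The CDash query linked above encodes its filters as numbered `fieldN`/`compareN`/`valueN` triples plus a `filtercount`. As a rough illustration only, the Python sketch below rebuilds a URL of that shape. The parameter layout is inferred solely from the URL in this issue, the numeric compare codes are copied verbatim without asserting what they mean, and `build_query_url` is a hypothetical helper, not a CDash API. The sketch emits `filtercombine=and` once; whether CDash needs the repeats seen in the original URL is not verified here.

```python
from urllib.parse import urlencode

# Base URL taken from the query link in this issue.
CDASH_QUERY_TESTS = "https://testing-vm.sandia.gov/cdash/queryTests.php"


def build_query_url(project, date, filters, base=CDASH_QUERY_TESTS):
    """Build a queryTests.php URL from (field, compare_code, value) triples."""
    params = [
        ("project", project),
        ("date", date),
        ("filtercombine", "and"),
        ("filtercount", len(filters)),
        ("showfilters", 1),
    ]
    # Filters are numbered from 1, matching field1/compare1/value1 in the URL.
    for index, (field, compare, value) in enumerate(filters, start=1):
        params.append((f"field{index}", field))
        params.append((f"compare{index}", compare))
        params.append((f"value{index}", value))
    return base + "?" + urlencode(params)


# Filter triples copied from the Panzer timeout query in this issue.
PANZER_TIMEOUT_FILTERS = [
    ("buildname", 65, "Trilinos-atdm-"),
    ("testname", 65, "Panzer"),
    ("status", 62, "passed"),
    ("status", 62, "notrun"),
    ("buildstarttime", 84, "2018-03-23"),
    ("buildstarttime", 83, "2018-03-16"),
    ("details", 63, "timeout"),
]

url = build_query_url("Trilinos", "2018-03-21", PANZER_TIMEOUT_FILTERS)
print(url)
```

Keeping the filters in a list makes it easier to swap the test-name prefix or date window when the same query is repeated for another package.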
As was discovered in https://github.com/trilinos/Trilinos/issues/2318#issuecomment-375494367, many of these tests will actually complete if you increase the timeouts. In particular, for the CUDA builds on hansen/shiller the following set of 5 tests all passed once the timeouts were increased to over 40 minutes for those CUDA builds:
* `PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-4`
* `PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-2`
* `PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3`
* `PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1`
* `PanzerAdaptersSTK_PoissonInterfaceExample_2d_MPI_4`
The only test missing from the above list for CUDA builds on hansen/shiller was `PanzerAdaptersSTK_main_driver_energy-ss-loca-eigenvalue` and that test only timed out on the `Trilinos-atdm-white-ride-cuda-opt` build.
This Issue will be to investigate these tests some more and then decide how to address them.
## Tasks:
0. Inspect the timing out tests in the last week on all builds of Trilinos ... All can be addressed by increasing timeouts and one disable (see [below](https://github.com/trilinos/Trilinos/issues/2446#issuecomment-375730569)) **[DONE]**
1. Increase timeouts on all of the timing out Panzer tests in the last week to 45 minutes and set `CATEGORIES NIGHTLY` ...
2. See if these tests pass with longer timeouts in automated builds and see what their runtimes are when they are displayed on CDash ...
3. Decrease the timeouts for some of the tests that are not taking 45 minutes to complete ...
5. ???
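One possible way to approach task 3 is to derive each test's timeout from its observed runtime once those runtimes appear on CDash. The Python sketch below is only an illustration under stated assumptions: the 10-minute floor and 45-minute cap come from this issue, while the 1.5x safety margin and the helper name `suggest_timeout` are hypothetical choices, not project policy.

```python
import math

DEFAULT_TIMEOUT_S = 10 * 60  # the 10-minute limit the tests were hitting
RAISED_TIMEOUT_S = 45 * 60   # the 45-minute limit proposed in task 1


def suggest_timeout(runtime_s, margin=1.5):
    """Return a timeout in seconds for a test with the given observed runtime.

    The runtime is padded by `margin` (an assumed safety factor), then clamped
    between the default 10-minute limit and the raised 45-minute limit.
    """
    padded = math.ceil(runtime_s * margin)
    return max(DEFAULT_TIMEOUT_S, min(RAISED_TIMEOUT_S, padded))


# Usage with placeholder runtimes in seconds (not measured values).
for runtime in (100, 1000, 2400):
    print(runtime, suggest_timeout(runtime))
```

A test that already finishes well under 10 minutes keeps the default, and nothing is ever raised past the 45-minute cap.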
## Related Issues
* Related to #2318
Initial cleanup of new ATDM builds of Trilinos