# Trilinos issues
https://gitlab.osti.gov/jmwille/Trilinos/-/issues

# Trilinos auto PR tester stability issues
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3276 | 2019-05-02T13:20:11Z | James Willenbring

*Created by: bartlettroscoe*
@trilinos/framework
## Description
Over the last few weeks and months, the Trilinos auto PR tester has seen several cases where one or more PR builds for a given PR testing iteration failed to produce results on CDash or showed build or test failures that were not related to the changes on that particular PR.
This Story is to log these failures and track them, providing some statistics about these cases in order to inform how to address them. This should replace making comments in individual PRs that exhibit these types of problems, like #3260 and #3213.
## PR Builds Showing Random Failures
Below are a few examples of the stability problems (but are not all of the problems).
| PR ID | Num PR builds to reach passing | First test trigger | Start first test | Passing test | Merge PR |
| --: | --: | --: | --: | --: | --: |
| #3258 | 2 | [8/8/2018 2:35 PM ET](https://github.com/trilinos/Trilinos/pull/3258#issue-207098955) | [8/8/2018 2:44 PM](https://github.com/trilinos/Trilinos/pull/3258#issuecomment-411510956) | 8/8/2018 9:15 PM ET | Not merged |
| #3260 | 4 | [8/8/2018 5:22 PM ET](https://github.com/trilinos/Trilinos/pull/3260#issue-207141537) | [8/8/2018 6:31 PM ET](https://github.com/trilinos/Trilinos/pull/3260#issuecomment-411574370) | [8/10/2018 4:13 AM ET](https://github.com/trilinos/Trilinos/pull/3260#issuecomment-412010497) | [8/10/2018 8:25 AM](https://github.com/trilinos/Trilinos/pull/3260#event-1782381644) |
| #3213 | 3 | [7/31/2018 4:30 PM ET](https://github.com/trilinos/Trilinos/pull/3213#issue-205233060) | [7/31/2018 4:57 PM ET](https://github.com/trilinos/Trilinos/pull/3213#issuecomment-409365522) | [8/1/2018 9:48 AM ET](https://github.com/trilinos/Trilinos/pull/3213#issuecomment-409580677) | [8/1/2018 9:53 AM ET](https://github.com/trilinos/Trilinos/pull/3213#event-1765281809) |
| #3098 | 4 | [7/12/2018 12:52 PM ET](https://github.com/trilinos/Trilinos/pull/3098#issue-201063953) | [7/12/2018 1:07 PM ET](https://github.com/trilinos/Trilinos/pull/3098#issuecomment-404582631) | [7/13/2018 11:12 PM ET](https://github.com/trilinos/Trilinos/pull/3098#issuecomment-404994581) | [7/14/2018 10:59 PM ET](https://github.com/trilinos/Trilinos/pull/3098#event-1733896640) |
| #3369 | 6 | [8/29/2018 9:08 AM ET](https://github.com/trilinos/Trilinos/pull/3369#issue-211746901) | [8/29/2018 9:16 AM ET](https://github.com/trilinos/Trilinos/pull/3369#issuecomment-416948915) | [8/31/2018 6:09 AM ET](https://github.com/trilinos/Trilinos/pull/3369#issuecomment-417618824) | [8/31/2018 8:33 AM ET](https://github.com/trilinos/Trilinos/pull/3369#event-1820478271) |
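The turnaround shown in the table above can be quantified with a short script. This is only a sketch, with the timestamps transcribed by hand from the table rows (Eastern time, links omitted):

```python
from datetime import datetime

def parse_et(stamp):
    """Parse a table timestamp like '8/8/2018 2:35 PM' (Eastern time)."""
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M %p")

# (PR, first test trigger, passing test) transcribed from the table above.
rows = [
    ("#3258", "8/8/2018 2:35 PM", "8/8/2018 9:15 PM"),
    ("#3260", "8/8/2018 5:22 PM", "8/10/2018 4:13 AM"),
    ("#3213", "7/31/2018 4:30 PM", "8/1/2018 9:48 AM"),
    ("#3098", "7/12/2018 12:52 PM", "7/13/2018 11:12 PM"),
    ("#3369", "8/29/2018 9:08 AM", "8/31/2018 6:09 AM"),
]

for pr, trigger, passing in rows:
    hours = (parse_et(passing) - parse_et(trigger)).total_seconds() / 3600
    print(f"{pr}: {hours:.1f} h from first test trigger to passing")
```

Even before the merge step, several of these PRs needed well over a full day of wall-clock time just to reach a passing PR test.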
*Improve productivity, stability, and quality of Trilinos*

# Strange KokkosCore_UnitTest_Serial_MPI_1 test failure in build Trilinos-atdm-hansen-shiller-intel-debug-serial on 'hansen'
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3233 | 2019-03-01T20:02:39Z | James Willenbring

*Created by: bartlettroscoe*
@trilinos/kokkos, @kddevin (Trilinos Data Services Product Lead)
## Next Action Status
Test was disabled in PR #3455, merged on 9/17/2018, which pulled in the commit
b8e53a5 and the test was missing in this build `Trilinos-atdm-hansen-shiller-intel-debug-serial` on 9/18/2018. Next: Revert the commit b8e53a5 on the next Kokkos snapshot ...
Fix merged to kokkos 'develop' branch in kokkos/kokkos#1729 on 8/8/2018. Next: Disable this test in the `Trilinos-atdm-hansen-shiller-intel-debug-serial` build while waiting for next Kokkos snapshot to Trilinos ...
## Description
The test `KokkosCore_UnitTest_Serial_MPI_1` failed in the ATDM Trilinos build `Trilinos-atdm-hansen-shiller-intel-debug-serial` on 8/3/2018, as shown [here](https://testing-vm.sandia.gov/cdash/testDetails.php?test=51601654&build=3793146), with the strange error:
```
[ RUN ] serial.cxx11
CXX11 ( test = 'ReduceTest TeamPolicy' FAILED : 2082.71 != 2082.71
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-intel-debug-serial/SRC_AND_BUILD/Trilinos/packages/kokkos/core/unit_test/TestCXX11.hpp:354: Failure
Value of: ( TestCXX11::Test< TEST_EXECSPACE >( 4 ) )
Actual: false
Expected: true
[ FAILED ] serial.cxx11 (1 ms)
```
Looking at [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-08-03&filtercombine=and&filtercombine=and&filtercount=4&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=61&value2=KokkosCore_UnitTest_Serial_MPI_1&field3=groupname&compare3=61&value3=ATDM&field4=buildstarttime&compare4=83&value4=2018-07-01), this is the only time this test has failed in any of the promoted ATDM builds of Trilinos since 7/1/2018. Therefore, this is either a new test failure or it is a randomly failing test.
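To make "new failure vs. randomly failing" concrete: under some hypothetical per-run flake rate, a streak of clean runs eventually makes random failure an unlikely explanation. A minimal sketch (the flake rates here are assumptions, not measurements of this test):

```python
# If the test is flaky with a hypothetical per-run failure probability p,
# the chance of seeing zero failures in n consecutive runs is (1-p)**n.
# This gives a rough feel for how many clean nightly runs are needed
# before "randomly failing" becomes an unlikely explanation.

def prob_all_pass(p, n):
    """Probability of n consecutive passes for a test that fails with prob p."""
    return (1.0 - p) ** n

for p in (0.01, 0.03, 0.10):
    # Smallest n at which an all-clean streak is < 5% likely under rate p.
    n = 1
    while prob_all_pass(p, n) >= 0.05:
        n += 1
    print(f"flake rate {p:.0%}: {n} clean runs make a flake unlikely (<5%)")
```

In other words, a single failure followed by clean runs cannot be classified quickly; only weeks of nightly history (as in the query above) settle the question.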
## Steps to Reproduce
Enable the `Kokkos` package with the build `intel-debug-serial` on 'hansen' or 'shiller' as described at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#shillerhansen
*Keep promoted "ATDM" builds of Trilinos clean*

# Pull request testing should set -Werror
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3178 | 2019-05-02T22:07:04Z | James Willenbring

*Created by: mhoemmen*
@vbrunini asks whether pull request testing could set `-Werror`, so as to avoid issues like #3177.
@trilinos/framework @khpierson
## Expectations
Trilinos -- at least the library, not necessarily tests and examples -- should build warning-free.
## Current Behavior
See #3177. There is an issue that it's impossible to fix warnings in some packages.
## Motivation and Context
Sierra builds with warnings as errors, so they want Trilinos to build warning-free.
## Possible Solution
Exclude legacy packages like ML. Fix warnings. Add `-Werror` to at least one PR build.
## Related Issues
* Related to #3177

*Improve productivity, stability, and quality of Trilinos*

# Random timeouts in auto PR builds
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3105 | 2018-07-13T14:33:40Z | James Willenbring

*Created by: bartlettroscoe*
@trilinos/framework
## Expectations
Tests will not fail or timeout unless changes in a PR branch cause the failures or timeouts.
## Current Behavior
Several auto PR build iterations have been failing for a while due to random timeouts in various auto PR builds. For example, the first PR testing iteration in my PR #3104 failed last night due to random timeouts as shown [here](https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=63&value1=PR-3104-test&field2=buildstarttime&compare2=84&value2=NOW). It is impossible for that one change to have impacted these test timeouts in any way.
You can see timeouts impacting other PR testing iterations as well in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-07-12&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildstarttime&compare1=83&value1=2018-07-01&field2=groupname&compare2=61&value2=Pull%20Request&field3=details&compare3=63&value3=timeout) covering just the last 12 days. It shows 12 tests timing out in 3 different PRs over that period. Looking at the numbering of the builds, it seems likely that up to 4 or 5 PR testing iterations failed due to this issue. (It is hard to know exactly how many PR testing iterations failed due to these random timeouts, because the naming of the PR builds makes it hard to differentiate different PR testing iterations within the same PR, or to match up which builds from the different compilers belong to the same PR testing iteration.) Also, it is possible that some of these timeouts were due to changes in the PR branch.
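A sketch of the kind of build-name bookkeeping that would make failed iterations countable. The `PR-<num>-test-<config>-<iter>` naming used here is hypothetical (the actual CDash build names do not carry a usable iteration counter, which is exactly the difficulty described above):

```python
import re
from collections import defaultdict

# Hypothetical build names of the form "PR-<pr>-test-<config>-<iter>".
# If CDash build names carried a per-iteration counter like this, each PR
# testing iteration could be judged (pass/fail) as a single unit.
BUILD_RE = re.compile(r"PR-(?P<pr>\d+)-test-(?P<config>.+)-(?P<iter>\d+)$")

def group_iterations(build_names):
    """Map (pr, iteration) -> list of build configs in that iteration."""
    groups = defaultdict(list)
    for name in build_names:
        m = BUILD_RE.match(name)
        if m:
            groups[(m.group("pr"), m.group("iter"))].append(m.group("config"))
    return dict(groups)

builds = [
    "PR-3104-test-gcc-1",
    "PR-3104-test-intel-1",
    "PR-3104-test-gcc-2",
]
print(group_iterations(builds))
```

With such a counter, "how many iterations failed" would be a simple group-by instead of guesswork from build numbering.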
## Motivation and Context
We want PR testing iterations to only fail if they are triggered by changes in the specific topic branch being tested.
While 4 or 5 failed PR testing iterations over 12 days may not seem like a lot, when combined with randomly failing tests (see #3103) and random failures in the Jenkins jobs (git pull failures, etc.), these add up to make the auto PR testing pretty unstable.
## Definition of Done
* Eliminate randomly failing tests.
## Possible Solution
The likely cause is that the Jenkins build farm machines are being overloaded. The very setup of the Jenkins site allows for this to occur because jobs can use more cores (i.e. "executors") than they declare and therefore overload the machine. Many of these timeouts occur late at night or in the early morning when the Jenkins build farm machines are likely to be processing nightly jobs.
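A toy illustration of the overload hypothesis, with made-up numbers (not measured from the actual build farm): a node can look underloaded by declared executors while being heavily oversubscribed in the cores actually used.

```python
# Sketch of the overload hypothesis: jobs can use more cores than the
# executors they declare, so total demand can exceed the machine even
# when the declared load looks fine. All numbers are illustrative.

def oversubscription(ncores, jobs):
    """jobs: list of (declared_executors, actual_cores_used) pairs.
    Returns (declared_load, actual_load) as multiples of the node size."""
    declared = sum(d for d, _ in jobs)
    actual = sum(a for _, a in jobs)
    return declared / ncores, actual / ncores

# A 16-core node: three jobs each declare 4 executors but build with -j16.
declared_load, actual_load = oversubscription(16, [(4, 16)] * 3)
print(f"declared load {declared_load:.2f}x, actual load {actual_load:.2f}x")
```

A 0.75x declared load hides a 3x actual load, which is exactly the regime where test wall-clock times blow past their timeouts.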
## Steps to Reproduce
Hard to reproduce because these are random failures. I wish we had load statistics for these Jenkins build farm machines so that we could correlate the load on the machines with timeouts like this.
*Improve productivity, stability, and quality of Trilinos*

# Belos_rcg_hb_MPI_4 timing out in several ATDM Trilinos builds on 'hansen' since 5/29/2018
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2919 | 2018-11-30T11:16:53Z | James Willenbring

*Created by: bartlettroscoe*
CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solves Project Lead)
## Next Action Status
Test was disabled in these builds on 'hansen' in the commit 8850c64 pushed on 6/12/2018 and was shown to be disabled in the builds on CDash 6/13/2018
## Description
As shown in [this large query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=19&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=62&value2=Trilinos-atdm-mutrino-intel-debug-openmp&field3=buildname&compare3=62&value3=Trilinos-atdm-mutrino-intel-opt-openmp&field4=buildname&compare4=62&value4=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field5=buildname&compare5=62&value5=Trilinos-atdm-serrano-intel-debug-openmp&field6=buildname&compare6=62&value6=Trilinos-atdm-serrano-intel-opt-openmp&field7=buildname&compare7=62&value7=Trilinos-atdm-chama-intel-opt-openmp&field8=buildname&compare8=62&value8=Trilinos-atdm-chama-intel-debug-openmp-panzer&field9=buildname&compare9=62&value9=Trilinos-atdm-chama-intel-debug-openmp&field10=buildname&compare10=62&value10=Trilinos-atdm-chama-intel-opt-openmp-panzer&field11=site&compare11=62&value11=ride&field12=testname&compare12=61&value12=Belos_rcg_hb_MPI_4&field13=buildstarttime&compare13=84&value13=2018-06-08&field14=buildstarttime&compare14=83&value14=2018-05-10&field15=buildname&compare15=62&value15=Trilinos-atdm-white-ride-cuda-opt&field16=buildname&compare16=62&value16=Trilinos-atdm-white-ride-gnu-opt-openmp&field17=site&compare17=62&value17=serrano&field18=site&compare18=62&value18=shiller&field19=buildname&compare19=62&value19=Trilinos-atdm-white-ride-cuda-debug-all-at-once) the test `Belos_rcg_hb_MPI_4` looks to be consistently timing out in the builds:
* Trilinos-atdm-hansen-shiller-cuda-8.0-debug
* Trilinos-atdm-hansen-shiller-cuda-8.0-opt
* Trilinos-atdm-hansen-shiller-cuda-9.0-debug
* Trilinos-atdm-hansen-shiller-cuda-9.0-opt
* Trilinos-atdm-hansen-shiller-gnu-debug-serial
* Trilinos-atdm-hansen-shiller-gnu-opt-serial
all on 'hansen', starting on 5/29/2018 or 5/30/2018. (Since these builds pull directly from the 'develop' branch, they may be testing different versions of Trilinos on the same day; also, these are UTC times, so builds stamped on different UTC days may fall on the same testing day in Mountain time.)
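The UTC-vs-Mountain-time ambiguity can be made concrete with a small sketch using Python's `zoneinfo` (the build stamps here are illustrative, not taken from CDash):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

UTC = ZoneInfo("UTC")
MT = ZoneInfo("America/Denver")

def mountain_day(utc_stamp):
    """Map a UTC build-start time to its Mountain-time calendar day."""
    return utc_stamp.replace(tzinfo=UTC).astimezone(MT).date().isoformat()

# Two builds stamped on different UTC days can land on the same Mountain
# testing day, which is why the first timeouts show up as "5/29 or 5/30".
print(mountain_day(datetime(2018, 5, 30, 4, 0)))   # late evening of 5/29 MT
print(mountain_day(datetime(2018, 5, 29, 23, 0)))  # afternoon of 5/29 MT
```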
That same query shows that that test has been consistently passing in every other promoted build on every other ATDM Trilinos testing machine.
What that query also shows is that, in those same builds that are now timing out, the test was taking upwards of 6+ minutes to complete before it started timing out at 10 minutes on 5/29/2018 or 5/30/2018, as shown in the last non-timing-out builds:
* Trilinos-atdm-hansen-shiller-cuda-8.0-debug: 6m 26s 280ms
* Trilinos-atdm-hansen-shiller-cuda-8.0-opt: 6m 25s 680ms
* Trilinos-atdm-hansen-shiller-cuda-9.0-debug: 6m 22s 810ms
* Trilinos-atdm-hansen-shiller-cuda-9.0-opt: 6m 22s 440ms
* Trilinos-atdm-hansen-shiller-gnu-debug-serial: 6m 13s 150ms
* Trilinos-atdm-hansen-shiller-gnu-opt-serial: 5m 58s 960ms
But in the other builds, which are not showing any timeouts, that test completes very fast (in under 30 seconds in nearly every case). Some of the recent test times shown in that query for the various builds that don't have timeouts now are:
* Trilinos-atdm-hansen-shiller-gnu-debug-openmp: 23s 850ms
* Trilinos-atdm-hansen-shiller-gnu-opt-openmp: 8s 650ms
* Trilinos-atdm-hansen-shiller-intel-debug-openmp: 7s 720ms
* Trilinos-atdm-hansen-shiller-intel-debug-serial: 7s 950ms
* Trilinos-atdm-hansen-shiller-intel-opt-openmp: 6s 150ms
* Trilinos-atdm-hansen-shiller-intel-opt-serial: 5s 910ms
* Trilinos-atdm-rhel6-gnu-debug-openmp: 6s 840ms
* Trilinos-atdm-rhel6-gnu-debug-serial: 5s 340ms
* Trilinos-atdm-rhel6-gnu-opt-openmp: 5s 180ms
* Trilinos-atdm-rhel6-gnu-opt-serial: 4s 250ms
* Trilinos-atdm-rhel6-intel-opt-openmp: 3s 740ms
* Trilinos-atdm-sems-gcc-7-2-0: 5s 290ms
* Trilinos-atdm-white-ride-cuda-debug: 9s 430ms
* Trilinos-atdm-white-ride-gnu-debug-openmp: 9s 90ms
So this seems pretty crazy. How can the same test take over 6 minutes to complete in the CUDA 8.0 and 9.0 optimized builds on 'hansen' and only take 9 seconds in a CUDA debug build on 'white'? And this test takes a very long time (and is now timing out) in the `gnu-debug-serial` and `gnu-opt-serial` builds on 'hansen' as well, but is fast in the `intel-debug-serial` and `intel-opt-serial` builds on the same machine. How can that be the case?
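The spread is easier to see with the `XmYsZms` strings CDash reports normalized to seconds; a small sketch, using two of the durations listed above:

```python
import re

def to_seconds(dur):
    """Convert a CDash-style duration like '6m 13s 150ms' to seconds."""
    units = {"m": 60.0, "s": 1.0, "ms": 0.001}
    total = 0.0
    for value, unit in re.findall(r"(\d+)(ms|m|s)", dur):
        total += int(value) * units[unit]
    return total

slow = to_seconds("6m 13s 150ms")   # gnu-debug-serial on 'hansen' (above)
fast = to_seconds("7s 950ms")       # intel-debug-serial on 'hansen' (above)
print(f"{slow:.1f}s vs {fast:.1f}s -> roughly {slow / fast:.0f}x slower")
```

A roughly 47x spread between two serial builds on the same machine is far beyond normal compiler-to-compiler variation.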
To try to get more insight into this test, we can compare the test output from a case where it takes a long time to run (and is currently timing out) with the test output from a case that completes very quickly.
First, let's look at the last time this test passed for the `Trilinos-atdm-hansen-shiller-gnu-debug-serial` build on 'hansen', which took 6m 13s 150ms to complete and pass on 2018-05-29T06:41:19 UTC, with output shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=47454651&build=3555977
which shows:
```
Passed.......OR Combination ->
OK...........Number of Iterations = 2206 < 4000
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.56537e-07 < 1e-06
residual [ 1 ] = 9.4486e-07 < 1e-06
residual [ 2 ] = 9.24543e-07 < 1e-06
residual [ 3 ] = 9.44363e-07 < 1e-06
residual [ 4 ] = 9.64382e-07 < 1e-06
residual [ 5 ] = 9.14533e-07 < 1e-06
residual [ 6 ] = 9.50517e-07 < 1e-06
residual [ 7 ] = 8.31671e-07 < 1e-06
residual [ 8 ] = 9.59686e-07 < 1e-06
residual [ 9 ] = 9.74218e-07 < 1e-06
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 1.489 (2.114e+04) 1.582 (2.114e+04) 1.668 (2.114e+04) 7.483e-05 (2.114e+04)
Belos: Operation Prec*x 0 (0) 0 (0) 0 (0) 0 (0)
Belos: RCGSolMgr total solve time 365.4 (1) 365.4 (1) 365.4 (1) 365.4 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 1.45 (2.114e+04) 1.542 (2.114e+04) 1.629 (2.114e+04) 7.295e-05 (2.114e+04)
==================================================================================================================================
```
And let's compare this to the test output for the build `Trilinos-atdm-hansen-shiller-intel-debug-serial` on 'hansen' which took 6s 740ms to complete and pass on 2018-05-29T14:52:35 UTC shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=47482010&build=3557186
which shows:
```
Passed.......OR Combination ->
OK...........Number of Iterations = 2131 < 4000
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.5909e-07 < 1e-06
residual [ 1 ] = 9.65321e-07 < 1e-06
residual [ 2 ] = 8.59334e-07 < 1e-06
residual [ 3 ] = 9.55053e-07 < 1e-06
residual [ 4 ] = 9.97094e-07 < 1e-06
residual [ 5 ] = 7.53902e-07 < 1e-06
residual [ 6 ] = 8.46489e-07 < 1e-06
residual [ 7 ] = 9.64082e-07 < 1e-06
residual [ 8 ] = 9.92318e-07 < 1e-06
residual [ 9 ] = 9.92263e-07 < 1e-06
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 2.026 (2.109e+04) 2.179 (2.109e+04) 2.403 (2.109e+04) 0.0001033 (2.109e+04)
Belos: Operation Prec*x 0 (0) 0 (0) 0 (0) 0 (0)
Belos: RCGSolMgr total solve time 5.945 (1) 5.946 (1) 5.946 (1) 5.946 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 1.975 (2.109e+04) 2.116 (2.109e+04) 2.316 (2.109e+04) 0.0001003 (2.109e+04)
==================================================================================================================================
```
The times for the individual operations are not that different, but the "Belos: RCGSolMgr total solve time" of 365.4s vs. 5.946s is the real problem. The final results show that the test is doing different computations in these two builds, but the total number of operations is not radically different (e.g., 2.114e+04 vs. 2.109e+04 mat-vecs). So what is going on here to cause the huge increase in wall clock time for a serial Kokkos threading test?
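The timer data above can at least bound where the time is going. A quick sketch, with the numbers transcribed from the two TimeMonitor tables (MeanOverProcs column):

```python
# Timer data transcribed from the two runs above (MeanOverProcs, seconds).
slow_total, slow_matvec = 365.4, 1.582   # gnu-debug-serial: solve / Op*x
fast_total, fast_matvec = 5.946, 2.179   # intel-debug-serial: solve / Op*x

# Fraction of the total solve time NOT covered by the timed mat-vec.
print(f"slow run: {100 * (1 - slow_matvec / slow_total):.1f}% of solve untimed")
print(f"fast run: {100 * (1 - fast_matvec / fast_total):.1f}% of solve untimed")
```

In the slow run, over 99% of the reported solve time falls outside the timed `Op*x` operation, so whatever is slow is in untimed solver code (or the system underneath it), not in the mat-vecs themselves.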
Looking at the new commits pulled in when this started to fail for the build `Trilinos-atdm-hansen-shiller-gnu-opt-serial` on 2018-05-29 14:05:09 shown at:
* https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3560199#!#note0
it is hard to tell what might have caused these tests to start timing out. I would guess that the most likely trigger was:
```
c840658: Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:12:42 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/shiller/environment.sh
```
That will increase the number of tests running on the machine and could result in single tests taking longer to run.
But the fact that the same test takes 6 minutes with GCC but only 7 seconds with Intel is, in my opinion, a major problem that has to be investigated.
Someone is going to need to add some more timers to account for where the time is going.
## Steps to reproduce
One should be able to follow the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
to reproduce this behavior on 'hansen' or 'shiller'. To avoid needing to run on a compute node, one could use the `gnu-debug-serial` build and do:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-serial
$ cmake \
    -GNinja \
    -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
    -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
    $TRILINOS_DIR
$ make NP=16
$ ctest -VV -R Belos_rcg_hb_MPI_4
```
*Keep promoted "ATDM" builds of Trilinos clean*

# Test Stratimikos_test_aztecoo_thyra_driver_MPI_1 timing out in Trilinos-atdm-hansen-shiller-gnu-debug-serial build since 5/30/2018
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2925 | 2018-11-30T11:16:53Z | James Willenbring

*Created by: bartlettroscoe*
CC: @trilinos/stratimikos, @fryeguy52
## Next Action Status
Test was disabled for these two builds on 'hansen' in commit 73ae19c pushed on 6/12/2018 and this test disappeared in these builds on 6/13/2018.
## Description
As shown in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-11&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=11&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=status&compare2=62&value2=passed&field3=status&compare3=62&value3=notrun&field4=buildname&compare4=62&value4=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field5=site&compare5=62&value5=mutrino&field6=site&compare6=62&value6=serrano&field7=site&compare7=62&value7=chama&field8=site&compare8=62&value8=ride&field9=buildstarttime&compare9=84&value9=2018-06-11&field10=buildstarttime&compare10=83&value10=2018-05-20&field11=testname&compare11=65&value11=Stratimikos), the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` has been timing out in the builds `Trilinos-atdm-hansen-shiller-gnu-debug-serial` and `Trilinos-atdm-hansen-shiller-gnu-opt-serial` since 5/30/2018. (That query also shows this is the only Stratimikos test that has failed in any of the promoted "ATDM" builds since 5/20/2018.)
[This query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-11&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=61&value2=Trilinos-atdm-hansen-shiller-gnu-debug-serial&field3=buildstarttime&compare3=84&value3=2018-06-11&field4=buildstarttime&compare4=83&value4=2018-05-20&field5=testname&compare5=61&value5=Stratimikos_test_aztecoo_thyra_driver_MPI_1) shows that the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` went from passing at under 21s every day to timing out at 10 minutes every day since 5/29/2018 (but it did pass once taking 9m 56s 930ms on 6/8/2018, the only time it did not time-out since 5/29/2018).
What changed from 5/29/2018 to 5/30/2018? Looking at the updates pulled in for the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial` with build stamp `20180530-0400-ATDM` shown at:
* https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3558860#!#note0
it seems like the only commits that could have impacted this were:
```
c9ccf7d: Switch from srun to salloc on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:35:16 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/README.md
c840658: Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:12:42 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/shiller/environment.sh
```
There are no other commits that I could see that could impact this AztecOO test. So it looks like moving to CMake/CTest 3.11.2 and to the all-at-once approach triggered this large increase in runtime for the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` for the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial`. This may have been a result of having more tests running while this Stratimikos test is running.
Looking at [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-10&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=61&value2=Stratimikos_test_aztecoo_thyra_driver_MPI_1), we can see that the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` timed out in the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial` yesterday, 6/10/2018, and took upwards of 2.5 to 3.5 minutes to run in the CUDA builds. Otherwise, this test did not take any longer than 22s to run in all of the other ATDM builds of Trilinos. What is also interesting is that that query showed this test passing in 4s 460ms for the build `Trilinos-atdm-hansen-shiller-intel-debug-serial`, also run on 'hansen'. How can the same test pass on an `intel-debug-serial` build in under 5 seconds but then time out at 10 minutes for a `gnu-debug-serial` build on the same hardware with the same MPI implementation and settings?
For that matter, [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-10&filtercombine=and&filtercount=1&showfilters=1&field1=testname&compare1=61&value1=Stratimikos_test_aztecoo_thyra_driver_MPI_1) shows that, other than the CUDA builds of Trilinos and the yet-to-be-cleaned-up 'mutrino' build `Trilinos-atdm-mutrino-intel-debug-openmp`, this test did not take any longer than 22s to run in any of the 46 Trilinos builds where it ran yesterday. On some platforms, this test completed in less than 2s!
This is very strange behavior for a test. There must be some type of machine or system usage issue going on here. But why would it impact a `gnu-debug-serial` build but not an `intel-debug-serial` build on the same machine?
## Steps to reproduce
Following the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#shillerhansen
one can log on to 'hansen' or 'shiller', clone Trilinos and get on to the 'develop' branch, and then do:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-openmp
$ cmake \
    -GNinja \
    -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
    -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stratimikos=ON \
    $TRILINOS_DIR
$ make NP=16
$ salloc ctest -j16
```
I did this on 'shiller' but unfortunately all of the Stratimikos tests passed:
```
100% tests passed, 0 tests failed out of 40
Subproject Time Summary:
Stratimikos = 256.50 sec*proc (40 tests)
Total Test time (real) = 20.84 sec
```
I was therefore not able to reproduce this behavior on 'shiller', which suggests this must be some type of system issue.
*Keep promoted "ATDM" builds of Trilinos clean*

# Auto PR build failures due to Intel License server problems
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2864 | 2018-06-01T18:11:28Z | James Willenbring

*Created by: bartlettroscoe*
CC: @trilinos/framework
## Expectations
Auto PR builds should be robust and not fail unless there is a failure in code itself.
## Current Behavior
The new Intel auto PR build fails randomly due to Intel license server problems such as shown at:
* https://github.com/trilinos/Trilinos/pull/2860#issuecomment-393742327
which showed the build failure:
* https://testing-vm.sandia.gov/cdash/viewBuildError.php?buildid=3564874
which showed:
```
Error: A license for Comp-CL is not available now (-15,570,115).
A connection to the license server could not be made. You should
make sure that your license daemon process is running: both an
lmgrd process and an INTEL process should be running
if your license limits you to a specified number of licenses in use
at a time. Also, check to see if the wrong port@host or the wrong
license file is being used, or if the port or hostname in the license
file has changed.
License file(s) used were (in this order):
1. Trusted Storage
** 2. /projects/sems/install/rhel6-x86_64/sems/compiler/intel/17.0.1/base/Licenses/intel-Linux-SRN.lic
** 3. /projects/sems/install/rhel6-x86_64/sems/compiler/intel/17.0.1/base/compilers_and_libraries_2017.1.132/linux/bin/intel64/../../Licenses
** 4. /ascldap/users/trilinos/Licenses
```
The reason that I noticed this was because it happened in my PR #2860.
Looking at the current PR build history at:
* https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=63&value1=-test-Trilinos_pullrequest_intel_17.0.1&field2=buildstarttime&compare2=84&value2=2018-06-02&field3=buildstarttime&compare3=83&value3=2018-05-14
out of 10 builds, it failed twice with these Intel license server problems. That is only an 80% success rate so far, which is not robust enough for an auto PR build.
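As a rough illustration of why 80% is not good enough: when a PR testing iteration requires several builds to all pass, spurious per-build failure rates compound. A sketch with assumed rates (only the 2-out-of-10 Intel figure comes from the query above):

```python
# If a single build flavor fails spuriously with probability f, a PR testing
# iteration that needs k such builds to all pass succeeds with probability
# (1 - f)**k. The 20% rate is the observed Intel figure; the 3-build
# iteration and "all builds equally flaky" assumption are illustrative.

def iteration_pass_rate(f, k):
    return (1.0 - f) ** k

print(f"{iteration_pass_rate(0.20, 1):.0%} per build -> "
      f"{iteration_pass_rate(0.20, 3):.0%} per iteration "
      f"if all 3 builds were this flaky")
```

Under those assumptions, nearly half of all PR testing iterations would fail for reasons unrelated to the PR.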
## Motivation and Context
Auto PR builds block what goes into the 'develop' branch and long delays make things harder.
## Definition of Done
Auto PR builds should only fail due to non-code issues very infrequently.
## Possible Solution
Don't know.
## Steps to Reproduce
Don't know.
## Your Environment
N/A. These are the auto PR builds, which define their own env.
*Improve productivity, stability, and quality of Trilinos*

# Address timing out test Anasazi_Epetra_BlockDavidson_auxtest_MPI_4 in ATDM builds of Trilinos
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2455 | 2018-11-30T11:16:52Z | James Willenbring

*Created by: bartlettroscoe*
**CC:** @trilinos/anasazi
## Next Action Status
Tests `Anasazi_Epetra_BlockDavidson_auxtest_MPI_4` and `Anasazi_Epetra_LOBPCG_auxtest_MPI_4` are disabled in several builds in the commits 8f23641 and c66a268 and did not timeout in any builds on 3/27/2018 (see [below](https://github.com/trilinos/Trilinos/issues/2455#issuecomment-376619629)).
## Description
This Story is to address the test `Anasazi_Epetra_BlockDavidson_auxtest_MPI_4` that times out in several builds as shown in results yesterday on CDash at:
* https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-03-25&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=testname&compare1=61&value1=Anasazi_Epetra_BlockDavidson_auxtest_MPI_4&field2=status&compare2=62&value2=passed&field3=status&compare3=62&value3=notrun
This shows the test timing out at 10 minutes on the builds:
* `Trilinos-atdm-hansen-shiller-cuda-debug`
* `Trilinos-atdm-hansen-shiller-cuda-opt`
* `Trilinos-atdm-hansen-shiller-gnu-debug-serial`
* `Trilinos-atdm-hansen-shiller-gnu-opt-serial`
and failing in the builds:
* `Trilinos-atdm-white-ride-cuda-opt`
* `Trilinos-atdm-white-ride-gnu-opt-openmp`
These failures show segfaults and are likely due to the compiler defect reported in #1208; many Anasazi and Belos tests segfault for this reason, as shown in #2454.
Therefore, this Story will only consider the timing-out tests, not the failing tests in the 'opt' builds on 'white' and 'ride' (since those are covered in #2454).
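The long CDash filter URLs used throughout these issues are tedious to assemble by hand. A minimal sketch of building one programmatically, assuming the same `fieldN`/`compareN`/`valueN` query scheme seen in the URLs above (the numeric compare codes are copied verbatim from those URLs, not documented semantics):

```python
from urllib.parse import urlencode

def cdash_query_url(base, project, date, filters):
    """Build a CDash queryTests.php URL from (field, compare, value) filter triples."""
    params = {"project": project, "date": date,
              "filtercount": len(filters), "showfilters": 1}
    for i, (field, compare, value) in enumerate(filters, start=1):
        params[f"field{i}"] = field
        params[f"compare{i}"] = compare  # numeric codes as they appear in the URLs above
        params[f"value{i}"] = value
    return base + "?" + urlencode(params)

url = cdash_query_url(
    "https://testing-vm.sandia.gov/cdash/queryTests.php",
    "Trilinos", "2018-03-25",
    [("testname", 61, "Anasazi_Epetra_BlockDavidson_auxtest_MPI_4"),
     ("status", 62, "passed"),
     ("status", 62, "notrun")])
print(url)
```

This keeps the filter list readable and makes it easy to regenerate the query for a different date or test name.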
*Milestone: Initial cleanup of new ATDM builds of Trilinos*

---

# Address expensive Panzer tests that timeout at 10 minutes in ATDM builds

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2446 (2018-11-30T11:16:52Z, James Willenbring)

*Created by: bartlettroscoe*
**CC:** @trilinos/panzer, @bathmatt, @fryeguy52
## Next Action Status
Pushed the commits 245e01d and d852fa3 to 'develop' to address timeouts, and this removed the timing-out tests on 3/25/2018. Addressing memory issues and re-enabling these tests will be done in other follow-on issues.
## Description
This story is to analyze and then address some expensive Panzer tests that routinely time out in the ATDM Trilinos builds, as shown, for example, in the following query listing all of the timing-out tests over the last week:
* https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-03-21&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=7&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=65&value2=Panzer&field3=status&compare3=62&value3=passed&field4=status&compare4=62&value4=notrun&field5=buildstarttime&compare5=84&value5=2018-03-23&field6=buildstarttime&compare6=83&value6=2018-03-16&field7=details&compare7=63&value7=timeout
This query shows the following 6 timing out tests:
* `PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-4`
* `PanzerAdaptersSTK_main_driver_energy-ss-loca-eigenvalue`
* `PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-2`
* `PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3`
* `PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1`
* `PanzerAdaptersSTK_PoissonInterfaceExample_2d_MPI_4`
which include the builds:
* `Trilinos-atdm-hansen-shiller-cuda-debug`
* `Trilinos-atdm-hansen-shiller-cuda-opt`
* `Trilinos-atdm-hansen-shiller-intel-debug-serial`
* `Trilinos-atdm-white-ride-cuda-debug`
* `Trilinos-atdm-white-ride-cuda-opt`
* `Trilinos-atdm-white-ride-gnu-debug-openmp`
As was discovered in https://github.com/trilinos/Trilinos/issues/2318#issuecomment-375494367, many of these tests will actually complete if the timeouts are increased. In particular, for the CUDA builds on hansen/shiller, the following set of 5 tests all passed once the timeouts were increased to over 40 minutes for those CUDA builds:
* `PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-4`
* `PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-2`
* `PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3`
* `PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1`
* `PanzerAdaptersSTK_PoissonInterfaceExample_2d_MPI_4`
The only test missing from the above list for CUDA builds on hansen/shiller was `PanzerAdaptersSTK_main_driver_energy-ss-loca-eigenvalue` and that test only timed out on the `Trilinos-atdm-white-ride-cuda-opt` build.
This Issue will be to investigate these tests some more and then decide how to address them.
## Tasks:
0. Inspect the timing out tests in the last week on all builds of Trilinos ... All can be addressed by increasing timeouts and one disable (see [below](https://github.com/trilinos/Trilinos/issues/2446#issuecomment-375730569)) **[DONE]**
1. Increase timeouts on all of the timing out Panzer tests in the last week to 45 minutes and set `CATEGORIES NIGHTLY` ...
2. See if these tests pass with longer timeouts in automated builds and see what their runtimes are when they are displayed on CDash ...
3. Decrease the timeouts for some of the tests that are not taking 45 minutes to complete ...
5. ???
## Related Issues
* Related to #2318
*Milestone: Initial cleanup of new ATDM builds of Trilinos*

---

# Test TeuchosNumerics_LAPACK_test_MPI_1 fails in all 'debug' builds on power8 'ride'

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2410 (2018-12-19T14:42:14Z, James Willenbring)

*Created by: bartlettroscoe*
**CC:** @trilinos/teuchos
## Next Action Status:
PR #2447 was merged on 3/23/2018 which disabled the test. PR #4064 which enables the whole test `TeuchosNumerics_LAPACK_test_MPI_1` but disables the single unit test for `STEQR()` merged to 'develop' on 12/18/2018. Next: Watch for test running and passing (minus `STEQR()` unit test) on 'release-debug' and 'opt' builds on 'white', 'ride', and 'waterman' on 12/19/2018 ...
## Description
The test `TeuchosNumerics_LAPACK_test_MPI_1` segfaults on the 'debug' builds `Trilinos-atdm-white-ride-cuda-debug` and `Trilinos-atdm-white-ride-gnu-debug-openmp` on 'ride' and 'white' but passes in all of the 'opt' builds on these same machines as well as for all of the builds on `hansen` as shown this morning in:
* https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-03-17&filtercombine=and&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=61&value2=TeuchosNumerics_LAPACK_test_MPI_1
The failing tests all segfault, producing the output:
```
Teuchos in Trilinos 12.13 (Dev)
GESV test ... passed!
LAPY2 test ... passed!
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 16320 on node white24 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```
Interestingly, this test failed in all of the ATDM Trilinos builds that ran yesterday, as shown in the query:
* https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-03-16&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=61&value2=TeuchosNumerics_LAPACK_test_MPI_1&field3=status&compare3=62&value3=passed
Might this be the same error reported in #1208 that we basically gave up on?
## Steps to Reproduce
Following the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite
one can reproduce this failing test by enabling the Teuchos package for the builds `gnu-debug-openmp` or `cuda-debug` and running the failing test.
## Related issues
* Related to: #1208?
*Milestone: Initial cleanup of new ATDM builds of Trilinos*

---

# Issues to be addressed before making automated PR testing and merging mandatory

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2312 (2018-03-27T14:09:26Z, James Willenbring)

*Created by: bartlettroscoe*
**CC:** @trilinos/framework, @rppawlo, @maherou, @srajama1
## Description:
There has been recent discussion about when to make the automated PR testing and merging system being developed by the @trilinos/framework team mandatory. From my own experience and conversations with other people, the following issues would need to be addressed before that can occur:
* Update and streamline the builds used by the automated PR testing system (to make sure that we are getting the needed testing, that the builds do not take too many computer resources, and that we are getting the biggest bang for our buck in protecting important functionality) (see #2317).
* **STATUS:** ???
* Post all build results to CDash to see exactly what is getting tested and what the failures are (and allow it to be reviewed before the final merge).
* Complete the upgrade of the new CDash version that supports all-at-once configure, build, test, and submit (see https://gitlab.kitware.com/snl/project-1/issues/33).
* **STATUS:** Currently, the PR build results are being submitted to the trial testing-vm.sandia.gov/cdash site and the Framework team is okay with the stability of that site for managing PR builds so this is not currently considered an impediment.
* Upgrade the minimum version of CMake used by Trilinos from 2.8.11 to 3.10 in order to take advantage of all-at-once configure, build, test, and submit with the new CDash version. (Also, once auto PR testing uses CMake 3.10, packages may start adding CMake code that does not work with older CMake versions, which will destabilize other builds of Trilinos unless we force the upgrade.)
* **STATUS:** This can be done later. For now, this might allow CMakeLists.txt files to get pushed that will break the configure of Trilinos with older CMake versions but the current post-push CI build and other nightly builds would catch that.
* Make it easy and obvious how to reproduce the exact auto PR builds and results shown on CDash and to allow people to run these locally even **before** they post their PR if they would like (e.g. #2295)
(Which includes making the env and/or machine accessible to all Trilinos developers, see https://github.com/trilinos/Trilinos/issues/2317#issuecomment-369987520.)
* **STATUS:** ???
* Auto-merge with final staged testing before merge to ‘develop’ (i.e. avoid violations of the [additive test assumption of branches](https://docs.google.com/document/d/1uVQYI2cmNx09fDkHDA136yqDTqayhxqfvjFiuUue7wo/edit#bookmark=id.d1jneh8ubsyn) that resulted in #2264).
* **STATUS:** Currently plan is to mark older PRs as "Stale" but not strictly guarantee stability of 'develop' due to violations of the "additive test assumption of branches"?
* Auto-wiping the build directory when requested, so that the system handles cases where you need to build from scratch (e.g. when Panzer files get updated, such as described in https://github.com/trilinos/Trilinos/issues/1304#issuecomment-355786180 and https://github.com/trilinos/Trilinos/issues/1304#issuecomment-356053919).
* **STATUS:** Currently a complete build from scratch with each PR build is being done to address this. With this implementation, this is not an impediment.
* Address cases where no packages get enabled and therefore no tests are run. Will the auto PR system mark them as "passed" and allow the PR to be merged?
* **STATUS:** ???
* Document the PR and auto testing process so people can understand how to use it correctly and understand why it is firing off tests in some cases and not in other cases.
* **STATUS:** ???
* Ensure the current testing infrastructure will scale when everyone is required to use topic branches for pushing anything (even documentation changes). For example, will this scale when there are 20 active PRs being edited at the same time? As topic branches multiply, how long will the lag be before things get tested and eventually merged?
* **STATUS:** This will be addressed by "Update and streamline the builds used by the automated PR testing system" above?
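The scaling question above can be framed with a back-of-envelope queue model. Every number below is a made-up placeholder for illustration, not a measurement of the actual Trilinos PR system:

```python
def pr_queue_latency_hours(active_prs, builds_per_pr, build_hours, parallel_builders):
    """Estimate worst-case wait for a PR iteration under a simple FIFO model.

    Assumes every active PR submits one testing iteration at the same time
    and build machines process jobs in parallel with no other overhead.
    """
    total_build_hours = active_prs * builds_per_pr * build_hours
    return total_build_hours / parallel_builders

# Hypothetical numbers: 20 active PRs, 3 PR builds each, 2 hours per build,
# 10 build machines available.
print(pr_queue_latency_hours(20, 3, 2.0, 10))  # -> 12.0 (hours)
```

Even this crude model shows that doubling PR traffic doubles worst-case latency unless builder capacity grows with it.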
Other issues that should be resolved (quickly) after initial adoption:
* Remove manual updates of the list of packages from the mapping of directories to packages (e.g. use the existing self-maintaining code in the TriBITS tools for that?). If we don't, things will not get tested correctly without constant manual intervention to keep that list up to date.
The motivation for this issue was conversations with @rppawlo, @jwillenbring and the [2018-02-28 Trilinos Planning meeting](https://docs.google.com/document/d/1JClLSR3n79XJT_yLPPoQv_e3rJToHxl8fUzRYojtY5Y/edit)
## Related Issues
* Related to: #1154, #1304
* Composed of: #2317
## Tasks:
1. Define the set of builds to be used in the auto PR system with a working group (#2317). See #2317 ...
1. Complete and robustify the submitting of results to CDash (there is a start to this but it is having problems, see https://github.com/trilinos/Trilinos/pull/2310#issuecomment-369390502).
1. ???
*Milestone: Improve productivity, stability, and quality of Trilinos*

---

# New timed-out Amesos2 tests in Trilinos-atdm-sems-gcc-7-2-0 build on 2/20/2018

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2270 (2019-01-24T23:43:03Z, James Willenbring)

*Created by: bartlettroscoe*
**CC:** @trilinos/amesos2
## Next Action Status
The tests were disabled in all `Trilinos_ENABLE_DEBUG=ON` builds on 2/22/2018 (see [below](https://github.com/trilinos/Trilinos/issues/2270#issuecomment-367815146)). Next: Fix the tests so that they pass?
## Description
The tests:
* Amesos2_KLU2_UnitTests_MPI_2
* Amesos2_Superlu_UnitTests_MPI_2
timed out at 10 minutes in the build `Trilinos-atdm-sems-gcc-7-2-0` this morning as shown at:
* https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3396946
Prior to this morning, these tests were taking:
* Amesos2_KLU2_UnitTests_MPI_2: 1.5s
* Amesos2_Superlu_UnitTests_MPI_2: 1.7s
It looks like these tests are hanging due to an exception being thrown?
## Steps to Reproduce
Using the `do-configure` script:
```
#!/bin/bash
cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/sems/atdm/SEMSATDMSettings.cmake,cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake \
-DDART_TESTING_TIMEOUT:STRING=300.0 \
-DTrilinos_ENABLE_TESTS:BOOL=ON \
-DCTEST_BUILD_FLAGS=-j10 \
-DCTEST_PARALLEL_LEVEL=10 \
"$@" \
$TRILINOS_DIR
```
Anyone should be able to reproduce these failures on any SNL COE RHEL6 machine as shown below:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh
$ ./do-configure -DTrilinos_ENABLE_Amesos2=ON
$ make -j16
$ ctest -j16
```
NOTE: Setting a timeout like `-DDART_TESTING_TIMEOUT:STRING=300.0` is important, or ctest will never finish when a test hangs.
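The role of the enforced timeout can be illustrated with a stand-in for a hanging test process: without an external timeout, the parent would wait forever. This sketch uses Python's `subprocess` timeout in place of ctest's `DART_TESTING_TIMEOUT`:

```python
import subprocess
import sys

# Stand-in for a hanging test: a child process that sleeps far longer
# than the allowed limit (here, 60 s against a 1 s limit).
hanging_test = [sys.executable, "-c", "import time; time.sleep(60)"]

try:
    # Analogous to ctest enforcing DART_TESTING_TIMEOUT: the child is
    # killed once the timeout expires instead of blocking forever.
    subprocess.run(hanging_test, timeout=1)
    result = "passed"
except subprocess.TimeoutExpired:
    result = "timeout"

print(result)  # -> timeout
```

The same principle applies to the Amesos2 tests above: a hang (e.g. from an exception that blocks MPI collective progress) is only reported as a timeout because some layer imposes a wall-clock limit.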
*Milestone: Keep promoted "ATDM" builds of Trilinos clean*

---

# Tpetra: Create methods to test runtime sized scalar types

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1770 (2017-10-26T21:12:55Z, James Willenbring)

*Created by: tjfulle*
This issue is to discuss and implement methods to test runtime sized scalar types as used in @trilinos/stokhos, without having to build all the way through to Stokhos.
@trilinos/tpetra
@mhoemmen
@etphipp

*Milestone: Tpetra-backlog*

---

# Define labels and rules for Trilinos application issues and metrics

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1745 (2018-11-30T11:23:42Z, James Willenbring)

*Created by: ibaned*
Please see the discussion between https://github.com/trilinos/Trilinos/issues/1304#issuecomment-329885910 and https://github.com/trilinos/Trilinos/issues/1304#issuecomment-329918877 where I propose the following:
> create a few extra issues labels besides "bug", namely "compile error (Trilinos)", "compile error (application)", "compile warning (Trilinos)", "compile warning (application)", "test failure (Trilinos)", "test failure (application)". Then we can collect statistics on how many issues with each label were opened in a particular period of time. I would also expand "(Trilinos)" to be either "(Trilinos/develop)" or "(Trilinos/master)"
Note @bartlettroscoe 's observation:
> before we define any more labels I think we need to better organize them along the lines of #1619.
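The statistics collection proposed in the quote above could start as simply as counting label occurrences over a window of opened issues. A minimal sketch; the issue records below are entirely hypothetical:

```python
from collections import Counter
from datetime import date

# Hypothetical issue records: (date opened, list of labels).
issues = [
    (date(2017, 9, 1),  ["bug", "test failure (Trilinos/develop)"]),
    (date(2017, 9, 3),  ["compile error (application)"]),
    (date(2017, 9, 20), ["test failure (Trilinos/develop)"]),
    (date(2017, 10, 2), ["compile warning (Trilinos/master)"]),
]

def label_counts(issues, start, end):
    """Count how many issues carrying each label were opened in [start, end)."""
    counts = Counter()
    for opened, labels in issues:
        if start <= opened < end:
            counts.update(labels)
    return counts

print(label_counts(issues, date(2017, 9, 1), date(2017, 10, 1)))
```

Feeding this from the real issue tracker (e.g. via its API) would give the per-period statistics the proposal asks for.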
@maherou @jwillenbring

*Milestone: Improve productivity, stability, and quality of Trilinos*

---

# Suggestion: change to issue labeling

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1619 (2019-03-21T16:27:57Z, James Willenbring)

*Created by: jhux2*
While trying to resolve a linker problem, I stumbled across the https://github.com/Unidata/netcdf-c/issues page. As you can see, that project prefixes labels with categories. I think Trilinos should consider adopting a similar format for labels.
```
package/blah
platform/blah
status/blah
type/blah
```
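A side benefit of a category-prefixed scheme like the one sketched above is that labels become machine-parseable. A minimal sketch, assuming the `category/value` convention (the label names used here are hypothetical):

```python
def split_label(label):
    """Split a 'category/value' label into its parts; uncategorized labels get None."""
    category, sep, value = label.partition("/")
    return (category, value) if sep else (None, label)

labels = ["package/muelu", "platform/cuda", "status/in-progress", "bug"]
for lab in labels:
    print(split_label(lab))
```

Tools could then group or filter issues by category without maintaining a separate mapping of label names.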
I believe this would help authors better categorize the issues and make searching easier.

*Milestone: Improve productivity, stability, and quality of Trilinos*

---

# Get all SERIAL tests in Trilinos to run as 1-process tests in MPI builds

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1422 (2017-08-22T19:15:01Z, James Willenbring)

*Created by: bartlettroscoe*
**CC:** @trilinos/framework
**Description:**
A big problem we have with SQA and testing of Trilinos is that many Trilinos packages have significant numbers of tests that only run in a non-MPI serial build (e.g. Pamgen #1415 and SEACAS #1392). That means that running the standard CI build `MPI_RELEASE_DEBUG_SHARED_PT` does not run any of these tests.
This story is to create a place-marker for package-specific stories to get all of their single-process tests now running only in the non-MPI serial build to also run in the MPI builds. This way, the set of tests run in a serial build will be a strict subset of the tests run in the MPI build. Therefore, passing the MPI build will give higher confidence that the serial build will also pass all of its tests.
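The "strict subset" goal stated above is easy to express as a checkable property over the two test lists. A sketch; the test names are placeholders, and in practice the sets would come from something like `ctest -N` output in each build tree:

```python
# Placeholder test name lists for the two build configurations.
serial_build_tests = {"PackageA_unit_test", "PackageB_io_test"}
mpi_build_tests = {"PackageA_unit_test", "PackageB_io_test",
                   "PackageA_parallel_test_MPI_4"}

def serial_is_subset(serial_tests, mpi_tests):
    """True when every serial-build test also runs in the MPI build."""
    return serial_tests <= mpi_tests

print(serial_is_subset(serial_build_tests, mpi_build_tests))  # -> True
```

A check like this could run in CI to catch new serial-only tests as packages are converted.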
*Milestone: Improve productivity, stability, and quality of Trilinos*

---

# Clean up Trilinos dependencies using subpackages to reduce and better control dependencies

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1418 (2019-03-05T14:44:47Z, James Willenbring)

*Created by: bartlettroscoe*
**CC:** @trilinos/framework
**Description:**
Currently Trilinos has a very coarse-grained set of dependencies based on fat top-level packages that don't follow [Software Engineering Packaging Principles](https://tribits.org/doc/TribitsDevelopersGuide.html#software-engineering-packaging-principles). This has several bad consequences:
* Changing an upstream package triggers a lot of downstream package enables that really should not be triggered (e.g. see #1395 and #1406).
* When users enable these fat packages with optional dependencies enabled, they get a lot of upstream code built that they typically don't need.
The solution to this problem is to better partition many of these fat top-level Trilinos packages into TriBITS subpackages and then more carefully define dependencies between subpackages. One example is having downstream packages depend on the specific subpackages "TeuchosCore", "TeuchosComm", "TeuchosParameterList", etc. instead of on all of "Teuchos" (e.g., see #1263).
*Milestone: Reduce build times for Trilinos*

---

# Get templated code in Trilinos to use ETI as much as possible to reduce (re)build times and resources

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1388 (2017-06-02T23:46:23Z, James Willenbring)

*Created by: bartlettroscoe*
**Blocked By:** (The ETI issues for individual packages as they get created.)
**CC:** @trilinos/framework, @rppawlo, @maherou
**Description:**
This story is a place marker to push the issue of getting templated packages in Trilinos to use explicit template instantiation (ETI) to reduce build times and especially rebuild times for these packages. Long build times is a huge impediment to development, testing, and usage of Trilinos.
For example, it is reported (by @dridzal) that some ROL examples consume huge amounts of compiler time and system memory in order to compile because they directly include headers from Belos, Amesos2, Ifpack2, MueLu, etc. If ETI was being used by these packages, then it is likely that those examples would compile much more quickly and use less memory.
Another example is the xSDK project where Trilinos is constantly getting hit for large build times and resources. (In fact, the xSDK installer for Trilinos was modified to remove the templated packages because of this.)
Now that Trilinos has top-level options for enabling/disabling scalar types like `float`, `complex<float>`, and `complex<double>` (see #362; there are similar options for ordinal types), every Trilinos package that has templated code needs to, as much as possible, provide for ETI based on those requested types. Note that this does *not* require packages to use a consistent system for ETI (hence, #546 is not needed). It just requires that they do ETI and respond to those type enables/disables.
It is expected that new Issues will be created for specific packages to get their code to use ETI more. Those new issues will block this Issue.
*Milestone: Reduce build times for Trilinos*

---

# Document information concerning Clean, Nightly, and Specialized tracks

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1293 (2018-02-12T20:58:34Z, James Willenbring)

*Created by: jwillenbring*
@william76
I need to document the definition of, qualifications for, and process for moving between the 3 new tracks that we are dividing the Nightly track into for automated testing. Definition of done for this story is:
Trilinos GitHub wiki page exists listing:
- Three 'develop' branch testing tracks (Clean, Nightly, and Specialized), with a description of each track.
- Process for moving from one track to another.

*Milestone: Improve productivity, stability, and quality of Trilinos*

---

# Create and publish an official Trilinos Change Management Policy

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1233 (2017-06-29T15:50:24Z, James Willenbring)

*Created by: maherou*
While the Trilinos project has a rigorous backward compatibility policy, other forms of change management are not well defined. This issue is for creating a more comprehensive statement of how we manage changes. The policy should cover these things (and probably more):
- [ ] How we inform users of commits to the repository that could change numerical results.
- [ ] When these kinds of changes can be pushed into the repository and how.
- [ ] As we start to introduce non-deterministic algorithms, how do we indicate when a function may produce results that vary from run to run.
cc: @trilinos/framework

*Milestone: Improve productivity, stability, and quality of Trilinos*