Trilinos issues: https://gitlab.osti.gov/jmwille/Trilinos/-/issues

# Teko: Tests failing on ATDM cuda 10 build (#5035)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/5035 (last updated 2019-05-02 by James Willenbring)
*Created by: fryeguy52*
## Bug Report
CC: @trilinos/teko, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
## Next Action Status
<status-and-or-first-action>
## Description
As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercount=6&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug&field2=testname&compare2=65&value2=Teko_&field3=status&compare3=61&value3=Failed&field4=site&compare4=61&value4=white&field5=buildstarttime&compare5=84&value5=2019-04-29T00%3A00%3A00&field6=buildstarttime&compare6=83&value6=2019-03-30T00%3A00%3A00) the tests:
* Teko_testdriver_MPI_1
* Teko_testdriver_MPI_4
* Teko_testdriver_tpetra_MPI_1
* Teko_testdriver_tpetra_MPI_4
are failing in the build:
* Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug
## Current Status on CDash
[Failing Teko tests on this build for current testing day](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=6&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug&field2=testname&compare2=65&value2=Teko_&field3=status&compare3=61&value3=Failed&field4=site&compare4=61&value4=white&field5=buildstarttime&compare5=84&value5=today&field6=buildstarttime&compare6=83&value6=yesterday)
## Steps to Reproduce
One should be able to reproduce this failure on ride or white as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for ride or white are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Teko=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
```
Milestone: Initial cleanup of new ATDM builds of Trilinos

# MueLu: Tests failing on ATDM cuda 10 build (#5033)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/5033 (last updated 2019-05-02 by James Willenbring)
*Created by: fryeguy52*
## Bug Report
CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
## Next Action Status
<status-and-or-first-action>
## Description
As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=6&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug&field2=testname&compare2=65&value2=MueLu_&field3=site&compare3=61&value3=white&field4=status&compare4=62&value4=Passed&field5=buildstarttime&compare5=84&value5=2019-04-29T00%3A00%3A00&field6=buildstarttime&compare6=83&value6=2019-03-30T00%3A00%3A00) the tests:
* MueLu_Maxwell3D-Epetra_MPI_4
* MueLu_ImportPerformance_Epetra_MPI_4
* MueLu_ImportPerformance_Tpetra_MPI_4
are failing in the build:
* Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug
This assertion failure is common in the output:
```
MueLu_ImportPerformance.exe: sys/memtype_cache.c:90: ucs_memtype_cache_delete: Assertion `pgt_region != ((void *)0)' failed.
```
## Current Status on CDash
[Current failing MueLu tests on this build](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=6&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug&field2=testname&compare2=65&value2=MueLu_&field3=site&compare3=61&value3=white&field4=status&compare4=62&value4=Passed&field5=buildstarttime&compare5=84&value5=today&field6=buildstarttime&compare6=83&value6=yesterday)
## Steps to Reproduce
One should be able to reproduce this failure on ride or white as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for ride or white are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
```
Milestone: Initial cleanup of new ATDM builds of Trilinos

# Ifpack2: Ifpack2_MTSGS_belos_MPI_1 randomly failing in ATDM cee rhel6 intel build (#5006)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/5006 (last updated 2019-04-24 by James Willenbring)
*Created by: fryeguy52*
## Bug Report
CC: @trilinos/ifpack2, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
## Next Action Status
<status-and-or-first-action>
## Description
As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-cee-rhel6_intel-18.0.2_mpich2-3.2_openmp_static_opt&field2=testname&compare2=61&value2=Ifpack2_MTSGS_belos_MPI_1&field3=site&compare3=61&value3=cee-rhel6&field4=buildstarttime&compare4=84&value4=2019-04-24T00%3A00%3A00&field5=buildstarttime&compare5=83&value5=2019-03-25T00%3A00%3A00) the test:
* Ifpack2_MTSGS_belos_MPI_1
is failing in the build:
* Trilinos-atdm-cee-rhel6_intel-18.0.2_mpich2-3.2_openmp_static_opt
This has failed 9 times in the last 4 weeks with something similar to:
```
Achieved tolerance: 6.44895e-10
Actual iters(20) > expected number of iterations (19), or resid-norm(0) >= 1.e-7
proc 0 total program time: 0.0258739
End Result: TEST FAILED
```
## Current Status on CDash
[Current 2 week history on CDash](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-cee-rhel6_intel-18.0.2_mpich2-3.2_openmp_static_opt&field2=testname&compare2=61&value2=Ifpack2_MTSGS_belos_MPI_1&field3=site&compare3=61&value3=cee-rhel6&field4=buildstarttime&compare4=84&value4=today&field5=buildstarttime&compare5=83&value5=2%20weeks%20ago)
## Steps to Reproduce
One should be able to reproduce this failure on a machine with a cee rhel6 environment as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for a machine with a cee rhel6 environment are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#cee-rhel6-environment
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-cee-rhel6_intel-18.0.2_mpich2-3.2_openmp_static_opt
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Ifpack2=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -j16
```
Milestone: Keep promoted "ATDM" builds of Trilinos clean

# MueLu: MueLu_FixedMatrixPattern-Tpetra_MPI_4 randomly timing out in ATDM waterman build (#5002)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/5002 (last updated 2019-05-07 by James Willenbring)
*Created by: fryeguy52*
## Bug Report
CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
## Next Action Status
<status-and-or-first-action>
## Description
As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-waterman-cuda-9.2-debug&field2=testname&compare2=61&value2=MueLu_FixedMatrixPattern-Tpetra_MPI_4&field3=site&compare3=61&value3=waterman&field4=buildstarttime&compare4=84&value4=2019-04-23T00%3A00%3A00&field5=buildstarttime&compare5=83&value5=2019-03-24T00%3A00%3A00) the test:
* MueLu_FixedMatrixPattern-Tpetra_MPI_4
is randomly timing out in the build:
* Trilinos-atdm-waterman-cuda-9.2-debug
This test usually passes in about 6.5 seconds but has timed out (10 minutes) 5 times in the last 30 days.
## Current Status on CDash
[Current 2 week history on CDash](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-waterman-cuda-9.2-debug&field2=testname&compare2=61&value2=MueLu_FixedMatrixPattern-Tpetra_MPI_4&field3=site&compare3=61&value3=waterman&field4=buildstarttime&compare4=84&value4=today&field5=buildstarttime&compare5=83&value5=2%20weeks%20ago)
## Steps to Reproduce
One should be able to reproduce this failure on waterman as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for waterman are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#waterman
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-waterman-cuda-9.2-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -n 20 ctest -j20
```
Milestone: Keep promoted "ATDM" builds of Trilinos clean

# Ifpack2_Relaxation_def.hpp variable template C++14 extension warning breaking SPARC Trilinos Integration builds starting 4/18/2019 (#4960)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/4960 (last updated 2019-04-23 by James Willenbring)
*Created by: bartlettroscoe*
CC: @trilinos/ifpack2, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
<Checklist>
<???: Add label "client: ATDM">
<???: Add label "ATDM Sev: Blocker" (by default but could be other "ATDM Sev: XXX")>
<???: Add label "type: bug"?>
<???: Add label for affected packages (e.g. "pkg: MueLu", "pkg: Tpetra", "pkg: Kokkos", etc.)>
<???: Add label "PA: ???Project Area???" (e.g. "PA: Linear Solvers", "PA: Data Services")>
<???: Add milestone "Initial cleanup of new ATDM ..." or "Keep promoted ATDM ...">
<???: Once GitHub Issue is created, add entries for tests to TrilinosATDMStatus/*.csv files>
## Next Action Status
<status-and-or-first-action>
## Description
The new warning elevated to an error:
```
Ifpack2_Relaxation_def.hpp:147:6: error: variable templates are a C++14 extension [-Werror,-Wc++14-extensions]
void Relaxation::updateCachedMultiVector(const Teuchos::RCP > & map, size_t numVecs) const{
^
```
is breaking the SPARC Trilinos integration builds as shown [here](http://compsim-dashboard.sandia.gov/cdash/index.php?project=SPARC&date=2019-04-18&filtercount=1&showfilters=1&field1=buildname&compare1=66&value1=-trildev) with that warning being shown [here](http://compsim-dashboard.sandia.gov/cdash/viewBuildError.php?buildid=108404)
## Current Status on CDash
* [sparc-alltpls_cee-cpu_clang-5.0.1_openmpi-1.10.2_static_opt-trildev builds over last 5 days](http://compsim-dashboard.sandia.gov/cdash/index.php?project=SPARC&date=2019-04-18&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=sparc-alltpls_cee-cpu_clang-5.0.1_openmpi-1.10.2_static_opt-trildev&field2=buildstarttime&compare2=83&value2=5%20days%20ago)
## Steps to Reproduce
I can't see this warning being generated in the [Ifpack2 package build itself for the build Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt](https://testing.sandia.gov/cdash-dev-view/viewBuildError.php?type=1&buildid=4910167), so I am not sure one can reproduce this just with Trilinos. (Does this suggest a lack of test coverage for Ifpack2?)
But if one can get on the CEE LAN and can clone the SPARC repos, then one can reproduce using the scripts described [here](https://snl-wiki.sandia.gov/display/CoodinatedDevOpsATDM/Building+ATDM+APPs+Against+Local+Installs+of+Trilinos#BuildingATDMAPPsAgainstLocalInstallsofTrilinos-BuildingandTestingSPARCAgainstLocalTrilinosInstallation). After getting Trilinos on to the 'develop' branch as described [here](https://snl-wiki.sandia.gov/display/CoodinatedDevOpsATDM/Building+ATDM+APPs+Against+Local+Installs+of+Trilinos#BuildingATDMAPPsAgainstLocalInstallsofTrilinos-BuildingagainstTrilinos'develop'usingtheATDMTrilinosconfiguration), one can reproduce this using the command:
```
$ cd sparc/
$ env \
ATDM_TRIL_SPARC_BUILDS_LIST=cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt \
ATDM_TRIL_SPARC_SKIP_NATIVE_BUILD=1 \
./sparc-tril-dev-scripts/run_builds_and_tests.sh
```
Milestone: Keep promoted "ATDM" builds of Trilinos clean

# Teko_testdriver_tpetra_MPI_1 randomly failing in ATDM waterman build (#4779)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/4779 (last updated 2019-04-01 by James Willenbring)
*Created by: fryeguy52*
CC: @trilinos/teko, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
## Next Action Status
<status-and-or-first-action>
## Description
As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-waterman-cuda-9.2-release-debug&field2=testname&compare2=61&value2=Teko_testdriver_tpetra_MPI_1&field3=site&compare3=61&value3=waterman&field4=buildstarttime&compare4=84&value4=2019-04-01&field5=buildstarttime&compare5=83&value5=2019-02-28) the test:
* Teko_testdriver_tpetra_MPI_1
looks to be randomly failing in the build:
* Trilinos-atdm-waterman-cuda-9.2-release-debug
It has failed 5 times in the last month, each time with:
```
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorAssert): device-side assert triggered /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
```
Full output from a failed run can be found [here](https://testing.sandia.gov/cdash/testDetails.php?test=72920917&build=4811399).
## Current Status on CDash
Current 4-week history can be found [here](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-waterman-cuda-9.2-release-debug&field2=testname&compare2=61&value2=Teko_testdriver_tpetra_MPI_1&field3=site&compare3=61&value3=waterman&field4=buildstarttime&compare4=84&value4=today&field5=buildstarttime&compare5=83&value5=4%20weeks%20ago).
## Steps to Reproduce
One should be able to reproduce this failure on waterman as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for waterman are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#waterman
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-waterman-cuda-9.2-release-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Teko=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -n 20 ctest -j20
```
Milestone: Keep promoted "ATDM" builds of Trilinos clean

# MueLu build failures in new ATDM Trilinos sems-rhel7+cuda+complex builds (#4599)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/4599 (last updated 2019-04-10 by James Willenbring)
*Created by: bartlettroscoe*
CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
## Next Action Status
<status-and-or-first-action>
## Description
As shown in [this query](https://testing.sandia.gov/cdash-dev-view/index.php?project=Trilinos&date=2019-03-11&filtercount=2&showfilters=1&filtercombine=and&field1=subprojects&compare1=93&value1=MueLu&field2=buildname&compare2=65&value2=Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-), MueLu has build errors in library code in the new cuda+complex builds:
* `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug`
* `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug`
using the 'sems-rhel7' env.
The build errors shown [here](https://testing.sandia.gov/cdash-dev-view/viewBuildError.php?buildid=4695056) and [here](https://testing.sandia.gov/cdash-dev-view/viewBuildError.php?buildid=4695082) occur when building the source file **`ExplicitInstantiation/MueLu_TentativePFactory_kokkos.cpp`**, with errors like:
* `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Kokkos_View.hpp(816): error: calling a constexpr __host__ function("std::real<double> ") from a __device__ function("Kokkos::Impl::ParallelFor< ::, ::Kokkos::RangePolicy<int, ::Kokkos::Cuda > , ::Kokkos::Cuda> ::operator () const") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.`
and errors like:
* `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Kokkos_View.hpp(971): error: calling a constexpr __host__ function("std::complex<double> ::complex") from a __device__ function("Kokkos::Impl::ParallelFor< ::, ::Kokkos::RangePolicy<int, ::Kokkos::Cuda > , ::Kokkos::Cuda> ::operator () const") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.`
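The compiler message itself points at a possible workaround. As a hedged sketch only (whether `ATDMDevEnv.cmake` already manages `CMAKE_CXX_FLAGS`, and whether the flag reaches `nvcc` unmodified in these builds, is not verified here), the experimental flag could be added to the configure line:

```
$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DCMAKE_CXX_FLAGS="--expt-relaxed-constexpr" \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
  $TRILINOS_DIR
```

The more robust fix is presumably in the source (avoiding `std::complex`/`std::real` calls in device code), since an experimental nvcc flag changes the rules for all constexpr host functions called from device code.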
## Current Status on CDash
The current status of these builds over the last 7 days can be seen in [this query](https://testing.sandia.gov/cdash/index.php?project=Trilinos&date=2019-03-11&filtercount=3&showfilters=1&filtercombine=and&field1=subprojects&compare1=93&value1=MueLu&field2=buildname&compare2=65&value2=Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-&field3=buildstarttime&compare3=83&value3=7%20days%20ago).
## Steps to Reproduce
These builds are from the CEE LAN machine 'ascicgpu14' and someone with access to the CEE LAN should be able to log onto 'ascicgpu15' and reproduce these failures as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for the 'sems-rhel7' system are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#sems-rhel7-environment
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh \
sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
$TRILINOS_DIR
$ ninja -j16
```
Since some developers do not have access to the SRN CEE LAN, it is likely that these build errors can also be reproduced on other machines that have a CUDA build. For example, one can likely reproduce these build errors on the SON machine 'white' as described at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite
using the commands:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-complex-release-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
$TRILINOS_DIR
$ ninja -j16
```
Milestone: Initial cleanup of new ATDM builds of Trilinos

# Elevate ShyLU_Node from ST to PT since it is being used by SPARC? (#4219)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/4219 (last updated 2019-01-26 by James Willenbring)
*Created by: bartlettroscoe*
**CC:** @trilinos/framework, @trilinos/shylu, @srajama1 (Trilinos Linear Solvers Product Area Lead)
**Blocking:** #2597
## Description
The current SPARC Trilinos configuration explicitly enables the `ShyLU_Node` subpackage `ShyLU_NodeHTS` (see [here](https://sems-atlassian-son.sandia.gov/jira/browse/TRIL-212?focusedCommentId=25503&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25503)). Therefore, ATDM Trilinos builds supporting SPARC are enabling ShyLU_Node, for example, as shown [here](https://testing.sandia.gov/cdash-dev-view/viewConfigure.php?buildid=4427483):
```
...
-- Setting Trilinos_ENABLE_ShyLU_NodeHTS=ON
-- Setting Trilinos_ENABLE_ShyLU_NodeTacho=ON
-- Setting Trilinos_ENABLE_ShyLU_Node=ON
...
Final set of enabled packages: ... ShyLU_Node ... 41
Final set of enabled SE packages: ... ShyLU_NodeHTS ShyLU_NodeTacho ShyLU_Node ... 112
```
So it looks like `ShyLU_NodeTacho` may be getting implicitly enabled by accident. (We will need to see whether SPARC is actually using `ShyLU_NodeTacho` or not.) Note that `ShyLU_NodeTacho` is already declared to be `PT` (Primary Tested), but `ShyLU_NodeHTS` is currently declared to be `ST` (Secondary Tested).
In any case, since an important internal Trilinos customer (i.e., SPARC) is using `ShyLU_NodeHTS`, [by definition](http://trac.trilinos.org/wiki/TribitsLifecycleModelOverview#test_categories), it needs to be elevated from Secondary Tested (ST) to Primary Tested (PT). Otherwise, `ShyLU_NodeHTS` will not get enabled in Trilinos PR builds and therefore will not protect SPARC (see #2597).
## Proposed Solution
Update the line:
```
HTS hts ST OPTIONAL
```
to be
```
HTS hts PT OPTIONAL
```
in the file
* Trilinos/packages/shylu/shylu_node/cmake/Dependencies.cmake
# Belos_Tpetra_HybridGMRES_hb_test_* randomly failing in many trilinos builds (#4159)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/4159 (last updated 2019-04-26 by James Willenbring)
*Created by: fryeguy52*
CC: @trilinos/belos, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
## Next Action Status
PR #4229 that may fix this was merged to 'develop' on 1/22/2019. Next: Watch for any more random failures and if no new failures by 2/22/2019 then we can close ...
## Description
As shown in [this query](https://testing.sandia.gov/cdash-dev-view/queryTests.php?project=Trilinos&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm&field2=testname&compare2=65&value2=Belos_Tpetra_HybridGMRES_hb_test_&field3=status&compare3=62&value3=passed&field4=status&compare4=62&value4=notrun&field5=buildstarttime&compare5=83&value5=2018-12-01) the tests:
* Belos_Tpetra_HybridGMRES_hb_test_1_MPI_4
* Belos_Tpetra_HybridGMRES_hb_test_0_MPI_4
have failed 11 total times since 2018-12-01 in the following ATDM builds:
* Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt
* Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt
* Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt
* Trilinos-atdm-mutrino-intel-opt-openmp-HSW
* Trilinos-atdm-sems-rhel6-gnu-debug-openmp
* Trilinos-atdm-sems-rhel6-gnu-opt-openmp
* Trilinos-atdm-sems-rhel6-gnu-opt-serial
[This query](https://testing.sandia.gov/cdash-dev-view/queryTests.php?project=Trilinos&filtercount=4&showfilters=1&filtercombine=and&field1=testname&compare1=65&value1=Belos_Tpetra_HybridGMRES_hb_test_&field2=status&compare2=62&value2=passed&field3=status&compare3=62&value3=notrun&field4=buildstarttime&compare4=83&value4=2018-12-01) shows that `Belos_Tpetra_HybridGMRES_hb_test_*` tests have been failing in other Trilinos builds as well during that same time period.
Here is some typical output from a failure:
```
Belos Version 1.3d - 9/17/2008
Dimension of matrix: 1806
Number of right-hand sides: 1
Block size used by solver: 1
Max number of Gmres iterations: 1805
Relative residual tolerance: 1e-05
Failed.......OR Combination ->
OK...........Number of Iterations = 800 < 1805
Unconverged..(2-Norm Res Vec) / (2-Norm Prec Res0)
residual [ 0 ] = 0.0224497 > 1e-05
========================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
------------------------------------------------------------------------------------------------------------------------
Belos: BlockGmresSolMgr total solve time 0.5308 (1) 0.5308 (1) 0.5308 (1) 0.5308 (1)
Belos: DGKS[2]: Ortho (Inner Product) 0.03627 (1370) 0.03643 (1370) 0.03654 (1370) 2.659e-05 (1370)
Belos: DGKS[2]: Ortho (Norm) 0.01371 (2416) 0.01547 (2416) 0.01742 (2416) 6.402e-06 (2416)
Belos: DGKS[2]: Ortho (Update) 0.02398 (1370) 0.02443 (1370) 0.02485 (1370) 1.783e-05 (1370)
Belos: DGKS[2]: Orthogonalization 0.08255 (816) 0.08426 (816) 0.08634 (816) 0.0001033 (816)
Belos: GmresPolyOp creation time 0.001283 (1) 0.001308 (1) 0.001326 (1) 0.001308 (1)
Belos: Hybrid Gmres: Operation Op*x 0.03555 (815) 0.03593 (815) 0.03648 (815) 4.408e-05 (815)
Belos: Hybrid Gmres: Operation Prec*x 0.3885 (816) 0.3908 (816) 0.3932 (816) 0.0004789 (816)
Belos: Operation Op*x 0.382 (8986) 0.3853 (8986) 0.3886 (8986) 4.288e-05 (8986)
Belos: Operation Prec*x 0 (0) 0 (0) 0 (0) 0 (0)
========================================================================================================================
---------- Actual Residuals (normalized) ----------
Problem 0 : 0.0224497
End Result: TEST FAILED
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[58035,1],2]
Exit code: 1
--------------------------------------------------------------------------
```
## Current Status on CDash
* [Current status and recent history of failures of test Belos_Tpetra_HybridGMRES_hb_test_* on CDash](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=ATDM&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=testname&compare3=65&value3=Belos_Tpetra_HybridGMRES_hb_test_&field4=status&compare4=61&value4=failed&field5=buildstarttime&compare5=83&value5=4%20weeks%20ago)
* [Recent history of test Belos_Tpetra_HybridGMRES_hb_test_1_MPI_4 in build](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercount=4&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-chama-intel-opt-openmp&field2=testname&compare2=61&value2=Belos_Tpetra_HybridGMRES_hb_test_1_MPI_4&field3=site&compare3=61&value3=chama&field4=buildstarttime&compare4=83&value4=4%20weeks%20ago)
## Steps to Reproduce
One should be able to reproduce a build where this random failure has a chance of occurring with a sems rhel6 environment as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for with a sems rhel6 environment are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#sems-rhel6-environment
The exact commands to reproduce a build where this random failure has a chance of occurring should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-sems-rhel6-gnu-opt-openmp
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -j8
```
Milestone: Keep promoted "ATDM" builds of Trilinos clean

# MueLu: MueLu hangs when try to "export data" such as matrices after repartitioning has occurred (#3991)
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3991 (last updated 2019-01-23 by James Willenbring)
*Created by: pwxy*
MueLu hangs when trying to "export data" such as matrices after repartitioning has occurred.
The MPI processes that have dropped out after repartitioning throw, and the run hangs:
```
p=3: *** Caught standard std::exception of type 'Teuchos::bad_any_cast' :
../../packages/muelu/src/Interface/../MueCentral/MueLu_VariableContainer.hpp:103:
Throw number = 17
Throw test that evaluated to true: data_->type() != typeid(T)
Error, cast to type Data<Teuchos::RCP<Xpetra::Matrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > >> failed since the actual underlying type is 'Teuchos::RCP<Xpetra::Operator<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > >'!
```
This is develop Trilinos cloned this morning (Dec 4, 2018), SHA1 573e3290b0500eee45e582cb8fcee0b1c6476cec.

Example `MueLu_Driver.exe` run that exhibits this issue:
```
mpirun -n 4 MueLu_Driver.exe --matrixType=Laplace3D --nx=50 --ny=50 --nz=4 --mx=2 --my=2 --mz=1
```

The input deck (`scaling.xml`):
```xml
<ParameterList name="MueLu">
<!--
For a generic symmetric scalar problem, these are the recommended settings for MueLu.
-->
<!-- =========== GENERAL ================ -->
<Parameter name="verbosity" type="string" value="high"/>
<Parameter name="coarse: max size" type="int" value="1000"/>
<Parameter name="multigrid algorithm" type="string" value="sa"/>
<!-- reduces setup cost for symmetric problems -->
<Parameter name="transpose: use implicit" type="bool" value="true"/>
<!-- start of default values for general options (can be omitted) -->
<Parameter name="max levels" type="int" value="10"/>
<Parameter name="number of equations" type="int" value="1"/>
<Parameter name="sa: use filtered matrix" type="bool" value="true"/>
<!-- end of default values -->
<!-- =========== AGGREGATION =========== -->
<Parameter name="aggregation: type" type="string" value="uncoupled"/>
<Parameter name="aggregation: drop scheme" type="string" value="classical"/>
<!-- Uncomment the next line to enable dropping of weak connections, which can help AMG convergence
for anisotropic problems. The exact value is problem dependent. -->
<!-- <Parameter name="aggregation: drop tol" type="double" value="0.02"/> -->
<!-- =========== SMOOTHING =========== -->
<Parameter name="smoother: type" type="string" value="CHEBYSHEV"/>
<ParameterList name="smoother: params">
<Parameter name="chebyshev: degree" type="int" value="2"/>
<Parameter name="chebyshev: ratio eigenvalue" type="double" value="7"/>
<Parameter name="chebyshev: min eigenvalue" type="double" value="1.0"/>
<Parameter name="chebyshev: zero starting solution" type="bool" value="true"/>
</ParameterList>
<!-- =========== REPARTITIONING =========== -->
<Parameter name="repartition: enable" type="bool" value="true"/>
<Parameter name="repartition: partitioner" type="string" value="zoltan2"/>
<Parameter name="repartition: start level" type="int" value="2"/>
<Parameter name="repartition: min rows per proc" type="int" value="800"/>
<Parameter name="repartition: max imbalance" type="double" value="1.1"/>
<Parameter name="repartition: remap parts" type="bool" value="false"/>
<!-- start of default values for repartitioning (can be omitted) -->
<Parameter name="repartition: remap parts" type="bool" value="true"/>
<Parameter name="repartition: rebalance P and R" type="bool" value="false"/>
<ParameterList name="repartition: params">
<Parameter name="algorithm" type="string" value="multijagged"/>
</ParameterList>
<!-- end of default values -->
<ParameterList name="export data">
<Parameter name="A" type="string" value="{2}"/>
</ParameterList>
</ParameterList>
```
# MueLu_ENABLE_Kokkos_Refactor_Use_By_Default not ON by default for CUDA builds
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3941 (2019-01-15, James Willenbring)
*Created by: bartlettroscoe*
@trilinos/muelu
CC: @bathmatt, @jmgate, @jhux2, @fryeguy52
It was reported by EMPIRE developers (i.e. @bathmatt) that the option `HAVE_MUELU_KOKKOS_REFACTOR_USE_BY_DEFAULT` is getting set to `OFF` by default for CUDA builds, whereas it previously seemed to be getting set to `ON` by default, as verified in https://github.com/trilinos/Trilinos/issues/2674#issuecomment-433185731 on 10/25/2018.
However, looking at a recent CUDA build, `Trilinos-atdm-white-ride-cuda-9.2-debug` for example, shown [here](https://testing.sandia.gov/cdash-dev-view/viewNotes.php?buildid=4220171#!#note1) today, we see:
```
MueLu_ENABLE_Kokkos_Refactor:BOOL=ON
MueLu_ENABLE_Kokkos_Refactor_Use_By_Default:BOOL=NO
...
Xpetra_ENABLE_Kokkos_Refactor:BOOL=ON
```
But I am also seeing:
```
HAVE_MUELU_KOKKOS_REFACTOR:INTERNAL=ON
HAVE_MUELU_KOKKOS_REFACTOR_USE_BY_DEFAULT:INTERNAL=OFF
...
HAVE_XPETRA_KOKKOS_REFACTOR:INTERNAL=ON
```
@jhux2, how is it possible for `MueLu_ENABLE_Kokkos_Refactor_Use_By_Default` to be different than `HAVE_MUELU_KOKKOS_REFACTOR_USE_BY_DEFAULT`?
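To compare the `BOOL` cache options against the `INTERNAL` `HAVE_*` values in one place, a quick sketch (the helper name is ours; point it at the `CMakeCache.txt` in your build dir):

```shell
# Hypothetical helper: list all Kokkos_Refactor-related entries from a
# CMakeCache.txt so the BOOL options and INTERNAL HAVE_* values can be
# compared side by side.
kokkos_refactor_entries() {
  grep -i 'KOKKOS_REFACTOR' "$1" | sort
}

# Example:
# kokkos_refactor_entries <some_build_dir>/CMakeCache.txt
```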
# MueLu_ConvergenceTpetra_MPI tests failing on mutrino KNL ATDM build
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3862 (2019-02-28, James Willenbring)
*Created by: fryeguy52*
CC: @trilinos/muelu @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe
## Next Action Status
## Description
As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercount=4&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-mutrino-intel-opt-openmp-KNL&field2=testname&compare2=63&value2=MueLu_ConvergenceTpetra_MPI_&field3=buildstarttime&compare3=84&value3=2018-11-13T00%3A00%3A00&field4=buildstarttime&compare4=83&value4=2018-11-12T00%3A00%3A00) the tests:
* MueLu_ConvergenceTpetra_MPI_1
* MueLu_ConvergenceTpetra_MPI_4
are failing in the build:
* Trilinos-atdm-mutrino-intel-opt-openmp-KNL
Test output:
```
srun: Job 13306723 step creation temporarily disabled, retrying
srun: Step created for job 13306723
p=0: *** Caught standard std::exception of type 'N7Teuchos40CantFindParameterEntryConverterExceptionE' :
/lscratch1/jenkins/mutrino-slave/workspace/Trilinos-atdm-mutrino-intel-opt-openmp-KNL/SRC_AND_BUILD/Trilinos/packages/teuchos/parameterlist/src/Teuchos_ParameterEntryXMLConverterDB.cpp:85:
Throw number = 1
Throw test that evaluated to true: it == getConverterMap().end()
Can't find converter for parameter entry of type: long long
End Result: TEST FAILED
srun: error: nid00192: task 0: Exited with exit code 1
srun: Terminating job step 13306723.920
```
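The `CantFindParameterEntryConverterException` says the Teuchos XML reader hit a parameter of type `long long` with no registered converter in this build. One hypothesis worth checking is whether any input deck (or generated parameter list written to XML) declares such an entry; a hedged triage helper (the name is ours):

```shell
# Hypothetical triage helper: report file:line for every XML parameter
# declared with type="long long" in the given files.
find_long_long_params() {
  grep -n 'type="long long"' "$@"
}

# Example:
# find_long_long_params *.xml
```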
## Current Status on CDash
The current status of these tests/builds for the current testing day can be found at:
[Tests status for current day](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-mutrino-intel-opt-openmp-KNL&field2=testname&compare2=63&value2=MueLu_ConvergenceTpetra_MPI_)
This build takes a long time to run and the results may not be available until later in the day, so below is a link to the status of the tests from the previous day.
[Test status of previous day](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-mutrino-intel-opt-openmp-KNL&field2=testname&compare2=63&value2=MueLu_ConvergenceTpetra_MPI_&field3=buildstarttime&compare3=83&value3=yesterday)
## Steps to Reproduce
One should be able to reproduce this failure on the machine mutrino as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for the system mutrino are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#mutrino
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-mutrino-intel-opt-openmp-KNL
$ cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
$TRILINOS_DIR
$ make -j16
$ salloc -N 1 -p standard -J Trilinos-atdm-mutrino-intel-opt-openmp-KNL ctest -j16
```
Initial cleanup of new ATDM builds of Trilinos

# Stratimikos_test_single_belos_thyra_solver_driver_nos1_nrhs8_MPI_1 failing on KNL mutrino ATDM build
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3863 (2019-02-28, James Willenbring)
*Created by: fryeguy52*
CC: @trilinos/<package-name>, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe
## Next Action Status
## Description
As shown in [this query](<cdash-query-url>) the test:
* Stratimikos_test_single_belos_thyra_solver_driver_nos1_nrhs8_MPI_1
is failing in the builds:
* Trilinos-atdm-mutrino-intel-opt-openmp-KNL
Detailed test output can be found [here](https://testing.sandia.gov/cdash/testDetails.php?test=59061703&build=4163447). It looks like the solver achieved the requested tolerance but failed to report convergence:
```
Starting iterations with Belos:
Using forward operator = N5Thyra14EpetraLinearOpE{rangeDim=237,domainDim=237}
Using iterative solver = "Belos::BlockGmresSolMgr":
Template parameters:
ScalarType: double
MV: N5Thyra15MultiVectorBaseIdEE
OP: N5Thyra12LinearOpBaseIdEE
Flexible: false
Num Blocks: 300
Maximum Iterations: 1000
Maximum Restarts: 25
Convergence Tolerance: 1e-13
With #Eqns=237, #RHSs=8 ...
The Belos solver of type ""Belos::BlockGmresSolMgr": {Flexible: false, Num Blocks: 300, Maximum Iterations: 1000, Maximum Restarts: 25, Convergence Tolerance: 1e-13}" returned a solve status of "SOLVE_STATUS_UNCONVERGED" in 1 iterations with total CPU time of 0.241756 sec
=> Solve time = 0.241863 sec
solve status:
solveStatus = SOLVE_STATUS_UNCONVERGED
achievedTol = 3.94193e-16
message:
The Belos solver of type ""Belos::BlockGmresSolMgr": {Flexible: false, Num Blocks: 300, Maximum Iterations: 1000, Maximum Restarts: 25, Convergence Tolerance: 1e-13}" returned a solve status of "SOLVE_STATUS_UNCONVERGED" in 1 iterations with total CPU time of 0.241756 sec
extraParameters:
Belos/Iteration Count : int = 1 [unused]
Iteration Count : int = 1 [unused]
Belos/Achieved Tolerance : double = 3.94193e-16 [unused]
check: solveStatus = SOLVE_STATUS_UNCONVERGED == SOLVE_STATUS_CONVERGED : FAILED
```
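Given that `achievedTol = 3.94193e-16` is well below the requested `1e-13`, this looks like a reporting problem rather than a real convergence failure. A throwaway sketch to confirm that from a captured log (the helper name is ours; the log format is assumed to match the output quoted above):

```shell
# Hypothetical check: pull the requested tolerance and the achieved
# tolerance out of a Belos solve log and report whether the achieved
# residual actually met the target.
check_tolerance() {
  awk '
    /^[[:space:]]*Convergence Tolerance:/ { tol = $NF }
    /achievedTol/                         { ach = $NF }
    END {
      if (ach + 0 <= tol + 0) print "achieved " ach " <= target " tol
      else                    print "achieved " ach " > target " tol
    }' "$@"
}

# Example:
# check_tolerance solve_log.txt
```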
## Current Status on CDash
The current status of these tests/builds for the current testing day can be found at:
[Tests status for current day](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-mutrino-intel-opt-openmp-KNL&field2=testname&compare2=61&value2=Stratimikos_test_single_belos_thyra_solver_driver_nos1_nrhs8_MPI_1)
This build takes a long time to run and the results may not be available until later in the day, so below is a link to the status of the tests from the previous day.
[Test status of previous day](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-11-12&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-mutrino-intel-opt-openmp-KNL&field2=testname&compare2=61&value2=Stratimikos_test_single_belos_thyra_solver_driver_nos1_nrhs8_MPI_1&field3=buildstarttime&compare3=83&value3=yesterday)
## Steps to Reproduce
One should be able to reproduce this failure on the machine mutrino as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for the system mutrino are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#mutrino
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-mutrino-intel-opt-openmp-KNL
$ cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stratimikos=ON \
$TRILINOS_DIR
$ make -j16
$ salloc -N 1 -p standard -J Trilinos-atdm-mutrino-intel-opt-openmp-KNL ctest -j16
```
Initial cleanup of new ATDM builds of Trilinos

# Teko_ModALPreconditioner_MPI_1 Failing in build Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3638 (2018-12-20, James Willenbring)
*Created by: fryeguy52*
CC: @trilinos/teko , @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe
## Next Action Status
With the merge of PR #4079 to 'develop' on 12/19/2018, this test `Teko_ModALPreconditioner_MPI_1` is disabled and shown missing in the build `Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt ` on 12/20/2018.
## Description
As shown in [this query](https://testing.sandia.gov/cdash-dev-view/queryTests.php?project=Trilinos&date=2018-10-15&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-cee-rhel6-&field2=status&compare2=62&value2=passed) the test:
* Teko_ModALPreconditioner_MPI_1
is failing in the build:
* Trilinos-atdm-cee-rhel6-clang-opt-serial
It is failing with a segfault:
```
[ceerws1113:36105] *** Process received signal ***
[ceerws1113:36105] Signal: Segmentation fault (11)
[ceerws1113:36105] Signal code: Address not mapped (1)
[ceerws1113:36105] Failing at address: (nil)
```
## Steps to Reproduce
One should be able to reproduce this failure on any CEE LAN RHEL6 SRN machine as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for the CEE LAN RHEL6 SRN machines are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#cee-rhel6-environment
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cee-rhel6-clang-opt-serial
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Teko=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -j16
```
Initial cleanup of new ATDM builds of Trilinos

# Test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 appears to be randomly failing in many builds including CI, PR, and ATDM builds
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3585 (2019-04-02, James Willenbring)
*Created by: bartlettroscoe*
CC: @trilinos/framework, @trilinos/anasazi, @srajama1 (Trilinos Linear Solver Product Area Lead)
## Next Action Status
PR #4052 merged to 'develop' on 12/18/2018 but still failing after that. Next: Try to fix again?
## Description
It would seem that the test `Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4` is very occasionally randomly failing in various builds. As shown in [this query](https://testing.sandia.gov/cdash-dev-view/queryTests.php?project=Trilinos&date=2018-10-09&filtercount=4&showfilters=1&filtercombine=and&field1=testname&compare1=61&value1=Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4&field2=status&compare2=61&value2=failed&field3=details&compare3=64&value3=timeout&field4=buildstarttime&compare4=83&value4=2018-07-01), this test failed 10 times since 7/1/2018 in the builds:
* `Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_OPENMP_CI` (post-push CI build): 1 time (today)
* `PR-XXXX-test-Trilinos_pullrequest_gcc_4.9.3-YYYY` (standard PR build): 4 times
* `PR-XXXX-test-Trilinos_pullrequest_gcc_4.8.4-YYYY` (standard PR build): 1 time
* `Trilinos-atdm-chama-intel-debug-openmp` (standard ATDM build): 1 time
* `Trilinos-atdm-rhel6-gnu-opt-openmp` (standard ATDM build): 2 times
* `Trilinos-atdm-waterman-cuda-9.2-debug` (standard ATDM build): 1 time
In each of these 10 failures in the last 3 months, such as the CI failure today shown [here](https://testing.sandia.gov/cdash-dev-view/testDetails.php?test=56264374&build=4031303), it shows failures like:
```
projectAndNormalizeGen() returned rank 5
|| <S,S> - I || after : 2.65912e-11
1|| S_in - X1*C1 - X2*C2 - S_out*B || : 1.70776e-09
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv tolerance exceeded! test failed!
```
The location of these failures seems to change in this test but all of the failures appear to be "tolerance exceeded! test failed!"
Is there some type of non-deterministic behavior in this test or in the underlying Anasazi code that allows for these types of random failures?
## Steps to Reproduce
Given that this test seems to be failing randomly only very occasionally, this might be hard to reproduce locally. But given that this has failed in the post-push GCC 4.8.4 CI build and the GCC 4.9.3 PR build one might be able to use one of those.
Keep promoted "ATDM" builds of Trilinos clean

# Anasazi tests failing in intel-18.0.2 builds on 'mutrino' and 'cee-rhel6' envs
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3499 (2019-03-26, James Willenbring)
*Created by: fryeguy52*
CC: @trilinos/anasazi , @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe
## Next Action Status
## Description
As shown in [this query](https://testing.sandia.gov/cdash-dev-view/queryTests.php?project=Trilinos&date=2018-09-24&filtercount=5&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=ATDM&field2=site&compare2=61&value2=mutrino&field3=status&compare3=62&value3=passed&field4=buildstarttime&compare4=83&value4=2018-09-01&field5=testname&compare5=63&value5=Anasazi) the tests:
* `Anasazi_MultiVecTraitsTest2_MPI_4`
* `Anasazi_Epetra_BKS_norestart_test_MPI_4`
are failing in the builds:
* Trilinos-atdm-mutrino-intel-opt-openmp-HSW
* Trilinos-atdm-mutrino-intel-opt-openmp-KNL
Both of these tests started failing on 9/22/2018.
The test `Anasazi_Epetra_BKS_norestart_test_MPI_4` is also failing in the build `Trilinos-atdm-cee-rhel6-intel-18.0.2-mpich2-3.2-serial-static-opt` for the 'cee-rhel6' env since it was first set up.
The first failure of the test `Anasazi_MultiVecTraitsTest2_MPI_4` on 9/22/2018 is shown [here](https://testing.sandia.gov/cdash-dev-view/testDetails.php?test=55028214&build=3963372), which shows:
```
Check B_view = CloneViewNonConst(B, ind):
ind: [0, 2, 4, 6, 8]
static_cast<size_t> (B_view->getNumVectors ()) = 5 == static_cast<size_t> (ind.size ()) = 5 : passed
norms of CloneViewNonConst(B, ind): [2.42234, 2.43667, 2.43783, 2.39508, 2.97253]
B_view_norms[j] = 2.42233923795730766e+00 == normsB1[ind[j]] = 2.42233923795730766e+00 : passed
B_view_norms[j] = 2.43666989157577962e+00 == normsB1[ind[j]] = 2.43666989157578007e+00 : FAILED ==> /lscratch1/jenkins/mutrino-slave/workspace/Trilinos-atdm-mutrino-intel-opt-openmp-HSW/SRC_AND_BUILD/Trilinos/packages/anasazi/tpetra/test/MVOPTester/MultiVecTraitsTest2.cpp:573
...
[FAILED] (0.158 sec) MultiVecTraits_TpetraSetBlock4_UnitTest
Location: /lscratch1/jenkins/mutrino-slave/workspace/Trilinos-atdm-mutrino-intel-opt-openmp-HSW/SRC_AND_BUILD/Trilinos/packages/anasazi/tpetra/test/MVOPTester/MultiVecTraitsTest2.cpp:434
The following tests FAILED:
3. MultiVecTraits_TpetraSetBlock4_UnitTest ...
Total Time: 1.55 sec
Summary: total = 4, run = 4, passed = 3, failed = 1
```
The first failure of the test `Anasazi_Epetra_BKS_norestart_test_MPI_4` on 9/22/2018 is shown [here](https://testing.sandia.gov/cdash-dev-view/testDetails.php?test=55028175&build=3963372), which shows:
```
Anasazi_Epetra_BKS_norestart_test.exe: malloc.c:2392: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 *(sizeof(size_t)) < __alignof__ (long double) ? __alignof__ (long double) : 2 *(sizeof(size_t))) - 1)) & ~((2 *(sizeof(size_t)) < __alignof__ (long double) ? __alignof__ (long double) : 2 *(sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
Anasazi_Epetra_BKS_norestart_test.exe: malloc.c:2392: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 *(sizeof(size_t)) < __alignof__ (long double) ? __alignof__ (long double) : 2 *(sizeof(size_t))) - 1)) & ~((2 *(sizeof(size_t)) < __alignof__ (long double) ? __alignof__ (long double) : 2 *(sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
srun: error: nid00012: tasks 0,3: Segmentation fault
srun: Terminating job step 11643442.1635
srun: error: nid00012: task 1: Aborted
slurmstepd: error: *** STEP 11643442.1635 ON nid00012 CANCELLED AT 2018-09-22T07:50:20 ***
srun: error: nid00012: task 2: Aborted (core dumped)
```
New commits for this build can be seen [here](https://testing.sandia.gov/cdash-dev-view/viewNotes.php?buildid=3963370#!#note6)
## Current Status on CDash
See:
* [Non passing Anasazi tests in 'mutrino' builds last 2 days](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=testname&compare1=65&value1=Anasazi_&field2=buildname&compare2=65&value2=Trilinos-atdm-mutrino-&field3=groupname&compare3=62&value3=Experimental&field4=status&compare4=62&value4=passed&field5=buildstarttime&compare5=83&value5=2%20days%20ago)
## Steps to Reproduce
One should be able to reproduce this failure on the machine mutrino as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for the system mutrino are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#mutrino
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-openmp-HSW
$ cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Anasazi=ON \
$TRILINOS_DIR
$ make -j16
$ salloc -N 1 -p standard -J $JOB_NAME ctest -j16
```
Keep promoted "ATDM" builds of Trilinos clean

# MueLu: Static library is HUGE; splitting by GlobalOrdinal etc. won't help
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3137 (2019-05-03, James Willenbring)
*Created by: mhoemmen*
@trilinos/muelu @micahahoward @tcfisher
SPARC wasn't actually able to build MueLu before on Intel 17, static debug, because the library was too huge for the linker. PR #3100 fixes this for ATDM Dashboard builds by using a new BinUtils module, and thus a new linker. I fixed this for SPARC by setting `Tpetra_INST_INT_INT=OFF` and `Amesos2_ENABLE_Epetra=OFF`.
While I'm able to build Trilinos now, `libmuelu.a` is still 3.8G. This is with only one Scalar type (`double`), one GlobalOrdinal type (`long long`), and one Node type (`OpenMP`) enabled. That suggests that splitting MueLu's library by GlobalOrdinal and/or Node won't actually help shrink the library. Instead, if we want to split it, we'll need to split it by topic.
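Before deciding where to split, it may help to see which members dominate the archive. A sketch that ranks `ar -tv` output by member size (the helper name is ours, and the size being the third whitespace-separated column is a platform assumption worth checking):

```shell
# Hypothetical helper: read `ar -tv` output on stdin and print the N
# largest archive members (size assumed to be the third field).
largest_members() {
  sort -k3,3 -n -r | head -n "${1:-10}"
}

# Example:
# ar -tv libmuelu.a | largest_members 20
```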
## Expectations
Libraries shouldn't be so huge that they require 64-bit linkers.
## Current Behavior
`libmuelu.a` is 3.8G when I have only one Scalar, GlobalOrdinal, and Node type combination enabled.
## Possible Solution
Split `libmuelu` by topic, e.g., smoothers, aggregation, etc.
## Steps to Reproduce
Intel 17 static debug build, ATDM libraries.
## Related Issues
* Follows #3069
# Belos_rcg_hb_MPI_4 timing out in several ATDM Trilinos builds on 'hansen' since 5/29/2018
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2919 (2018-11-30, James Willenbring)
*Created by: bartlettroscoe*
CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solves Project Lead)
## Next Action Status
Test was disabled in these builds on 'hansen' in the commit 8850c64 pushed on 6/12/2018 and was shown to be disabled in the builds on CDash 6/13/2018
## Description
As shown in [this large query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=19&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=62&value2=Trilinos-atdm-mutrino-intel-debug-openmp&field3=buildname&compare3=62&value3=Trilinos-atdm-mutrino-intel-opt-openmp&field4=buildname&compare4=62&value4=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field5=buildname&compare5=62&value5=Trilinos-atdm-serrano-intel-debug-openmp&field6=buildname&compare6=62&value6=Trilinos-atdm-serrano-intel-opt-openmp&field7=buildname&compare7=62&value7=Trilinos-atdm-chama-intel-opt-openmp&field8=buildname&compare8=62&value8=Trilinos-atdm-chama-intel-debug-openmp-panzer&field9=buildname&compare9=62&value9=Trilinos-atdm-chama-intel-debug-openmp&field10=buildname&compare10=62&value10=Trilinos-atdm-chama-intel-opt-openmp-panzer&field11=site&compare11=62&value11=ride&field12=testname&compare12=61&value12=Belos_rcg_hb_MPI_4&field13=buildstarttime&compare13=84&value13=2018-06-08&field14=buildstarttime&compare14=83&value14=2018-05-10&field15=buildname&compare15=62&value15=Trilinos-atdm-white-ride-cuda-opt&field16=buildname&compare16=62&value16=Trilinos-atdm-white-ride-gnu-opt-openmp&field17=site&compare17=62&value17=serrano&field18=site&compare18=62&value18=shiller&field19=buildname&compare19=62&value19=Trilinos-atdm-white-ride-cuda-debug-all-at-once) the test `Belos_rcg_hb_MPI_4` looks to be consistently timing out in the builds:
* Trilinos-atdm-hansen-shiller-cuda-8.0-debug
* Trilinos-atdm-hansen-shiller-cuda-8.0-opt
* Trilinos-atdm-hansen-shiller-cuda-9.0-debug
* Trilinos-atdm-hansen-shiller-cuda-9.0-opt
* Trilinos-atdm-hansen-shiller-gnu-debug-serial
* Trilinos-atdm-hansen-shiller-gnu-opt-serial
all on 'hansen' starting on 5/29/2018 or 5/30/2018. (Since these builds are pulling directly from the 'develop' branch, they may be testing different versions on the same day, and this is UTC time, so they may be on the same testing day in Mountain time.)
That same query shows that that test has been consistently passing in every other promoted build on every other ATDM Trilinos testing machine.
What that query also shows is that in those same builds that are now timing out, the test was taking upwards of 6+ minutes to complete before it started timing out at 10 minutes on 5/29/2018 or 5/30/2018, as shown in the last non-timing-out builds:
* Trilinos-atdm-hansen-shiller-cuda-8.0-debug: 6m 26s 280ms
* Trilinos-atdm-hansen-shiller-cuda-8.0-opt: 6m 25s 680ms
* Trilinos-atdm-hansen-shiller-cuda-9.0-debug: 6m 22s 810ms
* Trilinos-atdm-hansen-shiller-cuda-9.0-opt: 6m 22s 440ms
* Trilinos-atdm-hansen-shiller-gnu-debug-serial: 6m 13s 150ms
* Trilinos-atdm-hansen-shiller-gnu-opt-serial: 5m 58s 960ms
But the other builds that are not showing any timeouts, that test completes very fast (in under 30 seconds in about every case). Some of the recent test times shown in that query for the various builds that don't have timeouts now are:
* Trilinos-atdm-hansen-shiller-gnu-debug-openmp: 23s 850ms
* Trilinos-atdm-hansen-shiller-gnu-opt-openmp: 8s 650ms
* Trilinos-atdm-hansen-shiller-intel-debug-openmp: 7s 720ms
* Trilinos-atdm-hansen-shiller-intel-debug-serial: 7s 950ms
* Trilinos-atdm-hansen-shiller-intel-opt-openmp: 6s 150ms
* Trilinos-atdm-hansen-shiller-intel-opt-serial: 5s 910ms
* Trilinos-atdm-rhel6-gnu-debug-openmp: 6s 840ms
* Trilinos-atdm-rhel6-gnu-debug-serial: 5s 340ms
* Trilinos-atdm-rhel6-gnu-opt-openmp: 5s 180ms
* Trilinos-atdm-rhel6-gnu-opt-serial: 4s 250ms
* Trilinos-atdm-rhel6-intel-opt-openmp: 3s 740ms
* Trilinos-atdm-sems-gcc-7-2-0: 5s 290ms
* Trilinos-atdm-white-ride-cuda-debug: 9s 430ms
* Trilinos-atdm-white-ride-gnu-debug-openmp: 9s 90ms
So this seems pretty crazy. How can the same test take over 6 minutes to complete for a CUDA 8.0 and 9.0 optimized build on 'hansen' and only take 9 seconds for a CUDA debug build on 'white'? And this test takes a very long time (and is now timing out) for the `gnu-debug-serial` and `gnu-opt-serial` builds as well on 'hansen' but is fast for the `intel-debug-serial` and `intel-opt-serial` builds on the same machine. How can that be the case?
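To compare these wall times across builds numerically, the quoted `XmYsZms` durations can be normalized to seconds. A throwaway sketch (the helper name is ours) assuming exactly that token format:

```shell
# Hypothetical helper: convert a CDash duration like "6m 13s 150ms"
# (the format quoted above) into fractional seconds.
to_seconds() {
  local total=0 tok
  for tok in $1; do
    case $tok in
      *ms) total=$(awk -v t="$total" -v v="${tok%ms}" 'BEGIN { printf "%.3f", t + v / 1000 }') ;;
      *m)  total=$(awk -v t="$total" -v v="${tok%m}"  'BEGIN { printf "%.3f", t + v * 60 }') ;;
      *s)  total=$(awk -v t="$total" -v v="${tok%s}"  'BEGIN { printf "%.3f", t + v }') ;;
    esac
  done
  echo "$total"
}

# Example:
# to_seconds "6m 13s 150ms"   # 373.150
```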
To try to get more insight about this test we can look at the test output for a case where it takes a long time to run (and is timing out currently) and compare that to the test output for a case that completes very quickly.
First, let's look at the last time this test passed for the `Trilinos-atdm-hansen-shiller-gnu-debug-serial` build on 'hansen', which took 6m 13s 150ms to complete and pass on 2018-05-29T06:41:19 UTC, with output shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=47454651&build=3555977
which shows:
```
Passed.......OR Combination ->
OK...........Number of Iterations = 2206 < 4000
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.56537e-07 < 1e-06
residual [ 1 ] = 9.4486e-07 < 1e-06
residual [ 2 ] = 9.24543e-07 < 1e-06
residual [ 3 ] = 9.44363e-07 < 1e-06
residual [ 4 ] = 9.64382e-07 < 1e-06
residual [ 5 ] = 9.14533e-07 < 1e-06
residual [ 6 ] = 9.50517e-07 < 1e-06
residual [ 7 ] = 8.31671e-07 < 1e-06
residual [ 8 ] = 9.59686e-07 < 1e-06
residual [ 9 ] = 9.74218e-07 < 1e-06
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 1.489 (2.114e+04) 1.582 (2.114e+04) 1.668 (2.114e+04) 7.483e-05 (2.114e+04)
Belos: Operation Prec*x 0 (0) 0 (0) 0 (0) 0 (0)
Belos: RCGSolMgr total solve time 365.4 (1) 365.4 (1) 365.4 (1) 365.4 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 1.45 (2.114e+04) 1.542 (2.114e+04) 1.629 (2.114e+04) 7.295e-05 (2.114e+04)
==================================================================================================================================
```
And let's compare this to the test output for the build `Trilinos-atdm-hansen-shiller-intel-debug-serial` on 'hansen' which took 6s 740ms to complete and pass on 2018-05-29T14:52:35 UTC shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=47482010&build=3557186
which shows:
```
Passed.......OR Combination ->
OK...........Number of Iterations = 2131 < 4000
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.5909e-07 < 1e-06
residual [ 1 ] = 9.65321e-07 < 1e-06
residual [ 2 ] = 8.59334e-07 < 1e-06
residual [ 3 ] = 9.55053e-07 < 1e-06
residual [ 4 ] = 9.97094e-07 < 1e-06
residual [ 5 ] = 7.53902e-07 < 1e-06
residual [ 6 ] = 8.46489e-07 < 1e-06
residual [ 7 ] = 9.64082e-07 < 1e-06
residual [ 8 ] = 9.92318e-07 < 1e-06
residual [ 9 ] = 9.92263e-07 < 1e-06
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 2.026 (2.109e+04) 2.179 (2.109e+04) 2.403 (2.109e+04) 0.0001033 (2.109e+04)
Belos: Operation Prec*x 0 (0) 0 (0) 0 (0) 0 (0)
Belos: RCGSolMgr total solve time 5.945 (1) 5.946 (1) 5.946 (1) 5.946 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 1.975 (2.109e+04) 2.116 (2.109e+04) 2.316 (2.109e+04) 0.0001003 (2.109e+04)
==================================================================================================================================
```
The times for the individual operations are not that different, but "Belos: RCGSolMgr total solve time" at 365.4 vs. 5.946 seconds is the real problem. The final results show that the test is doing slightly different computations in these two builds, but the total number of operations is not radically different (e.g., 2.114e+04 vs. 2.109e+04 mat-vecs). So what is going on here to cause the huge increase in wall-clock time for a serial Kokkos threading test?
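To make that comparison concrete, here is a quick back-of-the-envelope calculation using the `MeanOverProcs` numbers from the two TimeMonitor tables quoted above. It shows that nearly all of the GNU build's solve time falls outside the timed mat-vec operations, so the slowdown must live in code that currently has no timer around it:

```python
# Numbers copied from the two CDash test outputs quoted above
# (MeanOverProcs column): total solve time, total Op*x time, Op*x call count.
gnu   = {"solve_time": 365.4, "matvec_time": 1.582, "matvecs": 2.114e4}
intel = {"solve_time": 5.946, "matvec_time": 2.179, "matvecs": 2.109e4}

for name, b in (("gnu-debug-serial", gnu), ("intel-debug-serial", intel)):
    # Time inside the solve that the existing mat-vec timer does not cover
    unaccounted = b["solve_time"] - b["matvec_time"]
    per_matvec = b["matvec_time"] / b["matvecs"]
    print(f"{name}: {unaccounted:.1f}s of {b['solve_time']}s solve time "
          f"is outside Op*x ({per_matvec * 1e6:.0f} us per mat-vec)")
```

The GNU build's per-mat-vec cost is actually lower than Intel's; roughly 360 of its 365 solve seconds are unaccounted for by the existing timers.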
Looking at the new commits pulled in when this started to fail for the build `Trilinos-atdm-hansen-shiller-gnu-opt-serial` on 2018-05-29 14:05:09 shown at:
* https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3560199#!#note0
it is hard to tell what might have caused these tests to start timing out. I would guess that the most likely trigger was:
```
c840658: Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:12:42 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/shiller/environment.sh
```
That will increase the number of tests running on the machine and could result in single tests taking longer to run.
But the fact that the same test takes 6 minutes with GCC but only 7 seconds with Intel is, in my opinion, a major problem and has to be investigated.
Someone is going to need to add some more timers to account for where the time is going.
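In Trilinos itself that instrumentation would be added with `Teuchos::TimeMonitor` in the C++ solver code; as a self-contained sketch of the scoped-timer pattern involved (the phase names below are made up for illustration, not actual Belos timer names):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per named section, analogous to the
# TimeMonitor summary table in the test output above.
totals = defaultdict(float)

@contextmanager
def scoped_timer(name):
    """Add the wall-clock time spent inside the `with` block to totals[name]."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        totals[name] += time.perf_counter() - t0

# Bracketing each suspect phase of the solve separately would reveal
# which one absorbs the unaccounted-for time (hypothetical phase names):
with scoped_timer("RCG: orthogonalization"):
    time.sleep(0.01)  # stand-in for real work
with scoped_timer("RCG: recycle-space update"):
    time.sleep(0.02)  # stand-in for real work

for name, secs in sorted(totals.items()):
    print(f"{name}: {secs:.3f} s")
```

The timers nest and accumulate across calls, so once each phase is bracketed, the summary immediately shows where the missing minutes go.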
## Steps to reproduce
One should be able to follow the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
to reproduce this behavior on 'hansen' or 'shiller'. To avoid needing to run on a compute node, one could use the `gnu-debug-serial` build and do:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-serial
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -VV -R Belos_rcg_hb_MPI_4
```
Keep promoted "ATDM" builds of Trilinos clean

---

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2925

# Test Stratimikos_test_aztecoo_thyra_driver_MPI_1 timing out in Trilinos-atdm-hansen-shiller-gnu-debug-serial build since 5/30/2018

2018-11-30T11:16:53Z, James Willenbring

*Created by: bartlettroscoe*
CC: @trilinos/stratimikos, @fryeguy52
## Next Action Status
Test was disabled for these two builds on 'hansen' in commit 73ae19c pushed on 6/12/2018 and this test disappeared in these builds on 6/13/2018.
## Description
As shown in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-11&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=11&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=status&compare2=62&value2=passed&field3=status&compare3=62&value3=notrun&field4=buildname&compare4=62&value4=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field5=site&compare5=62&value5=mutrino&field6=site&compare6=62&value6=serrano&field7=site&compare7=62&value7=chama&field8=site&compare8=62&value8=ride&field9=buildstarttime&compare9=84&value9=2018-06-11&field10=buildstarttime&compare10=83&value10=2018-05-20&field11=testname&compare11=65&value11=Stratimikos), the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` has been timing out in the builds `Trilinos-atdm-hansen-shiller-gnu-debug-serial` and `Trilinos-atdm-hansen-shiller-gnu-opt-serial` since 5/30/2018. (That query also shows this is the only Stratimikos test that has failed in any of the promoted "ATDM" builds since 5/20/2018.)
[This query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-11&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=61&value2=Trilinos-atdm-hansen-shiller-gnu-debug-serial&field3=buildstarttime&compare3=84&value3=2018-06-11&field4=buildstarttime&compare4=83&value4=2018-05-20&field5=testname&compare5=61&value5=Stratimikos_test_aztecoo_thyra_driver_MPI_1) shows that the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` went from passing at under 21s every day to timing out at 10 minutes every day since 5/29/2018 (but it did pass once taking 9m 56s 930ms on 6/8/2018, the only time it did not time-out since 5/29/2018).
What changed from 5/29/2018 to 5/30/2018? Looking at the updates pulled in for the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial` with build stamp `20180530-0400-ATDM` shown at:
* https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3558860#!#note0
it seems like the only commits that could have impacted this were:
```
c9ccf7d: Switch from srun to salloc on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:35:16 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/README.md
c840658: Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:12:42 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/shiller/environment.sh
```
There are no other commits that I could see that could impact this AztecOO test. So it looks like moving to CMake/CTest 3.11.2 and to the all-at-once approach triggered this large increase in runtime for the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` for the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial`. This may have been a result of having more tests running while this Stratimikos test is running.
Looking at [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-10&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=61&value2=Stratimikos_test_aztecoo_thyra_driver_MPI_1), we can see that the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` timed out in the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial` yesterday (6/10/2018) but took upwards of 2.5 to 3.5 minutes to run in the CUDA builds. Otherwise, this test did not take any longer than 22s to run in any of the other ATDM builds of Trilinos. What is also interesting is that the query shows this test passed in 4s 460ms for the build `Trilinos-atdm-hansen-shiller-intel-debug-serial`, also run on 'hansen'. How can the same test pass on an `intel-debug-serial` build in under 5 seconds but time out at 10 minutes for a `gnu-debug-serial` build on the same hardware with the same MPI implementation and settings?
For that matter, [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-10&filtercombine=and&filtercount=1&showfilters=1&field1=testname&compare1=61&value1=Stratimikos_test_aztecoo_thyra_driver_MPI_1) shows that, other than the CUDA builds of Trilinos and the yet-to-be-cleaned-up 'mutrino' build `Trilinos-atdm-mutrino-intel-debug-openmp`, this test did not take any longer than 22s to run in any of the 46 Trilinos builds where it ran yesterday. On some platforms, this test completed in less than 2s!
This is very strange behavior for a test. There must be some type of machine or system usage issue going on here. But why would it impact a `gnu-debug-serial` build but not an `intel-debug-serial` build on the same machine?
## Steps to reproduce
Following the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#shillerhansen
one can log on to 'hansen' or 'shiller', clone Trilinos and get on to the 'develop' branch, and then do:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-openmp
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stratimikos=ON \
$TRILINOS_DIR
$ make NP=16
$ salloc ctest -j16
```
I did this on 'shiller' but unfortunately all of the Stratimikos tests passed:
```
100% tests passed, 0 tests failed out of 40
Subproject Time Summary:
Stratimikos = 256.50 sec*proc (40 tests)
Total Test time (real) = 20.84 sec
```
Therefore, I was not able to reproduce this behavior on 'shiller'; this must be some type of system issue.
Keep promoted "ATDM" builds of Trilinos clean

---

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2712

# Test Teko_testdriver_tpetra_MPI_1 is failing in new GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build

2018-11-30T11:16:53Z, James Willenbring

*Created by: bartlettroscoe*
**CC:** @trilinos/teko, @fryeguy52
## Next Action Status
PR #2715 was merged on 5/10/2018. The test `Teko_testdriver_tpetra_MPI_1` passed in build `GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP` on 5/11/2018. Next: Find someone to debug and fix the test.
## Description
The test `Teko_testdriver_tpetra_MPI_1` is failing in the new build `GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP` as shown at:
* https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP&field2=testname&compare2=61&value2=Teko_testdriver_tpetra_MPI_1&field3=buildstarttime&compare3=84&value3=now
The most recent run of this test shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=46422819&build=3496203
shows the failure output:
```
...
Teko: LSCPrecFact::buildPO BuildStateTime = 0.04688
Teko: LSCPrecFact::buildPO GetInvTime = 2.86102e-06
Teko: LSCPrecFact::buildPO TotalTime = 0.0475988
"strategy" ... FAILED ( PID = 0 )
Test "LSCStabilized_tpetra" completed ... FAILED (1)
...
Teko: Building LSC strategy "The Cat"
LSC Construction failed: Strategy "The Cat" could not be constructed
Teko: Begin debug MSG
Looked up "NS LSC"
Built Teuchos::RCP<Teko::PreconditionerFactory>{ptr=0x1d14a48,node=0x1e48230,strong_count=1,weak_count=0}
Teko: End debug MSG
LSC Construction failed: Strategy "The Cat" requires a "Strategy Settings" sublist
...
Tests Passed: 90, Tests Failed: 1
(Incidently, you want no failures)
Teko tests failed
```
**NOTE:** This build satisfies all of the requirements for the GCC 4.8.4 build described in #2317 and #2462. Other than the ShyLU_DD failures addressed in #2691, this failing test is the only test blocking this build from being 100% passing.
## Steps to Reproduce
One should be able to reproduce this failing test on any SNL COE RHEL6 machine that mounts the SEMS env. After cloning the Trilinos git repo and checking out the `develop` branch, one should be able to do:
```
$ cd <some-build-dir>/
$ source <trilinos-dir>/cmake/std/GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP_env.sh
$ cmake \
-C <trilinos-dir>/cmake/std/GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Teko=ON \
<trilinos-dir>
$ make -j16
$ ctest -VV -R Teko_testdriver_tpetra_MPI_1
```
## Related Issues
* Blocking Issues: #2462