Trilinos issues — https://gitlab.osti.gov/jmwille/Trilinos/-/issues

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3090 — ATDM Trilinos Intel builds failing on 7/10/2018 due to license server problems (James Willenbring, last updated 2018-11-30T03:22:19Z)
*Created by: bartlettroscoe*
CC: @fryeguy52
## Next Action Status
There was a planned and advertised offlining and update of the Intel license server machine on 7/10/2018 from 5:30-8:00 AM MT so Intel build failures were expected.
## Description
As shown in [this query](https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&date=2018-07-10&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=63&value1=-atdm-&field2=buildname&compare2=63&value2=-intel-), many of the ATDM Trilinos Intel builds failed on several machines. We did not look at every failure, but they appear to stem from the Intel compiler failing to communicate with the license server. The Intel license server was scheduled to be offline for an update today from 5:30-8:00 AM MT, so these failures were expected.
What is interesting is that this impacted the Intel builds on every platform that has them, including the Test Bed machine 'hansen' and the HPC machines 'chama', 'mutrino', and 'serrano'. The build 'Trilinos-atdm-rhel6-intel-opt-openmp' on the SEMS Jenkins test machine 'sems-rhel6' was not impacted, but it may simply have completed before the license server went offline. This shows how many different machines share the same Intel license server.
Keep promoted "ATDM" builds of Trilinos clean

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3072 — ROL build failure for CUDA 8.0 Debug build on white/ride (James Willenbring, last updated 2018-11-30T03:15:56Z)
*Created by: bartlettroscoe*
CC: @trilinos/rol, @rppawlo (Trilinos Nonlinear Solvers Product Lead)
## Next Action Status
With the merge of PR #3081 on 7/9/2018, the ROL build failure was gone on CDash on 7/1/2018.
## Description
The compilation of the file `packages/rol/example/burgers-control/example_01.cpp` fails in the CUDA 8.0 debug build `Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once` on 'white' and 'ride'. The build error output for this morning's build on 'white', shown [here](https://testing-vm.sandia.gov/cdash/viewBuildError.php?buildid=3692122), is:
```
...
/home/rabartl/WHITE/ATDM_Driver/Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once/SRC_AND_BUILD/Trilinos/packages/rol/example/burgers-control/example_01.hpp(522): error: the type in a dynamic_cast must be a pointer or reference to a complete class type, or void *
...
1 error detected in the compilation of "/tmp/tmpxft_00001203_00000000-5_example_01.cpp4.ii".
```
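For context, here is a minimal sketch (hypothetical types, not the actual ROL code) of what triggers this nvcc diagnostic: `dynamic_cast` requires the complete definition of the target class to be visible at the cast site, so a type that is only forward-declared is rejected with exactly this error.

```cpp
// Hypothetical types for illustration; the real cast is in example_01.hpp.
struct Base { virtual ~Base() = default; };
struct Derived : Base { int tag() const { return 42; } };

// With only `struct Derived;` (an incomplete type) in scope, the cast below
// fails to compile with "the type in a dynamic_cast must be a pointer or
// reference to a complete class type". With the full definition it is fine.
int tag_if_derived(Base& b) {
  if (auto* d = dynamic_cast<Derived*>(&b)) return d->tag();
  return -1;
}
```

The usual fix is to make sure the header that fully defines the target type is included before the cast, rather than relying on a forward declaration.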
## Steps to reproduce
This build error can be reproduced on 'white' or 'ride' as described in the document:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
The specific instructions for 'white' or 'ride' are given at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite
The one difference is that this build covers all of the Primary Tested Trilinos packages (which include more packages than the ATDM APPs currently use), does not exclude any Trilinos packages, and tweaks a few other settings, so it uses the file `Trilinos/cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake` instead of the file `ATDMDevEnv.cmake`.
After cloning Trilinos, the following commands should reproduce the build failure:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_ROL=ON \
$TRILINOS_DIR
$ make NP=16
```
I (@bartlettroscoe) just tried this on 'white' and was able to reproduce the same build failure shown on CDash above.
Initial cleanup of new ATDM builds of Trilinos

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3069 — Stokhos build failure for CUDA 8.0 Debug build on white/ride (James Willenbring, last updated 2018-11-30T03:20:52Z)
*Created by: bartlettroscoe*
CC: @trilinos/stokhos, @rppawlo (Trilinos Nonlinear Solvers Product Lead)
## Next Action Status
PR #3100 merged on 7/13/2018 resulted in 100% clean build (but not tests), including Stokhos, on 7/14/2018.
## Description
The creation of the Stokhos library `libstokhos_muelu.a` fails in the CUDA 8.0 debug build `Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once` on 'white' and 'ride'. The build error output for this morning's build on 'white', shown [here](https://testing-vm.sandia.gov/cdash/viewBuildError.php?buildid=3688768), is:
```
/usr/bin/ar: packages/stokhos/src/libstokhos_muelu.a: File truncated
```
## Steps to reproduce
This build error can be reproduced on 'white' or 'ride' as described in the document:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
The specific instructions for 'white' or 'ride' are given at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite
The one difference is that this build covers all of the Primary Tested Trilinos packages (which include more packages than the ATDM APPs currently use), does not exclude any Trilinos packages, and tweaks a few other settings, so it uses the file `Trilinos/cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake` instead of the file `ATDMDevEnv.cmake`.
After cloning Trilinos, the following commands should reproduce the build failure:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stokhos=ON \
$TRILINOS_DIR
$ make NP=16
```
I (@bartlettroscoe) just tried this on 'white' and was able to reproduce the same build failure shown on CDash above.
Initial cleanup of new ATDM builds of Trilinos

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3065 — Zoltan build failure for CUDA 8.0 Debug build on white/ride (James Willenbring, last updated 2018-11-30T03:19:09Z)
*Created by: bartlettroscoe*
CC: @trilinos/zoltan, @kddevin (Trilinos Data Services Product Lead)
## Next Action Status
Merge of PR #3066 on 7/6/2018 resulted in no Zoltan build errors and all 34 passing Zoltan tests on 7/7/2018.
## Description
The Zoltan executable `zCPPdrive` fails to link in the CUDA 8.0 debug build `Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once` run on 'white' and 'ride'. The link error output for this morning's build on 'white', shown [here](https://testing-vm.sandia.gov/cdash/viewBuildError.php?buildid=3688780), is:
```
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_input.c.o:(.toc+0x0): undefined reference to `Output'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_input.c.o:(.toc+0x8): undefined reference to `Output'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_input.c.o:(.toc+0x10): undefined reference to `Output'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_input.c.o:(.toc+0x18): undefined reference to `Output'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_input.c.o:(.toc+0x20): undefined reference to `Output'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_input.c.o:(.toc+0x28): undefined reference to `Test'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_input.c.o:(.toc+0x30): undefined reference to `Test'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_input.c.o:(.toc+0x38): undefined reference to `Test'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_input.c.o:(.toc+0x40): undefined reference to `Test'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_input.c.o:(.toc+0x48): undefined reference to `Test'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_input.c.o:(.toc+0x50): more undefined references to `Test' follow
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_output.c.o:(.toc+0x0): undefined reference to `Output'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_gnuplot.c.o:(.toc+0x0): undefined reference to `Output'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_mainCPP.cpp.o:(.toc+0x8): undefined reference to `Test'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_mainCPP.cpp.o:(.toc+0x10): undefined reference to `Output'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_loadbalCPP.cpp.o:(.toc+0x8): undefined reference to `Test'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_migrateCPP.cpp.o:(.toc+0x0): undefined reference to `Test'
packages/zoltan/src/driver/CMakeFiles/zCPPdrive.dir/dr_mapsCPP.cpp.o:(.toc+0x0): undefined reference to `Test'
collect2: error: ld returned 1 exit status
```
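For context, here is a minimal sketch (hypothetical names, not the actual Zoltan driver sources) of the usual cause of this kind of link error: a global declared `extern` in a header but never defined in any object file that makes it into the link.

```cpp
// Hypothetical illustration of the "undefined reference" pattern above.
// A header-style declaration promises that some translation unit defines
// the symbol:
extern int Test;

// If no compiled-and-linked source file contains a definition like the one
// below, the link fails with "undefined reference to `Test'":
int Test = 0;

int read_test() { return Test; }
```

In a mixed C/C++ driver like `zCPPdrive` (`dr_input.c` alongside `dr_mainCPP.cpp`), mismatched linkage can produce the same error even when a definition exists: a symbol defined with C linkage but referenced from C++ without `extern "C"` (or vice versa) will not resolve at link time.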
## Steps to reproduce
This build error can be reproduced on 'white' or 'ride' as described in the document:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
The specific instructions for 'white' or 'ride' are given at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite
The one difference is that this build covers all of the Primary Tested Trilinos packages (which include more packages than the ATDM APPs currently use) and does not exclude any Trilinos packages, so it uses the file `Trilinos/cmake/std/atdm/ATDMDevEnvSettings.cmake` instead of the file `ATDMDevEnv.cmake`.
After cloning Trilinos, the following commands should reproduce the build failure:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Zoltan=ON \
$TRILINOS_DIR
$ make NP=16
```
I (@bartlettroscoe) just tried this on 'white' and was able to reproduce the same build failure shown on CDash above.
Initial cleanup of new ATDM builds of Trilinos

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3064 — PanzerAdaptersSTK_response_residual_MPI_2 Failing on ATDM Chama build (James Willenbring, last updated 2018-11-30T03:15:27Z)
*Created by: fryeguy52*
CC: @trilinos/panzer, @mperego (Trilinos Discretizations Product Area Lead)
## Next Action Status
PR #3082, which increases the tolerance to `10*eps`, was merged 7/9/2018, and the test `PanzerAdaptersSTK_response_residual_MPI_2` has been passing in the 'chama' builds `Trilinos-atdm-chama-intel-debug-openmp` and `Trilinos-atdm-chama-intel-opt-openmp` every day from 7/11/2018 through 7/23/2018.
## Description
The test `PanzerAdaptersSTK_response_residual_MPI_2` has been failing in ATDM builds on `chama`. It has been failing for the last three days and intermittently before that.
[query showing test's results from 6-15 to 7-4](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-07-01&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-chama&field2=groupname&compare2=61&value2=ATDM&field3=testname&compare3=61&value3=PanzerAdaptersSTK_response_residual_MPI_2&field4=buildstarttime&compare4=84&value4=2018-07-04&field5=buildstarttime&compare5=83&value5=2018-06-15)
Both `Trilinos-atdm-chama-intel-debug-openmp` and `Trilinos-atdm-chama-intel-opt-openmp` are affected.
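The `10*eps` tolerance change mentioned in the next-action status can be sketched as follows (a hypothetical helper, not the actual Panzer code): scaling the comparison tolerance by machine epsilon lets last-digit differences between compilers and optimization levels pass while still catching real errors.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical relative comparison with a tolerance of scale*eps;
// the Panzer fix raised the scale factor to 10.
bool nearly_equal(double a, double b, double scale = 10.0) {
  const double eps = std::numeric_limits<double>::epsilon();
  const double mag = std::max(std::fabs(a), std::fabs(b));
  return std::fabs(a - b) <= scale * eps * mag;
}
```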
## Steps to Reproduce
On Chama, clone Trilinos and then run the following, which also takes care of setting up the environment:
```
export JOB_NAME=Trilinos-atdm-chama-intel-debug-openmp
export TRILINOS_DIR=<Where ever you cloned trilinos>
source $TRILINOS_DIR/cmake/std/atdm/load-env.sh $JOB_NAME
cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON \
-DTrilinos_ENABLE_Thyra=ON \
$TRILINOS_DIR
make -j16
ctest -j16
```
Keep promoted "ATDM" builds of Trilinos clean

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3035 — MueLu_UnitTestsTpetra_MPI_4 fails in all OpenMP builds (James Willenbring, last updated 2018-11-30T03:12:09Z)
*Created by: prwolfe*
See #3003 and any OpenMP build:
https://testing-vm.sandia.gov/cdash/testSummary.php?project=1&name=MueLu_UnitTestsTpetra_MPI_4&date=2018-06-28
@mhoemmen thinks this is related to some recent repartitioning work.
@trilinos/muelu
## Expectations
I will turn this test off for the GCC 4.8.4 PR build; we will need to re-enable it when this is fixed.
## Related Issues
* Blocks #3034
* Related to #3003
Keep promoted "ATDM" builds of Trilinos clean

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3007 — Tests Belos_pseudo_pcg_hb_[0,1]_MPI_4 seemingly fail with max iterations randomly in some builds on white/ride (James Willenbring, last updated 2018-12-03T20:38:52Z)
*Created by: bartlettroscoe*
CC: @trilinos/belos, @fryeguy52, @srajama1 (Trilinos Linear Solvers Product Lead)
## Next Action Status
PR #3546 merged on 10/2/2018 re-enabled these tests on 'white' and 'ride'. No new failures as of 12/3/2018.
## Description
As shown in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-21&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=10&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=status&compare2=62&value2=passed&field3=status&compare3=62&value3=notrun&field4=buildname&compare4=62&value4=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field5=testname&compare5=65&value5=Belos_pseudo_pcg_hb_&field6=buildstarttime&compare6=84&value6=now&field7=buildstarttime&compare7=83&value7=16%20weeks%20ago&field8=site&compare8=62&value8=serrano&field9=site&compare9=62&value9=chama&field10=site&compare10=62&value10=mutrino), the tests `Belos_pseudo_pcg_hb_0_MPI_4` and `Belos_pseudo_pcg_hb_1_MPI_4` appear to be failing randomly (24 times in total) in the following builds on 'white' and 'ride' over the last 16 weeks:
* Trilinos-atdm-white-ride-cuda-debug
* Trilinos-atdm-white-ride-cuda-debug-all-at-once
* Trilinos-atdm-white-ride-cuda-opt
* Trilinos-atdm-white-ride-gnu-debug-openmp
* Trilinos-atdm-white-ride-gnu-opt-openmp
The failure output for the most recent failure of the test `Belos_pseudo_pcg_hb_1_MPI_4` in the build `Trilinos-atdm-white-ride-cuda-debug` on 'ride' just today, shown [here](https://testing-vm.sandia.gov/cdash/testDetails.php?test=49037326&build=3642012), is:
```
...
Belos::StatusTestGeneralOutput: Passed
(Num calls,Mod test,State test): (104, 1, Passed)
Passed.......OR Combination ->
Failed.......Number of Iterations = 100 == 100
Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 8.95881e-09 < 1e-08
residual [ 1 ] = 1.21989e-08 > 1e-08
residual [ 2 ] = 6.84374e-09 < 1e-08
residual [ 3 ] = 9.15804e-09 < 1e-08
residual [ 4 ] = 7.2567e-09 < 1e-08
Passed.......OR Combination ->
Failed.......Number of Iterations = 100 == 100
Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 8.95881e-09 < 1e-08
residual [ 1 ] = 1.21989e-08 > 1e-08
residual [ 2 ] = 6.84374e-09 < 1e-08
residual [ 3 ] = 9.15804e-09 < 1e-08
residual [ 4 ] = 7.2567e-09 < 1e-08
==============================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 0.06529 (101) 0.06993 (101) 0.07256 (101) 0.0006924 (101)
Belos: Operation Prec*x 0.09817 (104) 0.1055 (104) 0.1237 (104) 0.001015 (104)
Belos: PseudoBlockCGSolMgr total solve time 0.2098 (1) 0.2099 (1) 0.21 (1) 0.2099 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 0.06611 (102) 0.07081 (102) 0.0734 (102) 0.0006942 (102)
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y) 0.09769 (210) 0.1051 (210) 0.1233 (210) 0.0005004 (210)
==============================================================================================================================
---------- Actual Residuals (normalized) ----------
Problem 0 : 8.95881e-09
Problem 1 : 1.21989e-08
Problem 2 : 6.84374e-09
Problem 3 : 9.15804e-09
Problem 4 : 7.2567e-09
End Result: TEST FAILED
```
See, it maxed out the number of iterations:
```
Failed.......Number of Iterations = 100 == 100
```
The previous day, this same test in this same build on 'ride' passed, as shown [here]():
```
Belos::StatusTestGeneralOutput: Passed
(Num calls,Mod test,State test): (89, 1, Passed)
Passed.......OR Combination ->
OK...........Number of Iterations = 87 < 100
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 5.02551e-09 < 1e-08
residual [ 1 ] = 5.92159e-09 < 1e-08
residual [ 2 ] = 6.61897e-09 < 1e-08
residual [ 3 ] = 8.2598e-09 < 1e-08
residual [ 4 ] = 3.67011e-09 < 1e-08
Passed.......OR Combination ->
OK...........Number of Iterations = 87 < 100
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 5.02551e-09 < 1e-08
residual [ 1 ] = 5.92159e-09 < 1e-08
residual [ 2 ] = 6.61897e-09 < 1e-08
residual [ 3 ] = 8.2598e-09 < 1e-08
residual [ 4 ] = 3.67011e-09 < 1e-08
=============================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
-----------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 0.06567 (88) 0.07594 (88) 0.08703 (88) 0.0008629 (88)
Belos: Operation Prec*x 0.09592 (89) 0.1201 (89) 0.15 (89) 0.00135 (89)
Belos: PseudoBlockCGSolMgr total solve time 0.2488 (1) 0.2489 (1) 0.249 (1) 0.2489 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 0.06652 (89) 0.07686 (89) 0.08802 (89) 0.0008635 (89)
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y) 0.09555 (180) 0.1198 (180) 0.1497 (180) 0.0006653 (180)
=============================================================================================================================
---------- Actual Residuals (normalized) ----------
Problem 0 : 5.02551e-09
Problem 1 : 5.92159e-09
Problem 2 : 6.61897e-09
Problem 3 : 8.2598e-09
Problem 4 : 3.67011e-09
End Result: TEST PASSED
```
See, the number of iterations was:
```
OK...........Number of Iterations = 87 < 100
```
The other instances of failing tests I looked at from the above query all show maxing out the number of iterations:
```
Failed.......Number of Iterations = 100 == 100
```
This looks nearly identical to the behavior of the randomly failing tests `Belos_pseudo_stochastic_pcg_hb_0_MPI_4` and `Belos_pseudo_stochastic_pcg_hb_1_MPI_4` described in issue #2920, which got disabled. My guess is that this is suffering from the same random generator problem described in https://github.com/trilinos/Trilinos/issues/2920#issuecomment-398109326.
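If that guess is right (an assumption based on the linked #2920 comment, not something verified here), the flakiness comes from a randomly generated right-hand side whose difficulty varies run to run. A minimal sketch of the standard remedy, with hypothetical helper names:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: generate the test's random right-hand side from a
// fixed seed so every run sees the same data, making the CG iteration
// count reproducible instead of hovering near the iteration cap.
std::vector<double> make_rhs(std::size_t n, unsigned seed = 12345u) {
  std::mt19937 gen(seed);
  std::uniform_real_distribution<double> dist(-1.0, 1.0);
  std::vector<double> b(n);
  for (double& x : b) x = dist(gen);
  return b;
}
```

The actual fix adopted for #2920 may differ; this only illustrates the class of problem (unseeded or clock-seeded RNGs in tolerance-sensitive tests).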
## Steps to Reproduce
One could try to reproduce this on 'white' or 'ride' as described at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite
using something like:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug # or cuda-opt, gnu-debug-openmp, etc.
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
```
But since this is a randomly failing test that usually passes, it may be hard to reproduce this behavior locally.
Keep promoted "ATDM" builds of Trilinos clean

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2998 — Intel license server problems on several HPC machines causing many build failures for intel builds on 6/22/2018 (James Willenbring, last updated 2018-11-30T03:22:19Z)
*Created by: bartlettroscoe*
CC: @fryeguy52
This issue is just to log a bunch of Intel build failures on the HPC machines 'chama', 'serrano', and 'mutrino', as shown in [this query](https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&date=2018-06-21&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=63&value1=-atdm-&field2=buildname&compare2=63&value2=intel), that appear to be due to problems communicating with the Intel license server. This resulted in a lot of "not run" tests.
This is the type of thing that will bring down any automated processes that require builds on these machines to run and pass.
Keep promoted "ATDM" builds of Trilinos clean

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2976 — Test Tempus_DIRK_Combined_FSA_MPI_1 timing out semi-randomly in 'gnu-opt-serial' and 'gnu-debug-serial' builds on 'hansen' starting 6/5/2018 (James Willenbring, last updated 2018-11-30T03:21:40Z)
*Created by: bartlettroscoe*
CC: @fryeguy52
## Next Action Status
Reduced parallelism from `ctest -j16` to `ctest -j8` in #2987, merged on 6/20/2018; on 6/21/2018 the max test time was under 6m 30s. But this came at the price of the test wall-clock times for the GNU debug and opt serial builds going up from 38m and 28m to 49m and 35m, respectively.
## Description
As shown in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-19&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=13&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=status&compare2=62&value2=passed&field3=status&compare3=62&value3=notrun&field4=buildname&compare4=62&value4=Trilinos-atdm-mutrino-intel-debug-openmp&field5=buildname&compare5=62&value5=Trilinos-atdm-mutrino-intel-opt-openmp&field6=buildname&compare6=62&value6=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field7=buildname&compare7=62&value7=Trilinos-atdm-serrano-intel-debug-openmp&field8=buildname&compare8=62&value8=Trilinos-atdm-serrano-intel-opt-openmp&field9=site&compare9=62&value9=ride&field10=testname&compare10=65&value10=Tempus&field11=buildstarttime&compare11=84&value11=2018-06-20&field12=buildstarttime&compare12=83&value12=2018-06-01&field13=site&compare13=61&value13=hansen) the test `Tempus_DIRK_Combined_FSA_MPI_1` is timing out semi-randomly starting on 6/5/2018 in the builds `Trilinos-atdm-hansen-shiller-gnu-opt-serial` and `Trilinos-atdm-hansen-shiller-gnu-debug-serial` and every day since 6/15/2018 in the build `Trilinos-atdm-hansen-shiller-gnu-opt-serial`
Looking at the history of this test in the build `Trilinos-atdm-hansen-shiller-gnu-opt-serial` over the last month in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-19&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=61&value2=Trilinos-atdm-hansen-shiller-gnu-opt-serial&field3=buildstarttime&compare3=84&value3=2018-06-20&field4=buildstarttime&compare4=83&value4=2018-05-20&field5=testname&compare5=61&value5=Tempus_DIRK_Combined_FSA_MPI_1) shows that this test has been taking nearly 7 minutes when it does complete and pass. Then, starting on 6/15/2018, it went from taking just under 7 minutes most days to timing out.
What is interesting is that if you look at the history of this test for the build `Trilinos-atdm-hansen-shiller-intel-opt-serial` on 'hansen' in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-19&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=61&value2=Trilinos-atdm-hansen-shiller-intel-opt-serial&field3=buildstarttime&compare3=84&value3=2018-06-20&field4=buildstarttime&compare4=83&value4=2018-05-20&field5=testname&compare5=61&value5=Tempus_DIRK_Combined_FSA_MPI_1), you see that this test passes in under 25 sec every time!
Looking at what might have changed in Trilinos between 6/14 and 6/15/2018 for this build shown at:
* https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3616638#!#note0
and given the analysis in the "Steps to Reproduce" section below, I believe that commit that is triggering these timeouts is:
```
22ec935: Revert "Switch from srun to salloc on hansen/shiller (TRIL-209)"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Wed Jun 13 16:59:08 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/README.md
```
## Steps to Reproduce
As described at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#shillerhansen
I tried to reproduce this on 'shiller' (since 'hansen' is always loaded running tests) with:
```
$ ./checkin-test-atdm.sh gnu-opt-serial --enable-packages=Tempus --configure --build
$ /usr/bin/srun ./checkin-test-atdm.sh gnu-opt-serial --enable-packages=Tempus --test
```
but all of the tests passed with:
```
...
36/36 Test #20: Tempus_DIRK_Combined_FSA_MPI_1 ................... Passed 370.61 sec
100% tests passed, 0 tests failed out of 36
Label Time Summary:
Tempus = 2990.35 sec (36 tests)
Total Test time (real) = 381.72 sec
```
The time of 370 sec is pretty far from the 600 sec (10 min) timeout limit.
I wondered whether you need more of an overload of the machine to trigger this, so I ran with:
```
$ ./checkin-test-atdm.sh gnu-opt-serial --enable-packages=Tempus,Panzer --configure --build
$ /usr/bin/srun \
./checkin-test-atdm.sh gnu-opt-serial --enable-packages=Tempus,Panzer --test
```
This produced:
```
...
174/191 Test #28: Tempus_IMEX_RK_Combined_FSA_MPI_1 ................................***Timeout 600.28 sec
175/191 Test #20: Tempus_DIRK_Combined_FSA_MPI_1 ...................................***Timeout 600.38 sec
...
99% tests passed, 2 tests failed out of 191
Label Time Summary:
Panzer = 1508.25 sec (155 tests)
Tempus = 3993.75 sec (36 tests)
Total Test time (real) = 736.46 sec
The following tests FAILED:
20 - Tempus_DIRK_Combined_FSA_MPI_1 (Timeout)
28 - Tempus_IMEX_RK_Combined_FSA_MPI_1 (Timeout)
```
If I then run that one test by itself with:
```
$ cd gnu-opt-serial/
$ . load-env.sh
Hostname 'shiller01' matches known ATDM host 'shiller' and system 'shiller'
ATDM_CONFIG_TRILNOS_DIR = /home/rabartl/Trilinos.base/Trilinos
Setting default compiler and build options for JOB_NAME='gnu-opt-serial'
Using hansen/shiller compiler stack GNU to build RELEASE code with Kokkos node type SERIAL
$ cd packages/tempus/
$ ctest -R Tempus_DIRK_Combined_FSA_MPI_1
Test project /home/rabartl/Trilinos.base/BUILDS/SHILLER/CHECKIN/gnu-opt-serial/packages/tempus
Start 20: Tempus_DIRK_Combined_FSA_MPI_1
1/1 Test #20: Tempus_DIRK_Combined_FSA_MPI_1 ... Passed 16.44 sec
100% tests passed, 0 tests failed out of 1
Label Time Summary:
Tempus = 16.44 sec*proc (1 test)
Total Test time (real) = 16.46 sec
```
This proves that this is a problem of test processes running on top of each other, likely on the same core.

Keep promoted "ATDM" builds of Trilinos clean

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2974 — Trilinos GCC 4.8.4 PR build and running of tests is broken (James Willenbring, last updated 2018-06-19T20:25:41Z)
*Created by: bartlettroscoe*
@trilinos/framework, @eric-c-cyr, @rppawlo, @trilinos/nox
## Expectations
The GCC 4.8.4 auto PR build should work unless the PR code itself is broken.
## Current Behavior
The GCC 4.8.4 auto PR build has broken NOX test builds and all of the tests in every package fail, starting in the auto PR testing iteration build last night:
* https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&parentid=3627737&filtercount=3&showfilters=1&field1=groupname&compare1=61&value1=Pull%20Request&field2=buildstarttime&compare2=84&value2=NOW&field3=buildstarttime&compare3=83&value3=1%20week%20ago&filtercombine=and
You can see the NOX failures at:
* https://testing-vm.sandia.gov/cdash/viewBuildError.php?buildid=3627737
And all of the tests fail because `--bind-to none` is being used with `mpiexec` with OpenMPI 1.6.5, as shown, for example, at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=48792061&build=3627756
which shows the commandline:
```
"/projects/sems/install/rhel6-x86_64/sems/compiler/gcc/4.8.4/openmpi/1.6.5/bin/mpiexec \"--bind-to\" \"none\" \"-np\" \"1\" \"/scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/pull_request_test/packages/kokkos-kernels/unit_test/KokkosKernels_blas_serial.exe\" \"--gtest_filter=-serial.gemm_double\"
```
As described in #2462 and #2788, you can't use those options with OpenMPI 1.6.5. OpenMPI 1.10.1 is the only version in the SEMS env where those options work (see #2462 and #2788).
It is important to fix this ASAP since every PR build will fail and therefore no one will be able to merge any PRs until this gets fixed.
## Motivation and Context
I can't merge my PR #2964.
## Definition of Done
PR builds are working unless code changes break them.
## Possible Solution
Use the source and *.cmake scripts in #2788. That will fix it.
## Steps to Reproduce
## Your Environment
N/A. This is the auto PR testing system that is having this issue.
## Related Issues
* Blocks: #2964
Improve productivity, stability, and quality of Trilinos

---

# Issue #2965: Tests Belos_resolve_[cg,gmres]_hb_MPI_4 failing on ATDM builds on 'chama' and 'serrano'

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2965 (reported by James Willenbring, updated 2018-11-30)

*Created by: fryeguy52*
## Next Action Status
Tests were disabled in commits cbb56a7 and 5be802c merged to 'develop' in 6/25/2018 in PR #3011. Commit that should fix the tests were pushed to develop in commit dcbfd13e3aad9f8be5baf9ebe964e6bc1353aea5 and PR #3098 was merged on 7/14/2018 that re-enables these tests. Next: Watch to see that these tests are running and let run for a few weeks before closing issue on 8/15/2018 ...
## Description
The test `Belos_resolve_cg_hb_MPI_4` is failing intermittently in the `serrano` and `chama` ATDM builds. Here is a history of that test for the affected builds.
[Trilinos-atdm-chama-intel-opt-openmp](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&date=&filtercount=3&showfilters=1&filtercombine=and&field1=testname&compare1=63&value1=Belos_resolve_cg_hb_MPI_4&field2=buildname&compare2=61&value2=Trilinos-atdm-chama-intel-opt-openmp&field3=buildstarttime&compare3=84&value3=now)
[Trilinos-atdm-chama-intel-debug-openmp](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&date=&filtercount=3&showfilters=1&filtercombine=and&field1=testname&compare1=63&value1=Belos_resolve_cg_hb_MPI_4&field2=buildname&compare2=61&value2=Trilinos-atdm-chama-intel-debug-openmp&field3=buildstarttime&compare3=84&value3=now)
[Trilinos-atdm-serrano-intel-opt-openmp](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&date=&filtercount=3&showfilters=1&filtercombine=and&field1=testname&compare1=63&value1=Belos_resolve_cg_hb_MPI_4&field2=buildname&compare2=61&value2=Trilinos-atdm-serrano-intel-opt-openmp&field3=buildstarttime&compare3=84&value3=now)
[Trilinos-atdm-serrano-intel-debug-openmp](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&date=&filtercount=3&showfilters=1&filtercombine=and&field1=testname&compare1=63&value1=Belos_resolve_cg_hb_MPI_4&field2=buildname&compare2=61&value2=Trilinos-atdm-serrano-intel-debug-openmp&field3=buildstarttime&compare3=84&value3=now)
```
Fail Reason | Required regular expression not found. Regex=[End Result: TEST PASSED]
```
```
End Result: TEST FAILED
srun: error: ser403: task 0: Exited with exit code 1
```
[Full test output on CDash](https://testing.sandia.gov/cdash/testDetails.php?test=49643694&build=3640224)
## Definition of Done
The test is either reliably passing or we disable it for these particular builds
## Steps to Reproduce
On either `serrano` or `chama`, just clone Trilinos and run the following commands. It is failing randomly, so it may pass or it may not. It appears to fail more frequently on `serrano`, so it may be easier to reproduce there.
```
export JOB_NAME=Trilinos-atdm-intel-debug-openmp
export TRILINOS_DIR=<Wherever you cloned Trilinos>
source $TRILINOS_DIR/cmake/std/atdm/load-env.sh $JOB_NAME
cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON \
-DTrilinos_ENABLE_Belos=ON \
$TRILINOS_DIR
make -j16
ctest -j16
```
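Since the failure is intermittent, a simple triage step is to re-run the single test many times and count how often it fails. The following is a generic sketch (the `count_failures` helper is hypothetical, not part of Trilinos):

```shell
# Sketch of a flaky-test triage loop: run a command N times and report the
# number of failures. Pass the real test invocation as the command.
count_failures() {
  n=$1; shift
  fail=0
  for i in $(seq 1 "$n"); do
    "$@" > /dev/null 2>&1 || fail=$((fail+1))
  done
  echo "$fail"
}
```

For example, `count_failures 20 ctest -R 'Belos_resolve_cg_hb_MPI_4'` prints how many of 20 runs failed, which gives a rough failure rate to compare between `serrano` and `chama`.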
## Your Environment
The environment will be set up automatically by running the commands given above.
Keep promoted "ATDM" builds of Trilinos clean

---

# Issue #2944: Xpetra: build error

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2944 (reported by James Willenbring, updated 2018-11-30)

*Created by: jhux2*
## Next Action Status:
Issue fixed in commit b73a7c5 pushed to 'develop'. Next: Verify that build problem is gone for SPARC configuration ...
## Description
@micahahoward has reported errors building Trilinos master, SHA1 1f66625b66
```
In file included from /home/mhoward/codes/sparc/Trilinos/packages/xpetra/src/MultiVector/Xpetra_EpetraIntMultiVector.hpp(57),
from /home/mhoward/codes/sparc/Trilinos/packages/xpetra/src/MultiVector/Xpetra_MultiVectorFactory.hpp(59),
from /home/mhoward/codes/sparc/Trilinos/packages/xpetra/src/BlockedCrsMatrix/Xpetra_BlockedCrsMatrix.hpp(60),
from /home/mhoward/codes/sparc/Trilinos/packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp(53),
from /home/mhoward/codes/sparc/Trilinos/packages/muelu/src/Utils/MueLu_Utilities_decl.hpp(76),
from /home/mhoward/codes/sparc/Trilinos/ats1-hsw_intel-17.0.4_mpich-7.7.0_static_opt_build/packages/muelu/src/MueLu_Utilities.hpp(1),
from /home/mhoward/codes/sparc/Trilinos/packages/muelu/src/MueCentral/MueLu_Level.hpp(68),
from /home/mhoward/codes/sparc/Trilinos/packages/muelu/src/MueCentral/MueLu_Factory.hpp(65),
from /home/mhoward/codes/sparc/Trilinos/packages/muelu/src/Smoothers/MueLu_SmootherPrototype_decl.hpp(52),
from /home/mhoward/codes/sparc/Trilinos/ats1-hsw_intel-17.0.4_mpich-7.7.0_static_opt_build/packages/muelu/src/MueLu_SmootherPrototype.hpp(1),
from /home/mhoward/codes/sparc/Trilinos/packages/muelu/src/Smoothers/MueLu_Amesos2Smoother_decl.hpp(56),
from /home/mhoward/codes/sparc/Trilinos/packages/muelu/src/Smoothers/MueLu_Amesos2Smoother_def.hpp(58),
from /home/mhoward/codes/sparc/Trilinos/ats1-hsw_intel-17.0.4_mpich-7.7.0_static_opt_build/packages/muelu/src/Utils/ExplicitInstantiation/MueLu_Amesos2Smoother.cpp(55):
/home/mhoward/codes/sparc/Trilinos/packages/epetra/src/Epetra_IntMultiVector.h(289): error #809: exception specification for virtual function "Epetra_IntMultiVector::~Epetra_IntMultiVector" is incompatible with that of overridden function "Epetra_Object::~Epetra_Object"
virtual ~Epetra_IntMultiVector();
```
@trilinos/xpetra @lucbv

Keep promoted "ATDM" builds of Trilinos clean

---

# Issue #2920: Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4 tests failing due to max iterations limit seemingly randomly in the `Trilinos-atdm-white-ride-cuda-debug` build on 'white'

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2920 (reported by James Willenbring, updated 2018-12-03)

*Created by: bartlettroscoe*
CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solvers Product Lead)
## Next Action Status
Disabled in build `Trilinos-atdm-white-ride-cuda-debug` in commit cc7fff2 pushed on 6/12/2018 and showed disabled and missing on CDash on 6/13/2018. PR #3546 merged on 10/2/2018 which re-enables tests that should be fixed from PR #3050 merged before. No new failures as of 12/3/2018!
## Description
As shown in [this rather complex query showing all failing Belos tests other than Belos_rcg_hb_MPI_4 in all promoted ATDM builds since 5/10/2018](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=17&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=62&value2=Trilinos-atdm-mutrino-intel-debug-openmp&field3=buildname&compare3=62&value3=Trilinos-atdm-mutrino-intel-opt-openmp&field4=buildname&compare4=62&value4=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field5=site&compare5=62&value5=ride&field6=testname&compare6=62&value6=Belos_rcg_hb_MPI_4&field7=buildstarttime&compare7=84&value7=2018-06-08&field8=buildstarttime&compare8=83&value8=2018-05-10&field9=buildname&compare9=62&value9=Trilinos-atdm-white-ride-cuda-opt&field10=buildname&compare10=62&value10=Trilinos-atdm-white-ride-gnu-opt-openmp&field11=site&compare11=62&value11=serrano&field12=site&compare12=62&value12=shiller&field13=buildname&compare13=62&value13=Trilinos-atdm-white-ride-cuda-debug-all-at-once&field14=site&compare14=62&value14=chama&field15=testname&compare15=65&value15=Belos&field16=status&compare16=62&value16=passed&field17=status&compare17=62&value17=notrun) the tests:
* Belos_pseudo_stochastic_pcg_hb_0_MPI_4
* Belos_pseudo_stochastic_pcg_hb_1_MPI_4
failed 5 times in total and appear to be randomly failing in the `Trilinos-atdm-white-ride-cuda-debug` build. (The other failing test shown was `Belos_pseudo_pcg_hb_1_MPI_4` also for the `Trilinos-atdm-white-ride-cuda-debug` build but that only failed once yesterday so we will ignore that for now.) (The test `Belos_rcg_hb_MPI_4` was excluded from the above query because it is addressed in #2919.)
Looking at the testing history for these tests `Belos_pseudo_stochastic_pcg_hb_[0,1]_MPI_4` from 5/10/2018 through today 6/8/2018 in [this less complex query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=6&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=site&compare2=62&value2=ride&field3=testname&compare3=65&value3=Belos_pseudo_stochastic_pcg_hb_&field4=buildstarttime&compare4=84&value4=2018-06-08&field5=buildstarttime&compare5=83&value5=2018-05-10&field6=buildname&compare6=61&value6=Trilinos-atdm-white-ride-cuda-debug) one can see that these tests complete in about the same time in under 2 seconds when they pass or fail.
The output when these tests fail (such as shown for the test `Belos_pseudo_stochastic_pcg_hb_1_MPI_4` yesterday on 6/7/2018 [here](https://testing-vm.sandia.gov/cdash/testDetails.php?test=48082702&build=3589607)) looks like:
```
Belos::StatusTestGeneralOutput: Passed
(Num calls,Mod test,State test): (104, 1, Passed)
Passed.......OR Combination ->
Failed.......Number of Iterations = 100 == 100
Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 8.95881e-09 < 1e-08
residual [ 1 ] = 1.21989e-08 > 1e-08
residual [ 2 ] = 6.84374e-09 < 1e-08
residual [ 3 ] = 9.15804e-09 < 1e-08
residual [ 4 ] = 7.2567e-09 < 1e-08
Passed.......OR Combination ->
Failed.......Number of Iterations = 100 == 100
Unconverged..(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 8.95881e-09 < 1e-08
residual [ 1 ] = 1.21989e-08 > 1e-08
residual [ 2 ] = 6.84374e-09 < 1e-08
residual [ 3 ] = 9.15804e-09 < 1e-08
residual [ 4 ] = 7.2567e-09 < 1e-08
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 0.06571 (101) 0.07122 (101) 0.07694 (101) 0.0007051 (101)
Belos: Operation Prec*x 0.1014 (104) 0.108 (104) 0.1151 (104) 0.001039 (104)
Belos: PseudoBlockStochasticCGSolMgr total solve time 0.2159 (1) 0.216 (1) 0.2162 (1) 0.216 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 0.0665 (102) 0.07206 (102) 0.07777 (102) 0.0007065 (102)
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y) 0.101 (210) 0.1076 (210) 0.1147 (210) 0.0005122 (210)
==================================================================================================================================
---------- Actual Residuals (normalized) ----------
Problem 0 : 8.95881e-09
Problem 1 : 1.21989e-08
Problem 2 : 6.84374e-09
Problem 3 : 9.15804e-09
Problem 4 : 7.2567e-09
End Result: TEST FAILED
```
So this shows that the test fails due to the max iteration limit of 100 being reached before reaching the desired residual tolerance. The other failures for the tests `Belos_pseudo_stochastic_pcg_hb_0_MPI_4` and `Belos_pseudo_stochastic_pcg_hb_1_MPI_4` look to all be maxing out the number of iterations at 100.
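The convergence check in the failing log can be re-derived directly from the printed residuals: a problem counts as unconverged when its normalized residual is not below the 1e-8 tolerance (values copied from the log above).

```shell
# Recheck which of the failing run's normalized residuals exceed the 1e-8
# tolerance; only problem 1 (1.21989e-08) is above it.
printf '%s\n' 8.95881e-09 1.21989e-08 6.84374e-09 9.15804e-09 7.2567e-09 |
  awk '$1 >= 1e-08 { print "problem", NR-1, "unconverged" }'
# prints: problem 1 unconverged
```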
When the test `Belos_pseudo_stochastic_pcg_hb_1_MPI_4` passed the day before on 6/6/2018, as shown [here](https://testing-vm.sandia.gov/cdash/testDetails.php?test=48012272&build=3584608), it showed output like:
```
Belos::StatusTestGeneralOutput: Passed
(Num calls,Mod test,State test): (89, 1, Passed)
Passed.......OR Combination ->
OK...........Number of Iterations = 87 < 100
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 5.02551e-09 < 1e-08
residual [ 1 ] = 5.92159e-09 < 1e-08
residual [ 2 ] = 6.61897e-09 < 1e-08
residual [ 3 ] = 8.2598e-09 < 1e-08
residual [ 4 ] = 3.67011e-09 < 1e-08
Passed.......OR Combination ->
OK...........Number of Iterations = 87 < 100
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 5.02551e-09 < 1e-08
residual [ 1 ] = 5.92159e-09 < 1e-08
residual [ 2 ] = 6.61897e-09 < 1e-08
residual [ 3 ] = 8.2598e-09 < 1e-08
residual [ 4 ] = 3.67011e-09 < 1e-08
=================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
---------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 0.0652 (88) 0.06892 (88) 0.07251 (88) 0.0007831 (88)
Belos: Operation Prec*x 0.09675 (89) 0.1009 (89) 0.1101 (89) 0.001134 (89)
Belos: PseudoBlockStochasticCGSolMgr total solve time 0.195 (1) 0.195 (1) 0.195 (1) 0.195 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 0.06596 (89) 0.06969 (89) 0.07333 (89) 0.0007831 (89)
Epetra_CrsMatrix::Solve(Upper,Trans,UnitDiag,X,Y) 0.09635 (180) 0.1006 (180) 0.1098 (180) 0.0005587 (180)
=================================================================================================================================
---------- Actual Residuals (normalized) ----------
Problem 0 : 5.02551e-09
Problem 1 : 5.92159e-09
Problem 2 : 6.61897e-09
Problem 3 : 8.2598e-09
Problem 4 : 3.67011e-09
End Result: TEST PASSED
```
which shows it converged in 87 iterations. I looked at several other instances when these tests passed and they all look to be converging in 87 iterations.
Is this non-deterministic behavior due to the fact that this is "stochastic" code and therefore the behavior is truly random, is it due to the random seed not being set consistently, or is it due to non-deterministic behavior in the accumulations with the CUDA 8.0 threaded Kokkos implementation on this machine? The fact that the test seems to converge in 87 iterations when it passes suggests that this is not purposeful random behavior but is a result of some other undesired and unintended random behavior.
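The last hypothesis is plausible because IEEE floating-point addition is not associative, so a threaded reduction that accumulates in a different order can legitimately change the computed residuals from run to run. A two-line illustration:

```shell
# Floating-point addition is not associative: regrouping the same three
# doubles changes the low-order bits of the sum.
awk 'BEGIN {
  a = (0.1 + 0.2) + 0.3;   # 0.6000000000000001
  b = 0.1 + (0.2 + 0.3);   # 0.6
  print ((a == b) ? "equal" : "different");
}'
# prints: different
```

In a serial run the accumulation order is fixed, but a CUDA or OpenMP parallel reduction can sum partial results in a nondeterministic order, which is enough to push a residual from just under 1e-8 to just over it.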
## Steps to reproduce
Following the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
one might be able to reproduce this behavior on 'white' or 'ride' by cloning the Trilinos github repo, getting on the 'develop' branch and then doing:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
```
But given that this test looks to be randomly failing, it may be hard to reproduce this behavior locally.
Keep promoted "ATDM" builds of Trilinos clean

---

# Issue #2919: Belos_rcg_hb_MPI_4 timing out in several ATDM Trilinos builds on 'hansen' since 5/29/2018

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2919 (reported by James Willenbring, updated 2018-11-30)

*Created by: bartlettroscoe*
CC: @trilinos/belos, @fryeguy52, @srajama1 (Linear Solvers Product Lead)
## Next Action Status
Test was disabled in these builds on 'hansen' in the commit 8850c64 pushed on 6/12/2018 and was shown to be disabled in the builds on CDash 6/13/2018
## Description
As shown in [this large query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=19&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=62&value2=Trilinos-atdm-mutrino-intel-debug-openmp&field3=buildname&compare3=62&value3=Trilinos-atdm-mutrino-intel-opt-openmp&field4=buildname&compare4=62&value4=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field5=buildname&compare5=62&value5=Trilinos-atdm-serrano-intel-debug-openmp&field6=buildname&compare6=62&value6=Trilinos-atdm-serrano-intel-opt-openmp&field7=buildname&compare7=62&value7=Trilinos-atdm-chama-intel-opt-openmp&field8=buildname&compare8=62&value8=Trilinos-atdm-chama-intel-debug-openmp-panzer&field9=buildname&compare9=62&value9=Trilinos-atdm-chama-intel-debug-openmp&field10=buildname&compare10=62&value10=Trilinos-atdm-chama-intel-opt-openmp-panzer&field11=site&compare11=62&value11=ride&field12=testname&compare12=61&value12=Belos_rcg_hb_MPI_4&field13=buildstarttime&compare13=84&value13=2018-06-08&field14=buildstarttime&compare14=83&value14=2018-05-10&field15=buildname&compare15=62&value15=Trilinos-atdm-white-ride-cuda-opt&field16=buildname&compare16=62&value16=Trilinos-atdm-white-ride-gnu-opt-openmp&field17=site&compare17=62&value17=serrano&field18=site&compare18=62&value18=shiller&field19=buildname&compare19=62&value19=Trilinos-atdm-white-ride-cuda-debug-all-at-once) the test `Belos_rcg_hb_MPI_4` looks to be consistently timing out in the builds:
* Trilinos-atdm-hansen-shiller-cuda-8.0-debug
* Trilinos-atdm-hansen-shiller-cuda-8.0-opt
* Trilinos-atdm-hansen-shiller-cuda-9.0-debug
* Trilinos-atdm-hansen-shiller-cuda-9.0-opt
* Trilinos-atdm-hansen-shiller-gnu-debug-serial
* Trilinos-atdm-hansen-shiller-gnu-opt-serial
all on 'hansen' starting on 5/29/2018 or 5/30/2018. (Since these builds pull directly from the 'develop' branch, they may be testing different versions on the same day, and since these timestamps are UTC, the two dates may be the same testing day in Mountain time.)
That same query shows that that test has been consistently passing in every other promoted build on every other ATDM Trilinos testing machine.
What that query also shows is that in those same builds that are now timing out, the test was taking upwards of 6+ minutes to complete before it started timing out at 10 minutes on 5/29/2018 or 5/30/2018, as shown in the last non-timing-out builds:
* Trilinos-atdm-hansen-shiller-cuda-8.0-debug: 6m 26s 280ms
* Trilinos-atdm-hansen-shiller-cuda-8.0-opt: 6m 25s 680ms
* Trilinos-atdm-hansen-shiller-cuda-9.0-debug: 6m 22s 810ms
* Trilinos-atdm-hansen-shiller-cuda-9.0-opt: 6m 22s 440ms
* Trilinos-atdm-hansen-shiller-gnu-debug-serial: 6m 13s 150ms
* Trilinos-atdm-hansen-shiller-gnu-opt-serial: 5m 58s 960ms
But in the other builds, which are not showing any timeouts, that test completes very quickly (in under 30 seconds in almost every case). Some of the recent test times shown in that query for the builds that don't have timeouts are:
* Trilinos-atdm-hansen-shiller-gnu-debug-openmp: 23s 850ms
* Trilinos-atdm-hansen-shiller-gnu-opt-openmp: 8s 650ms
* Trilinos-atdm-hansen-shiller-intel-debug-openmp: 7s 720ms
* Trilinos-atdm-hansen-shiller-intel-debug-serial: 7s 950ms
* Trilinos-atdm-hansen-shiller-intel-opt-openmp: 6s 150ms
* Trilinos-atdm-hansen-shiller-intel-opt-serial: 5s 910ms
* Trilinos-atdm-rhel6-gnu-debug-openmp: 6s 840ms
* Trilinos-atdm-rhel6-gnu-debug-serial: 5s 340ms
* Trilinos-atdm-rhel6-gnu-opt-openmp: 5s 180ms
* Trilinos-atdm-rhel6-gnu-opt-serial: 4s 250ms
* Trilinos-atdm-rhel6-intel-opt-openmp: 3s 740ms
* Trilinos-atdm-sems-gcc-7-2-0: 5s 290ms
* Trilinos-atdm-white-ride-cuda-debug: 9s 430ms
* Trilinos-atdm-white-ride-gnu-debug-openmp: 9s 90ms
So this seems pretty crazy. How can the same test take over 6 minutes to complete for a CUDA 8.0 and 9.0 optimized build on 'hansen' and only take 9 seconds for a CUDA debug build on 'white'? And this test takes a very long time (and is now timing out) for the `gnu-debug-serial` and `gnu-opt-serial` builds as well on 'hansen', but is fast for the `intel-debug-serial` and `intel-opt-serial` builds on the same machine. How can that be the case?
To try to get more insight about this test we can look at the test output for a case where it takes a long time to run (and is timing out currently) and compare that to the test output for a case that completes very quickly.
First, let's look at the last time this test passed for the `Trilinos-atdm-hansen-shiller-gnu-debug-serial` build on 'hansen', which took 6m 13s 150ms to complete and pass on 2018-05-29T06:41:19 UTC, with output shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=47454651&build=3555977
which shows:
```
Passed.......OR Combination ->
OK...........Number of Iterations = 2206 < 4000
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.56537e-07 < 1e-06
residual [ 1 ] = 9.4486e-07 < 1e-06
residual [ 2 ] = 9.24543e-07 < 1e-06
residual [ 3 ] = 9.44363e-07 < 1e-06
residual [ 4 ] = 9.64382e-07 < 1e-06
residual [ 5 ] = 9.14533e-07 < 1e-06
residual [ 6 ] = 9.50517e-07 < 1e-06
residual [ 7 ] = 8.31671e-07 < 1e-06
residual [ 8 ] = 9.59686e-07 < 1e-06
residual [ 9 ] = 9.74218e-07 < 1e-06
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 1.489 (2.114e+04) 1.582 (2.114e+04) 1.668 (2.114e+04) 7.483e-05 (2.114e+04)
Belos: Operation Prec*x 0 (0) 0 (0) 0 (0) 0 (0)
Belos: RCGSolMgr total solve time 365.4 (1) 365.4 (1) 365.4 (1) 365.4 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 1.45 (2.114e+04) 1.542 (2.114e+04) 1.629 (2.114e+04) 7.295e-05 (2.114e+04)
==================================================================================================================================
```
And let's compare this to the test output for the build `Trilinos-atdm-hansen-shiller-intel-debug-serial` on 'hansen' which took 6s 740ms to complete and pass on 2018-05-29T14:52:35 UTC shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=47482010&build=3557186
which shows:
```
Passed.......OR Combination ->
OK...........Number of Iterations = 2131 < 4000
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.5909e-07 < 1e-06
residual [ 1 ] = 9.65321e-07 < 1e-06
residual [ 2 ] = 8.59334e-07 < 1e-06
residual [ 3 ] = 9.55053e-07 < 1e-06
residual [ 4 ] = 9.97094e-07 < 1e-06
residual [ 5 ] = 7.53902e-07 < 1e-06
residual [ 6 ] = 8.46489e-07 < 1e-06
residual [ 7 ] = 9.64082e-07 < 1e-06
residual [ 8 ] = 9.92318e-07 < 1e-06
residual [ 9 ] = 9.92263e-07 < 1e-06
==================================================================================================================================
TimeMonitor results over 4 processors
Timer Name MinOverProcs MeanOverProcs MaxOverProcs MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x 2.026 (2.109e+04) 2.179 (2.109e+04) 2.403 (2.109e+04) 0.0001033 (2.109e+04)
Belos: Operation Prec*x 0 (0) 0 (0) 0 (0) 0 (0)
Belos: RCGSolMgr total solve time 5.945 (1) 5.946 (1) 5.946 (1) 5.946 (1)
Epetra_CrsMatrix::Multiply(TransA,X,Y) 1.975 (2.109e+04) 2.116 (2.109e+04) 2.316 (2.109e+04) 0.0001003 (2.109e+04)
==================================================================================================================================
```
The times for the individual operations are not that different, but the "Belos: RCGSolMgr total solve time" of 365.4s vs. 5.946s is the real problem. The final results show that the test is doing different computations in these two builds, but the total number of operations is not radically different (e.g. 2.114e+04 vs. 2.109e+04 mat-vecs). So what is going on here to cause the huge increase in wall clock time for a serial Kokkos threading test?
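As a quick sanity check on the two logs, the ratio of the two "Belos: RCGSolMgr total solve time" values quantifies the slowdown even though the operation counts are nearly identical:

```shell
# Slowdown implied by the two total-solve-time values quoted above
# (gnu-debug-serial: 365.4 s, intel-debug-serial: 5.946 s).
awk 'BEGIN { printf "slowdown = %.1fx\n", 365.4 / 5.946 }'
# prints: slowdown = 61.5x
```

A ~60x wall-clock gap with near-identical mat-vec counts points at time spent outside the timed kernels (e.g. between iterations), which is why adding more timers is the suggested next step below.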
Looking at the new commits pulled in when this started to fail for the build `Trilinos-atdm-hansen-shiller-gnu-opt-serial` on 2018-05-29 14:05:09 shown at:
* https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3560199#!#note0
it is hard to tell what might have caused these tests to start timing out. I would guess that the most likely trigger was:
```
c840658: Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:12:42 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/shiller/environment.sh
```
That will increase the number of tests running on the machine and could result in single tests taking longer to run.
But the fact that the same test takes 6 minutes with GCC but only 7 seconds with Intel is, in my opinion, a major problem and has to be investigated.
Someone is going to need to add some more timers to account for where the time is going.
## Steps to reproduce
One should be able to follow the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
to reproduce this behavior on 'hansen' or 'shiller'. To avoid needing to run on a compute node, one could use the `gnu-debug-serial` build and do:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-serial
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -VV -R Belos_rcg_hb_MPI_4
```
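To compare the GNU and Intel runtimes directly, it may help to wall-clock just the suspect test under each loaded environment. A rough sketch (the `wall_secs` helper is hypothetical, with one-second granularity from `date +%s`):

```shell
# Hypothetical helper: report wall-clock seconds for a single command,
# e.g.  wall_secs ctest -R 'Belos_rcg_hb_MPI_4'
wall_secs() {
  start=$(date +%s)
  "$@" > /dev/null 2>&1
  echo $(( $(date +%s) - start ))
}
```

Running this under the `gnu-debug-serial` and then the `intel-debug-serial` environment on the same node would confirm whether the 6-minute vs. 7-second gap reproduces outside the nightly driver.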
Keep promoted "ATDM" builds of Trilinos clean

---

# Issue #2925: Test Stratimikos_test_aztecoo_thyra_driver_MPI_1 timing out in Trilinos-atdm-hansen-shiller-gnu-debug-serial build since 5/30/2018

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2925 (reported by James Willenbring, updated 2018-11-30)

*Created by: bartlettroscoe*
CC: @trilinos/stratimikos, @fryeguy52
## Next Action Status
Test was disabled for these two builds on 'hansen' in commit 73ae19c pushed on 6/12/2018 and this test disappeared in these builds on 6/13/2018.
## Description
As shown in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-11&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=11&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=status&compare2=62&value2=passed&field3=status&compare3=62&value3=notrun&field4=buildname&compare4=62&value4=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field5=site&compare5=62&value5=mutrino&field6=site&compare6=62&value6=serrano&field7=site&compare7=62&value7=chama&field8=site&compare8=62&value8=ride&field9=buildstarttime&compare9=84&value9=2018-06-11&field10=buildstarttime&compare10=83&value10=2018-05-20&field11=testname&compare11=65&value11=Stratimikos), the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` has been timing out in the builds `Trilinos-atdm-hansen-shiller-gnu-debug-serial` and `Trilinos-atdm-hansen-shiller-gnu-opt-serial` since 5/30/2018. (That query also shows this is the only Stratimikos test that has failed in any of the promoted "ATDM" builds since 5/20/2018.)
[This query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-11&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=61&value2=Trilinos-atdm-hansen-shiller-gnu-debug-serial&field3=buildstarttime&compare3=84&value3=2018-06-11&field4=buildstarttime&compare4=83&value4=2018-05-20&field5=testname&compare5=61&value5=Stratimikos_test_aztecoo_thyra_driver_MPI_1) shows that the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` went from passing at under 21s every day to timing out at 10 minutes every day since 5/29/2018 (but it did pass once taking 9m 56s 930ms on 6/8/2018, the only time it did not time-out since 5/29/2018).
What changed from 5/29/2018 to 5/30/2018? Looking at the updates pulled in for the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial` with build stamp `20180530-0400-ATDM` shown at:
* https://testing-vm.sandia.gov/cdash/viewNotes.php?buildid=3558860#!#note0
it seems like only commits that could have impacted this were:
```
c9ccf7d: Switch from srun to salloc on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:35:16 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/README.md
c840658: Switch to CMake 3.11.2, Ninja 1.8.2 and all-at-once mode on hansen/shiller (TRIL-209)
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Tue May 29 08:12:42 2018 -0600
M cmake/ctest/drivers/atdm/shiller/local-driver.sh
M cmake/std/atdm/shiller/environment.sh
```
There are no other commits that I could see that could impact this AztecOO test. So it looks like moving to CMake/CTest 3.11.2 and to the all-at-once approach triggered this large increase in runtime for the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` for the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial`. This may have been a result of having more tests running while this Stratimikos test is running.
Looking in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-10&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=61&value2=Stratimikos_test_aztecoo_thyra_driver_MPI_1), we can see that the test `Stratimikos_test_aztecoo_thyra_driver_MPI_1` timed out in the build `Trilinos-atdm-hansen-shiller-gnu-debug-serial` yesterday, 6/10/2018, but it took upwards of 2.5 to 3.5 minutes to run in the CUDA builds. Otherwise, this test did not take any longer than 22s to run in any of the other ATDM builds of Trilinos. What is also interesting is that the query shows this test passed in 4s 460ms for the build `Trilinos-atdm-hansen-shiller-intel-debug-serial`, also run on 'hansen'. How can the same test pass in an `intel-debug-serial` build in under 5 seconds but then time out at 10 minutes in a `gnu-debug-serial` build on the same hardware with the same MPI implementation and settings?
For that matter, [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-10&filtercombine=and&filtercount=1&showfilters=1&field1=testname&compare1=61&value1=Stratimikos_test_aztecoo_thyra_driver_MPI_1) shows that other than the CUDA builds of Trilinos and the yet-to-be-cleaned-up 'mutrino' build `Trilinos-atdm-mutrino-intel-debug-openmp`, this test did not take any longer than 22s to run in any of the 46 Trilinos builds where this test ran yesterday. On some platforms, this test completed in less than 2s!
This is very strange behavior for a test. There must be some type of machine or system usage issue going on here. But why would it impact a `gnu-debug-serial` build but not an `intel-debug-serial` build on the same machine?
## Steps to reproduce
Following the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#shillerhansen
one can log on to 'hansen' or 'shiller', clone Trilinos and get on to the 'develop' branch, and then do:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-openmp
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stratimikos=ON \
$TRILINOS_DIR
$ make NP=16
$ salloc ctest -j16
```
I did this on 'shiller' but unfortunately all of the Stratimikos tests passed:
```
100% tests passed, 0 tests failed out of 40
Subproject Time Summary:
Stratimikos = 256.50 sec*proc (40 tests)
Total Test time (real) = 20.84 sec
```
I was therefore not able to reproduce this behavior on 'shiller', so this must be some type of system issue.
Keep promoted "ATDM" builds of Trilinos clean

# Test TpetraCore_MatrixMatrix_UnitTests_MPI_4 failing in all ATDM Trilinos CUDA builds starting 6/7/2018
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2921 (James Willenbring, updated 2018-11-30T03:09:57Z)
*Created by: bartlettroscoe*
CC: @trilinos/tpetra, @fryeguy52, @kddevin (Data Services Product Lead)
## Next Action Status
Merged PR #2122 fixed this in all ATDM builds on 6/11/2018.
## Description
As shown in [this query for the test TpetraCore_MatrixMatrix_UnitTests_MPI_4 between 6/5/2018 and 6/8/2018](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-07&filtercombine=and&filtercombine=and&filtercount=14&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=62&value2=Trilinos-atdm-mutrino-intel-debug-openmp&field3=buildname&compare3=62&value3=Trilinos-atdm-mutrino-intel-opt-openmp&field4=buildname&compare4=62&value4=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field5=buildname&compare5=62&value5=Trilinos-atdm-serrano-intel-debug-openmp&field6=buildname&compare6=62&value6=Trilinos-atdm-serrano-intel-opt-openmp&field7=buildname&compare7=62&value7=Trilinos-atdm-chama-intel-opt-openmp&field8=buildname&compare8=62&value8=Trilinos-atdm-chama-intel-debug-openmp-panzer&field9=buildname&compare9=62&value9=Trilinos-atdm-chama-intel-debug-openmp&field10=buildname&compare10=62&value10=Trilinos-atdm-chama-intel-opt-openmp-panzer&field11=site&compare11=62&value11=ride&field12=testname&compare12=61&value12=TpetraCore_MatrixMatrix_UnitTests_MPI_4&field13=buildstarttime&compare13=84&value13=2018-06-09&field14=buildstarttime&compare14=83&value14=2018-06-05), this test started failing in all of the CUDA builds:
* Trilinos-atdm-hansen-shiller-cuda-8.0-debug
* Trilinos-atdm-hansen-shiller-cuda-8.0-opt
* Trilinos-atdm-hansen-shiller-cuda-9.0-debug
* Trilinos-atdm-hansen-shiller-cuda-9.0-opt
* Trilinos-atdm-white-ride-cuda-debug
* Trilinos-atdm-white-ride-cuda-opt
starting on 6/7/2018.
The failing test output, for example, for the build `Trilinos-atdm-hansen-shiller-cuda-9.0-opt` on 'hansen' on 2018-06-07T18:36:47 UTC shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=48129338&build=3591841
showed
```
p=0: *** Caught standard std::exception of type 'std::runtime_error' :
Invalid SPGEMMAlgorithm name
[FAILED] (6.84 sec) Tpetra_MatMat_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_operations_test_UnitTest
Location: /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-9.0-opt/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/MatrixMatrix/MatrixMatrix_UnitTests.cpp:789
...
p=0: *** Caught standard std::exception of type 'std::runtime_error' :
Invalid SPGEMMAlgorithm name
Tpetra sparse matrix-matrix multiply: range row test
getIdentityMatrix
Create row Map
Create CrsMatrix
[FAILED] (0.865 sec) Tpetra_MatMat_double_int_longlong_Kokkos_Compat_KokkosCudaWrapperNode_operations_test_UnitTest
Location: /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-9.0-opt/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/MatrixMatrix/MatrixMatrix_UnitTests.cpp:789
...
The following tests FAILED:
0. Tpetra_MatMat_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_operations_test_UnitTest ...
10. Tpetra_MatMat_double_int_longlong_Kokkos_Compat_KokkosCudaWrapperNode_operations_test_UnitTest ...
Total Time: 11.1 sec
Summary: total = 20, run = 20, passed = 18, failed = 2
End Result: TEST FAILED
```
All of the other failed test runs showed nearly identical test output.
When the test passed for the build `Trilinos-atdm-hansen-shiller-cuda-9.0-opt` on 'hansen' on 2018-06-05T20:22:00 UTC shown at:
* https://testing-vm.sandia.gov/cdash/testDetails.php?test=47987300&build=3582617
it showed the test output:
```
Total Time: 29.4 sec
Summary: total = 20, run = 20, passed = 20, failed = 0
End Result: TEST PASSED
```
## Steps to reproduce
Following the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
one should be able to reproduce this failure on 'hansen', 'shiller', 'white', or 'ride'. Given that 'white' is on the SON and is pretty unloaded, one can reproduce this as described at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite
with:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Tpetra=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -VV -R TpetraCore_MatrixMatrix_UnitTests_MPI_4
```
Keep promoted "ATDM" builds of Trilinos clean

# Panzer tests randomly crashing with "out of memory" errors on CUDA 8.0 and 9.0 debug builds on 'hansen' since Kokkos and KokkosKernels update on 5/26/2018
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2913 (James Willenbring, updated 2018-11-30T03:15:27Z)
*Created by: bartlettroscoe*
CC: @trilinos/kokkos, @trilinos/panzer, @fryeguy52
## Next Action Status
After fixing the jobs on 'hansen' to run on the compute nodes instead of the login node in PR #2941 merged on 6/14/2018, we have not seen any of these Panzer test failures showing this "out of memory" error.
## Description
Once the consistent crashes and timeouts of Panzer tests were fixed after the last Kokkos and KokkosKernels update (see #2827), we have been seeing randomly failing Panzer tests (13 total failures so far) in the `cuda-8.0-debug` and `cuda-9.0-debug` builds on 'hansen' (but **not** 'white') as shown in the [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-07&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=16&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=status&compare2=62&value2=passed&field3=status&compare3=62&value3=notrun&field4=buildname&compare4=62&value4=Trilinos-atdm-mutrino-intel-debug-openmp&field5=buildname&compare5=62&value5=Trilinos-atdm-mutrino-intel-opt-openmp&field6=buildname&compare6=62&value6=Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once&field7=buildname&compare7=62&value7=Trilinos-atdm-serrano-intel-debug-openmp&field8=buildname&compare8=62&value8=Trilinos-atdm-serrano-intel-opt-openmp&field9=buildname&compare9=62&value9=Trilinos-atdm-chama-intel-opt-openmp&field10=buildname&compare10=62&value10=Trilinos-atdm-chama-intel-debug-openmp-panzer&field11=buildname&compare11=62&value11=Trilinos-atdm-chama-intel-debug-openmp&field12=buildname&compare12=62&value12=Trilinos-atdm-chama-intel-opt-openmp-panzer&field13=site&compare13=62&value13=ride&field14=testname&compare14=65&value14=Panzer&field15=buildstarttime&compare15=84&value15=2018-06-08&field16=buildstarttime&compare16=83&value16=2018-06-02). That query shows the following tests failing (not timing out):
* PanzerAdaptersSTK_CurlLaplacianExample
* PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-1
* PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-2
* PanzerAdaptersSTK_PoissonExample-ConvTest-Quad-Order-3
* PanzerAdaptersSTK_PoissonExample-ConvTest-Quad-Order-4
* PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4
These tests all show the output:
```
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorMemoryAllocation): out of memory /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-9.0-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
[hansen02:50161] *** Process received signal ***
[hansen02:50161] Signal: Aborted (6)
[hansen02:50161] Signal code: (-6)
```
This error was first observed in https://github.com/trilinos/Trilinos/issues/2827#issuecomment-394457902, and in https://github.com/trilinos/Trilinos/issues/2827#issuecomment-394463525 it was stated that these failures are likely due to running several tests at the same time with `ctest -j8` vs. one at a time using `ctest -j1`. (But we can't afford to run the tests with `ctest -j1`: as described in https://github.com/trilinos/Trilinos/issues/2464#issuecomment-394034743, that would take over 6 hours, there is not enough wall-clock time on 'hansen' to get all of the testing in if we did that, and it wastes the machine.)
Looking at the history for one of these tests `PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-2` in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-06-07&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=6&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=63&value2=cuda&field3=site&compare3=61&value3=hansen&field4=buildstarttime&compare4=84&value4=2018-06-08&field5=buildstarttime&compare5=83&value5=2018-06-02&field6=testname&compare6=61&value6=PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-2) one can see that the test failed 4 times but otherwise the test takes upwards of 6 minutes to complete some days. Therefore, one might assume that this is a larger test that uses more memory and therefore might cause memory exhaustion.
## Steps to reproduce
These tests are crashing randomly as described so it may be hard to reproduce this behavior locally. But one could follow the instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#shillerhansen
and give it a try. But this issue really needs to be resolved by addressing how resources for various tests are assigned and that will require a more comprehensive approach as described in #2422.
Keep promoted "ATDM" builds of Trilinos clean

# Kokkos include dirs in installed config file pointing into the build tree instead of the install tree
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2883 (James Willenbring, updated 2018-11-30T03:09:20Z)
*Created by: bartlettroscoe*
CC: @trilinos/kokkos, @trilinos/framework, @bathmatt, @micahahoward, @fryeguy52
## Next Action Status
Merge of PR #2874 back on 6/6/2018 resolved the issue, EMPIRE updated Trilinos and verified they can build against the installed Trilinos.
## Description
The last Kokkos update broke the installation of Trilinos for the EMPIRE code. The issue is that the KokkosConfig.cmake files seem to be pointing into the build tree instead of the install tree. More details can be provided by @bathmatt and @crtrott.
NOTE: This slipped through the EMPIRE Jenkins Pipeline testing and upgrade process for Trilinos because the same Jenkins entity account is used to build and test EMPIRE that is used to build and install Trilinos. The issue is that no other users can access the build directories for that Jenkins entity account.

Keep promoted "ATDM" builds of Trilinos clean

# Amesos2 test file amesos2/test/adapters/Tpetra_CrsMatrix_Adapter_UnitTests.cpp build failure for ATDM Trilinos CUDA 9.0 builds on 'hansen'/'shiller'
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2866 (James Willenbring, updated 2018-11-30T03:21:23Z)
*Created by: bartlettroscoe*
CC: @trilinos/amesos2 (Package Team), @srajama1 (Product Lead), @fryeguy52
## Next Action Status
Build error was fixed in merged PR #2876 and the build failure went away and all Amesos2 tests passed in CUDA 9.0 builds on 6/5/2018.
## Description
As shown in the query:
* https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&date=2018-05-31&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-hansen-shiller-cuda-9.0-&field2=subprojects&compare2=93&value2=Amesos2&field3=buildstarttime&compare3=84&value3=2018-06-02
Amesos2 has had a single build failure in both of the CUDA 9.0 builds on 'hansen'/'shiller':
* `Trilinos-atdm-hansen-shiller-cuda-9.0-debug`
* `Trilinos-atdm-hansen-shiller-cuda-9.0-opt`
since these CUDA 9.0 builds were first set up (see #2706).
The build failure is for the file `packages/amesos2/test/adapters/Tpetra_CrsMatrix_Adapter_UnitTests.cpp` and is shown, for example, at:
* https://testing-vm.sandia.gov/cdash/viewBuildError.php?buildid=3564003
and shows:
```
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-9.0-debug/SRC_AND_BUILD/Trilinos/packages/amesos2/test/adapters/Tpetra_CrsMatrix_Adapter_UnitTests.cpp(106): error: member "<unnamed>::test_traits<Scalar>::test_mat [with Scalar=double]" was referenced but not defined
1 error detected in the compilation of "/tmp/tmpxft_00007af7_00000000-4_Tpetra_CrsMatrix_Adapter_UnitTests.cpp4.ii".
```
This results in the "Not Run" test `Amesos2_Tpetra_CrsMatrix_Adapter_UnitTests_MPI_4` as shown, for example, at:
* https://testing-vm.sandia.gov/cdash/viewTest.php?onlynotrun&buildid=3564003
The only other failures in these CUDA 9.0 builds are due to the Kokkos update described in #2728, which also impacts the CUDA 8.0 builds. I fully expect those to go away once those issues are fixed in #2728.
## Steps to Reproduce
Following the instructions in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#shillerhansen
which is linked to from:
* https://snl-wiki.sandia.gov/display/CoodinatedDevOpsATDM/ATDM+Builds+of+Trilinos
one should be able to figure out how to reproduce this.
But to be specific, the exact instructions to reproduce this build failure are:
1. Log onto 'hansen' (SON) or 'shiller' (SON)
2. Clone the Trilinos repo (pointed to by `$TRILINOS_DIR` below) and get on the 'develop' branch
3. Create `<some_build_dir>` and do:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.0-opt
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Amesos2=ON \
$TRILINOS_DIR
$ make NP=16
```
That should reproduce the build error.
Initial cleanup of new ATDM builds of Trilinos

# Auto PR build failures due to Intel License server problems
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2864 (James Willenbring, updated 2018-06-01T18:11:28Z)
*Created by: bartlettroscoe*
CC: @trilinos/framework
## Expectations
Auto PR builds should be robust and not fail unless there is a failure in code itself.
## Current Behavior
The new Intel auto PR build fails randomly due to Intel license server problems such as shown at:
* https://github.com/trilinos/Trilinos/pull/2860#issuecomment-393742327
which showed the build failure:
* https://testing-vm.sandia.gov/cdash/viewBuildError.php?buildid=3564874
which showed:
```
Error: A license for Comp-CL is not available now (-15,570,115).
A connection to the license server could not be made. You should
make sure that your license daemon process is running: both an
lmgrd process and an INTEL process should be running
if your license limits you to a specified number of licenses in use
at a time. Also, check to see if the wrong port@host or the wrong
license file is being used, or if the port or hostname in the license
file has changed.
License file(s) used were (in this order):
1. Trusted Storage
** 2. /projects/sems/install/rhel6-x86_64/sems/compiler/intel/17.0.1/base/Licenses/intel-Linux-SRN.lic
** 3. /projects/sems/install/rhel6-x86_64/sems/compiler/intel/17.0.1/base/compilers_and_libraries_2017.1.132/linux/bin/intel64/../../Licenses
** 4. /ascldap/users/trilinos/Licenses
```
The reason that I noticed this was because it happened in my PR #2860.
Looking at the current PR build history at:
* https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=63&value1=-test-Trilinos_pullrequest_intel_17.0.1&field2=buildstarttime&compare2=84&value2=2018-06-02&field3=buildstarttime&compare3=83&value3=2018-05-14
out of 10 builds, it failed twice with these Intel license server problems. That is only an 80% success rate so far, which is not robust enough for an auto PR build.
## Motivation and Context
Auto PR builds block what goes into the 'develop' branch, and long delays make things harder.
## Definition of Done
Auto PR builds should only fail due to non-code issues very infrequently.
## Possible Solution
Don't know.
## Steps to Reproduce
Don't know.
## Your Environment
N.A. This is the auto PR builds that define their own env.
Improve productivity, stability, and quality of Trilinos