Trilinos issues
https://gitlab.osti.gov/jmwille/Trilinos/-/issues
2018-11-30T11:15:41Z

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3103
Test randomly failing test ROL_example_poisson-inversion_example_01_MPI_1 failing in PR Intel build
2018-11-30T11:15:41Z
James Willenbring

*Created by: bartlettroscoe*
CC: @trilinos/framework, @trilinos/rol, @rppawlo (Trilinos Nonlinear Solvers Product Lead)
## Next Action Status
PR #3104 merged to 'develop' on 7/13/2018 which disables `ROL_example_poisson-inversion_example_01_MPI_1` in Intel PR test build. Next: ROL developers fix behavior of test offline ...
## Description
As you can see in [this query](https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-07-12&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=testname&compare1=61&value1=ROL_example_poisson-inversion_example_01_MPI_1&field2=buildstarttime&compare2=83&value2=2018-07-01&field3=groupname&compare3=61&value3=Pull%20Request), the test `ROL_example_poisson-inversion_example_01_MPI_1` seems to be failing randomly in the Intel PR build. This just killed my PR testing iteration shown in #3100. (Now I have to put on an `AT: RETEST`, hope this does not fail again, and then stay up late to click the "merge" button in order for this to clean up the build tomorrow.)
In the case of the [#3100 PR testing iteration](https://github.com/trilinos/Trilinos/pull/3100#issuecomment-404665175), the [failing test output](https://testing-vm.sandia.gov/cdash/testDetails.php?test=48059673&build=3716185) shows:
```
Newton-Krylov using Conjugate Gradients
Line Search: Cubic Interpolation satisfying Strong Wolfe Conditions
iter value gnorm snorm #fval #grad iterCG flagCG ls_#fval ls_#grad
0 2.340112e-03 1.927880e-03
1 1.597727e-04 4.157593e-04 3.727069e+00 2 2 4 2 1 0
2 5.442664e-06 5.009082e-05 8.348624e-01 3 3 5 2 1 0
3 1.146552e-06 6.106086e-06 3.163006e+00 4 4 11 2 1 0
4 8.023717e-07 3.144919e-06 5.128519e-01 6 5 11 2 2 0
5 6.126545e-07 2.642767e-06 5.167993e-01 8 6 15 2 2 0
6 4.613227e-07 2.330904e-06 4.228759e-01 10 7 14 2 2 0
7 3.685626e-07 2.259062e-06 3.602303e-01 12 8 16 2 2 0
8 3.352764e-07 3.447963e-06 5.608285e-01 15 9 19 2 3 0
9 3.352764e-07 3.447963e-06 0.000000e+00 35 10 22 2 20 0
Optimization Terminated with Status: Step Tolerance Met
old_optimal_value = 1.0485417402164909e-07
new_optimal_value = 3.3527637557488306e-07
abs(new_optimal_value - old_optimal_value) / abs(old_optimal_value) = 2.19754915532174255333e+00 > 1.49011611938476562500e-08
End Result: TEST FAILED
```
If you look at a previous Intel PR build shown [here](https://testing-vm.sandia.gov/cdash/testDetails.php?test=45813048&build=3716020), it shows the output:
```
Newton-Krylov using Conjugate Gradients
Line Search: Cubic Interpolation satisfying Strong Wolfe Conditions
iter value gnorm snorm #fval #grad iterCG flagCG ls_#fval ls_#grad
0 2.340112e-03 1.927880e-03
1 1.597727e-04 4.157593e-04 3.727069e+00 2 2 4 2 1 0
2 5.442664e-06 5.009082e-05 8.348624e-01 3 3 5 2 1 0
3 2.334731e-06 2.793001e-05 3.695260e+00 4 4 11 2 1 0
4 1.076543e-06 1.668824e-05 5.083248e-01 6 5 6 2 2 0
5 8.388745e-07 1.439672e-05 1.215272e+00 7 6 7 2 1 0
6 5.152760e-07 9.169432e-06 1.560582e+00 9 7 17 2 2 0
7 1.398695e-07 2.702421e-06 4.159034e-01 10 8 10 0 1 0
8 1.089003e-07 4.590927e-07 2.184686e-01 11 9 11 0 1 0
9 1.060664e-07 6.754781e-07 9.860141e-01 12 10 39 0 1 0
10 1.051188e-07 1.569364e-08 9.726082e-02 13 11 15 0 1 0
11 1.048559e-07 2.698522e-08 2.153267e-01 14 12 50 1 1 0
12 1.048544e-07 2.590253e-10 1.010471e-03 15 13 10 0 1 0
13 1.048542e-07 1.052513e-11 4.755420e-03 16 14 50 1 1 0
14 1.048542e-07 2.725544e-13 1.514146e-04 17 15 50 1 1 0
Optimization Terminated with Status: Converged
old_optimal_value = 1.0485417402273531e-07
new_optimal_value = 1.0485417402191586e-07
End Result: TEST PASSED
```
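The pass/fail decision at the end of each run is a relative-difference check of the new optimal value against a stored one. A minimal sketch of that check, using the values printed in the two logs above. The function name `relative_change` is hypothetical, and the idea that the threshold is the square root of double-precision machine epsilon is an assumption inferred only from the printed value `1.49011611938476562500e-08` (which equals 2^-26), not taken from the ROL source:

```python
import math
import sys

# Threshold printed in the failing log. Assumption: it is sqrt(machine epsilon)
# for doubles; the numeric value matches, but the ROL source was not consulted.
TOL = math.sqrt(sys.float_info.epsilon)


def relative_change(new_value, old_value):
    """Return abs(new - old) / abs(old), the quantity printed by the test."""
    return abs(new_value - old_value) / abs(old_value)


# (old_optimal_value, new_optimal_value) copied from the logs above.
runs = {
    "failing": (1.0485417402164909e-07, 3.3527637557488306e-07),
    "passing": (1.0485417402273531e-07, 1.0485417402191586e-07),
}

for name, (old, new) in runs.items():
    rel = relative_change(new, old)
    verdict = "TEST FAILED" if rel > TOL else "TEST PASSED"
    print(f"{name}: relative change = {rel:.6e}, tolerance = {TOL:.6e} -> {verdict}")
```

The failing run is off by a relative factor of about 2.2, many orders of magnitude above the threshold, so this does not look like a case where a slightly looser tolerance would help.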
If you look at the last time this test failed in an Intel PR build on 7/3/2018 [here](https://testing-vm.sandia.gov/cdash/testDetails.php?test=48059673&build=3684202), it showed the output:
```
Newton-Krylov using Conjugate Gradients
Line Search: Cubic Interpolation satisfying Strong Wolfe Conditions
iter value gnorm snorm #fval #grad iterCG flagCG ls_#fval ls_#grad
0 2.340112e-03 1.927880e-03
1 1.597727e-04 4.157593e-04 3.727069e+00 2 2 4 2 1 0
2 5.442664e-06 5.009082e-05 8.348624e-01 3 3 5 2 1 0
3 1.146552e-06 6.106086e-06 3.163006e+00 4 4 11 2 1 0
4 8.023717e-07 3.144919e-06 5.128519e-01 6 5 11 2 2 0
5 6.126545e-07 2.642767e-06 5.167993e-01 8 6 15 2 2 0
6 4.613227e-07 2.330904e-06 4.228759e-01 10 7 14 2 2 0
7 3.685626e-07 2.259062e-06 3.602303e-01 12 8 16 2 2 0
8 3.352764e-07 3.447963e-06 5.608285e-01 15 9 19 2 3 0
9 3.352764e-07 3.447963e-06 0.000000e+00 35 10 22 2 20 0
Optimization Terminated with Status: Step Tolerance Met
old_optimal_value = 1.0485417402164909e-07
new_optimal_value = 3.3527637557488306e-07
abs(new_optimal_value - old_optimal_value) / abs(old_optimal_value) = 2.19754915532174255333e+00 > 1.49011611938476562500e-08
End Result: TEST FAILED
```
If you compare the output, you can see that the passing and the failing algorithms seem to diverge on the 3rd iteration.
So it seems there is some non-deterministic behavior in this code that causes it to reach a different solution at random. Could there be several local minima, with random floating-point rounding differences sending the algorithm to a different one?
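The comparison of the two iteration histories can be automated. The sketch below parses the first rows of the failing and passing logs (copied from the output above) and reports the first iteration where the printed `value` or `gnorm` differs. The helper names `parse_rows` and `first_divergence` are hypothetical:

```python
def parse_rows(log_text):
    """Parse 'iter value gnorm ...' rows into {iter: (value, gnorm)}."""
    rows = {}
    for line in log_text.strip().splitlines():
        fields = line.split()
        # Data rows start with an integer iteration count followed by floats.
        if len(fields) >= 3 and fields[0].isdigit():
            rows[int(fields[0])] = (float(fields[1]), float(fields[2]))
    return rows


def first_divergence(log_a, log_b):
    """Return the first shared iteration whose (value, gnorm) differ, or None."""
    a, b = parse_rows(log_a), parse_rows(log_b)
    for it in sorted(set(a) & set(b)):
        if a[it] != b[it]:
            return it
    return None


FAILING = """
0 2.340112e-03 1.927880e-03
1 1.597727e-04 4.157593e-04 3.727069e+00 2 2 4 2 1 0
2 5.442664e-06 5.009082e-05 8.348624e-01 3 3 5 2 1 0
3 1.146552e-06 6.106086e-06 3.163006e+00 4 4 11 2 1 0
4 8.023717e-07 3.144919e-06 5.128519e-01 6 5 11 2 2 0
"""

PASSING = """
0 2.340112e-03 1.927880e-03
1 1.597727e-04 4.157593e-04 3.727069e+00 2 2 4 2 1 0
2 5.442664e-06 5.009082e-05 8.348624e-01 3 3 5 2 1 0
3 2.334731e-06 2.793001e-05 3.695260e+00 4 4 11 2 1 0
4 1.076543e-06 1.668824e-05 5.083248e-01 6 5 6 2 2 0
"""

print("first diverging iteration:", first_divergence(FAILING, PASSING))
```

Because the logs print only seven significant digits, the two runs may already differ below print precision before the iteration this reports.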
## Motivation and Context
This is occurring in a PR build that is blocking other developers' branch merges.
## Possible Solution
Long-term, the test should be fixed to not randomly fail.
Short-term, the test should be disabled in the Intel auto PR build. It can still be

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3543
ROL tests failing in targeted CUDA PR build Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt
2019-04-06T00:16:37Z
James Willenbring

*Created by: bartlettroscoe*
CC: @trilinos/rol , @rppawlo (Trilinos Nonlinear Solvers Product Area Lead)
## Next Action Status
## Description
The ROL package has 66 failing tests in the build `Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt` on 'white' and 'ride', as shown [here](https://testing.sandia.gov/cdash-dev-view/viewTest.php?onlyfailed&buildid=3998251), which shows the failing tests:
* ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4
* ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4
* ROL_example_PDE-OPT_0ld_poisson_example_01_MPI_4
* ROL_example_PDE-OPT_0ld_stefan-boltzmann_example_03_MPI_4
* ROL_example_PDE-OPT_navier-stokes_example_01_MPI_4
* ROL_example_PDE-OPT_navier-stokes_example_02_MPI_4
* ROL_example_PDE-OPT_nonlinear-elliptic_example_01_MPI_4
* ROL_example_PDE-OPT_nonlinear-elliptic_example_02_MPI_4
* ROL_example_PDE-OPT_obstacle_example_01_MPI_4
* ROL_example_PDE-OPT_stefan-boltzmann_example_01_MPI_4
* ROL_example_PDE-OPT_stefan-boltzmann_example_03_MPI_4
* ROL_example_PDE-OPT_topo-opt_poisson_example_01_MPI_4
* ROL_example_tempus_example_parabolic_modeleval_MPI_1
* ROL_example_tempus_example_parabolic_thyravec_MPI_1
* ROL_test_elementwise_TpetraMultiVector_MPI_4
The first failing test `ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4` with detailed output shown [here](https://testing.sandia.gov/cdash-dev-view/testDetails.php?test=55725358&build=3998251) shows:
```
Total number of processors: 4
Number of nodes = 1089
Number of cells = 1024
Number of edges = 2112
Cell offsets across processors: {0, 256, 512, 768}
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
what(): cudaGetLastError() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
[white27:11203] *** Process received signal ***
[white27:11204] *** Process received signal ***
[white27:11203] Signal: Aborted (6)
[white27:11203] Signal code: (-6)
[white27:11204] Signal: Aborted (6)
[white27:11204] Signal code: (-6)
[white27:11203] [ 0] [white27:11204] [ 0] [0x3fff90070478]
[white27:11203] [ 1] [0x3fffa3f00478]
...
```
The output of several of the other tests, which I spot-checked at random, all shows errors like the one above.
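Rather than spot-checking, the captured output of every failing test could be scanned for the CUDA error name. A minimal sketch, assuming the outputs have already been saved as text keyed by test name. The helper names, the `outputs` dictionary, and `some_other_test` are hypothetical, and the sample text is abbreviated from the log above:

```python
import re
from collections import Counter

# Matches the error name in lines such as
#   what(): cudaGetLastError() error( cudaErrorIllegalAddress): an illegal ...
CUDA_ERROR_RE = re.compile(r"cudaGetLastError\(\) error\(\s*(\w+)\s*\)")


def cuda_errors(output):
    """Return the set of CUDA error names reported in one test's output."""
    return set(CUDA_ERROR_RE.findall(output))


def summarize(outputs):
    """Count, per error name, how many tests report it."""
    counts = Counter()
    for text in outputs.values():
        # Tests with no recognized CUDA error go in a separate bucket.
        counts.update(cuda_errors(text) or {"<no CUDA error found>"})
    return counts


# Hypothetical input: test name -> captured output text (abbreviated).
outputs = {
    "ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4": (
        "terminate called after throwing an instance of 'std::runtime_error'\n"
        "what(): cudaGetLastError() error( cudaErrorIllegalAddress): "
        "an illegal memory access was encountered\n"
    ),
    "some_other_test": "End Result: TEST PASSED\n",
}

for error, count in summarize(outputs).items():
    print(f"{count} test(s): {error}")
```

If all failing tests land in the `cudaErrorIllegalAddress` bucket, that would support treating them as one underlying problem rather than many separate ones.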
This is an important build because we are targeting this build on 'white' and 'ride' as a Trilinos PR testing build (see #2464). Also, SPARC uses ROL, and as part of https://software-sandbox.sandia.gov/jira/browse/TRIL-212 we are about to update the ATDM Trilinos configuration to test ROL on many platforms (including CUDA builds), so it is critical to get these tests cleaned up for ATDM.
## Steps to reproduce
One should be able to reproduce these build errors on either 'white' or 'ride' by cloning the Trilinos git repo, checking out the 'develop' branch, creating a build directory, and then doing:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-release-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_ROL=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
```
Initial cleanup of new ATDM builds of Trilinos