Test ROL_example_poisson-inversion_example_01_MPI_1 randomly failing in Intel PR build
Created by: bartlettroscoe
CC: @trilinos/framework, @trilinos/rol, @rppawlo (Trilinos Nonlinear Solvers Product Lead)
Next Action Status
PR #3104 merged to 'develop' on 7/13/2018 which disables ROL_example_poisson-inversion_example_01_MPI_1
in Intel PR test build. Next: ROL developers fix behavior of test offline ...
Description
As you can see in this query, the test ROL_example_poisson-inversion_example_01_MPI_1 seems to be failing randomly in the Intel PR build. This just killed my PR testing iteration shown in #3100. (Now I have to put on an `AT: RETEST` and hope this does not fail again, and then stay up late to click the "merge" button in order for this to clean up the build tomorrow.)
In the case of the #3100 PR testing iteration, the failing test output shows:
Newton-Krylov using Conjugate Gradients
Line Search: Cubic Interpolation satisfying Strong Wolfe Conditions
iter value gnorm snorm #fval #grad iterCG flagCG ls_#fval ls_#grad
0 2.340112e-03 1.927880e-03
1 1.597727e-04 4.157593e-04 3.727069e+00 2 2 4 2 1 0
2 5.442664e-06 5.009082e-05 8.348624e-01 3 3 5 2 1 0
3 1.146552e-06 6.106086e-06 3.163006e+00 4 4 11 2 1 0
4 8.023717e-07 3.144919e-06 5.128519e-01 6 5 11 2 2 0
5 6.126545e-07 2.642767e-06 5.167993e-01 8 6 15 2 2 0
6 4.613227e-07 2.330904e-06 4.228759e-01 10 7 14 2 2 0
7 3.685626e-07 2.259062e-06 3.602303e-01 12 8 16 2 2 0
8 3.352764e-07 3.447963e-06 5.608285e-01 15 9 19 2 3 0
9 3.352764e-07 3.447963e-06 0.000000e+00 35 10 22 2 20 0
Optimization Terminated with Status: Step Tolerance Met
old_optimal_value = 1.0485417402164909e-07
new_optimal_value = 3.3527637557488306e-07
abs(new_optimal_value - old_optimal_value) / abs(old_optimal_value) = 2.19754915532174255333e+00 > 1.49011611938476562500e-08
End Result: TEST FAILED
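For reference, the pass/fail check printed at the end of the output can be reproduced with a few lines of Python. This is only a sketch of the arithmetic shown in the log, not the test's actual source. The function name `relative_change` is made up for illustration, and the assumption that the threshold is the square root of double-precision machine epsilon is inferred from the printed value (1.49011611938476562500e-08 equals 2^-26).

```python
import math
import sys


def relative_change(old_value: float, new_value: float) -> float:
    """Relative difference as printed in the log: |new - old| / |old|."""
    return abs(new_value - old_value) / abs(old_value)


# Assumption: the printed threshold matches sqrt(machine epsilon) for doubles.
tolerance = math.sqrt(sys.float_info.epsilon)

# Values copied from the failing output above.
failing_old = 1.0485417402164909e-07
failing_new = 3.3527637557488306e-07
failing_rel = relative_change(failing_old, failing_new)

# Values copied from the passing output of the earlier Intel PR build.
passing_old = 1.0485417402273531e-07
passing_new = 1.0485417402191586e-07
passing_rel = relative_change(passing_old, passing_new)

print(f"tolerance   = {tolerance:.20e}")
print(f"failing run = {failing_rel:.6e} -> {'FAILED' if failing_rel > tolerance else 'PASSED'}")
print(f"passing run = {passing_rel:.6e} -> {'FAILED' if passing_rel > tolerance else 'PASSED'}")
```

The failing run misses the stored optimal value by a factor of about 2.2, so this is not a case of a tolerance that is merely a little too tight.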
If you look at a previous Intel PR build shown here, it shows the output:
Newton-Krylov using Conjugate Gradients
Line Search: Cubic Interpolation satisfying Strong Wolfe Conditions
iter value gnorm snorm #fval #grad iterCG flagCG ls_#fval ls_#grad
0 2.340112e-03 1.927880e-03
1 1.597727e-04 4.157593e-04 3.727069e+00 2 2 4 2 1 0
2 5.442664e-06 5.009082e-05 8.348624e-01 3 3 5 2 1 0
3 2.334731e-06 2.793001e-05 3.695260e+00 4 4 11 2 1 0
4 1.076543e-06 1.668824e-05 5.083248e-01 6 5 6 2 2 0
5 8.388745e-07 1.439672e-05 1.215272e+00 7 6 7 2 1 0
6 5.152760e-07 9.169432e-06 1.560582e+00 9 7 17 2 2 0
7 1.398695e-07 2.702421e-06 4.159034e-01 10 8 10 0 1 0
8 1.089003e-07 4.590927e-07 2.184686e-01 11 9 11 0 1 0
9 1.060664e-07 6.754781e-07 9.860141e-01 12 10 39 0 1 0
10 1.051188e-07 1.569364e-08 9.726082e-02 13 11 15 0 1 0
11 1.048559e-07 2.698522e-08 2.153267e-01 14 12 50 1 1 0
12 1.048544e-07 2.590253e-10 1.010471e-03 15 13 10 0 1 0
13 1.048542e-07 1.052513e-11 4.755420e-03 16 14 50 1 1 0
14 1.048542e-07 2.725544e-13 1.514146e-04 17 15 50 1 1 0
Optimization Terminated with Status: Converged
old_optimal_value = 1.0485417402273531e-07
new_optimal_value = 1.0485417402191586e-07
End Result: TEST PASSED
If you look at the last time this test failed in an Intel PR build, on 7/3/2018 here, it showed the output:
Newton-Krylov using Conjugate Gradients
Line Search: Cubic Interpolation satisfying Strong Wolfe Conditions
iter value gnorm snorm #fval #grad iterCG flagCG ls_#fval ls_#grad
0 2.340112e-03 1.927880e-03
1 1.597727e-04 4.157593e-04 3.727069e+00 2 2 4 2 1 0
2 5.442664e-06 5.009082e-05 8.348624e-01 3 3 5 2 1 0
3 1.146552e-06 6.106086e-06 3.163006e+00 4 4 11 2 1 0
4 8.023717e-07 3.144919e-06 5.128519e-01 6 5 11 2 2 0
5 6.126545e-07 2.642767e-06 5.167993e-01 8 6 15 2 2 0
6 4.613227e-07 2.330904e-06 4.228759e-01 10 7 14 2 2 0
7 3.685626e-07 2.259062e-06 3.602303e-01 12 8 16 2 2 0
8 3.352764e-07 3.447963e-06 5.608285e-01 15 9 19 2 3 0
9 3.352764e-07 3.447963e-06 0.000000e+00 35 10 22 2 20 0
Optimization Terminated with Status: Step Tolerance Met
old_optimal_value = 1.0485417402164909e-07
new_optimal_value = 3.3527637557488306e-07
abs(new_optimal_value - old_optimal_value) / abs(old_optimal_value) = 2.19754915532174255333e+00 > 1.49011611938476562500e-08
End Result: TEST FAILED
If you compare the output, you can see that the passing and the failing runs are identical through iteration 2 and diverge at iteration 3.
So it seems there is some non-deterministic behavior in this code that causes it to reach a different solution randomly. Could there be different local minima, with random floating-point rounding causing the algorithm to take a different path?
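The comparison of the two outputs can also be done mechanically. Below is a minimal sketch that finds the first iteration row where two logs differ. The helper name `first_divergence` is hypothetical, and the rows are the first four iterations copied from the failing and passing outputs above.

```python
from typing import Optional, Sequence


def first_divergence(rows_a: Sequence[str], rows_b: Sequence[str]) -> Optional[int]:
    """Return the index of the first row whose fields differ, or None if
    the rows agree over the length of the shorter log."""
    for index, (row_a, row_b) in enumerate(zip(rows_a, rows_b)):
        # Compare whitespace-separated fields so column alignment does not matter.
        if row_a.split() != row_b.split():
            return index
    return None


# Iterations 0-3 from the failing output.
failing_rows = [
    "0 2.340112e-03 1.927880e-03",
    "1 1.597727e-04 4.157593e-04 3.727069e+00 2 2 4 2 1 0",
    "2 5.442664e-06 5.009082e-05 8.348624e-01 3 3 5 2 1 0",
    "3 1.146552e-06 6.106086e-06 3.163006e+00 4 4 11 2 1 0",
]

# Iterations 0-3 from the passing output.
passing_rows = [
    "0 2.340112e-03 1.927880e-03",
    "1 1.597727e-04 4.157593e-04 3.727069e+00 2 2 4 2 1 0",
    "2 5.442664e-06 5.009082e-05 8.348624e-01 3 3 5 2 1 0",
    "3 2.334731e-06 2.793001e-05 3.695260e+00 4 4 11 2 1 0",
]

diverged_at = first_divergence(failing_rows, passing_rows)
print(f"first differing iteration: {diverged_at}")
```

Note that at iteration 3 the counters (#fval, #grad, iterCG, flagCG) are the same in both runs while value, gnorm, and snorm differ, so the two runs did the same amount of work on that step but produced different numbers.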
Motivation and Context
This is occurring in a PR build and is blocking other developers' branch merges.
Possible Solution
Long-term, the test should be fixed to not randomly fail.
Short-term, the test should be disabled in the Intel auto PR build. It can still be