Test ROL_example_poisson-inversion_example_01_MPI_1 randomly failing in PR Intel build

Created by: bartlettroscoe

CC: @trilinos/framework, @trilinos/rol, @rppawlo (Trilinos Nonlinear Solvers Product Lead)

Next Action Status

PR #3104, merged to 'develop' on 7/13/2018, disables ROL_example_poisson-inversion_example_01_MPI_1 in the Intel PR test build. Next: ROL developers fix the behavior of the test offline ...

Description

As you can see in this query, the test ROL_example_poisson-inversion_example_01_MPI_1 seems to be failing randomly in the Intel PR build. This just killed my PR testing iteration shown in #3100. (Now I have to put on an `AT: RETEST` label, hope this does not fail again, and then stay up late to click the "merge" button in order for this to clean up the build tomorrow.)

In the case of the #3100 PR testing iteration, the failing test output shows:

Newton-Krylov using Conjugate Gradients
Line Search: Cubic Interpolation satisfying Strong Wolfe Conditions
  iter  value          gnorm          snorm          #fval     #grad     iterCG    flagCG    ls_#fval  ls_#grad  
  0     2.340112e-03   1.927880e-03   
  1     1.597727e-04   4.157593e-04   3.727069e+00   2         2         4         2         1         0         
  2     5.442664e-06   5.009082e-05   8.348624e-01   3         3         5         2         1         0         
  3     1.146552e-06   6.106086e-06   3.163006e+00   4         4         11        2         1         0         
  4     8.023717e-07   3.144919e-06   5.128519e-01   6         5         11        2         2         0         
  5     6.126545e-07   2.642767e-06   5.167993e-01   8         6         15        2         2         0         
  6     4.613227e-07   2.330904e-06   4.228759e-01   10        7         14        2         2         0         
  7     3.685626e-07   2.259062e-06   3.602303e-01   12        8         16        2         2         0         
  8     3.352764e-07   3.447963e-06   5.608285e-01   15        9         19        2         3         0         
  9     3.352764e-07   3.447963e-06   0.000000e+00   35        10        22        2         20        0         
Optimization Terminated with Status: Step Tolerance Met
old_optimal_value = 1.0485417402164909e-07
new_optimal_value = 3.3527637557488306e-07

abs(new_optimal_value - old_optimal_value) / abs(old_optimal_value)  = 2.19754915532174255333e+00 > 1.49011611938476562500e-08
End Result: TEST FAILED
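
As a sanity check on the comparison printed above, here is a minimal sketch that reproduces the arithmetic from the two logged values. The observation that the printed threshold equals the square root of double-precision machine epsilon is mine, based only on the number in the log; I have not checked how the test source actually defines its tolerance.

```python
import math
import sys

# Values copied from the failing test output above.
old_optimal_value = 1.0485417402164909e-07
new_optimal_value = 3.3527637557488306e-07

# Assumption: the threshold in the log is sqrt(machine epsilon) for doubles.
# This is inferred from the printed number, not from the test source.
tolerance = math.sqrt(sys.float_info.epsilon)

# Same relative-difference formula that the test prints.
relative_error = abs(new_optimal_value - old_optimal_value) / abs(old_optimal_value)

print(f"relative error = {relative_error:.6f}")
print(f"tolerance      = {tolerance:.20e}")
print("TEST FAILED" if relative_error > tolerance else "TEST PASSED")
```

The relative difference is above 2, so the failing run is not a near miss on a tight tolerance. It stopped at a different objective value altogether.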

If you look at a previous Intel PR build shown here, it shows the output:

Newton-Krylov using Conjugate Gradients
Line Search: Cubic Interpolation satisfying Strong Wolfe Conditions
  iter  value          gnorm          snorm          #fval     #grad     iterCG    flagCG    ls_#fval  ls_#grad  
  0     2.340112e-03   1.927880e-03   
  1     1.597727e-04   4.157593e-04   3.727069e+00   2         2         4         2         1         0         
  2     5.442664e-06   5.009082e-05   8.348624e-01   3         3         5         2         1         0         
  3     2.334731e-06   2.793001e-05   3.695260e+00   4         4         11        2         1         0         
  4     1.076543e-06   1.668824e-05   5.083248e-01   6         5         6         2         2         0         
  5     8.388745e-07   1.439672e-05   1.215272e+00   7         6         7         2         1         0         
  6     5.152760e-07   9.169432e-06   1.560582e+00   9         7         17        2         2         0         
  7     1.398695e-07   2.702421e-06   4.159034e-01   10        8         10        0         1         0         
  8     1.089003e-07   4.590927e-07   2.184686e-01   11        9         11        0         1         0         
  9     1.060664e-07   6.754781e-07   9.860141e-01   12        10        39        0         1         0         
  10    1.051188e-07   1.569364e-08   9.726082e-02   13        11        15        0         1         0         
  11    1.048559e-07   2.698522e-08   2.153267e-01   14        12        50        1         1         0         
  12    1.048544e-07   2.590253e-10   1.010471e-03   15        13        10        0         1         0         
  13    1.048542e-07   1.052513e-11   4.755420e-03   16        14        50        1         1         0         
  14    1.048542e-07   2.725544e-13   1.514146e-04   17        15        50        1         1         0         
Optimization Terminated with Status: Converged
old_optimal_value = 1.0485417402273531e-07
new_optimal_value = 1.0485417402191586e-07
End Result: TEST PASSED

If you look at the last time this test failed in an Intel PR build, on 7/3/2018 here, it showed the output:

Newton-Krylov using Conjugate Gradients
Line Search: Cubic Interpolation satisfying Strong Wolfe Conditions
  iter  value          gnorm          snorm          #fval     #grad     iterCG    flagCG    ls_#fval  ls_#grad  
  0     2.340112e-03   1.927880e-03   
  1     1.597727e-04   4.157593e-04   3.727069e+00   2         2         4         2         1         0         
  2     5.442664e-06   5.009082e-05   8.348624e-01   3         3         5         2         1         0         
  3     1.146552e-06   6.106086e-06   3.163006e+00   4         4         11        2         1         0         
  4     8.023717e-07   3.144919e-06   5.128519e-01   6         5         11        2         2         0         
  5     6.126545e-07   2.642767e-06   5.167993e-01   8         6         15        2         2         0         
  6     4.613227e-07   2.330904e-06   4.228759e-01   10        7         14        2         2         0         
  7     3.685626e-07   2.259062e-06   3.602303e-01   12        8         16        2         2         0         
  8     3.352764e-07   3.447963e-06   5.608285e-01   15        9         19        2         3         0         
  9     3.352764e-07   3.447963e-06   0.000000e+00   35        10        22        2         20        0         
Optimization Terminated with Status: Step Tolerance Met
old_optimal_value = 1.0485417402164909e-07
new_optimal_value = 3.3527637557488306e-07

abs(new_optimal_value - old_optimal_value) / abs(old_optimal_value)  = 2.19754915532174255333e+00 > 1.49011611938476562500e-08
End Result: TEST FAILED

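If you compare the output, you can see that the passing and failing runs print identical values through iteration 2 and first diverge at iteration 3. The two failing runs (#3100 and 7/3/2018) match each other line for line.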
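
For anyone who wants to repeat this comparison on other runs, here is a rough sketch of how the first differing iteration could be located. The rows are the iter, value and gnorm columns copied from the logs above; the helper name `first_divergence` is made up for illustration and is not part of ROL or the test.

```python
# First rows (iter, value, gnorm) copied from the failing log above.
failing = [
    (0, "2.340112e-03", "1.927880e-03"),
    (1, "1.597727e-04", "4.157593e-04"),
    (2, "5.442664e-06", "5.009082e-05"),
    (3, "1.146552e-06", "6.106086e-06"),
    (4, "8.023717e-07", "3.144919e-06"),
]

# Same columns copied from the passing log above.
passing = [
    (0, "2.340112e-03", "1.927880e-03"),
    (1, "1.597727e-04", "4.157593e-04"),
    (2, "5.442664e-06", "5.009082e-05"),
    (3, "2.334731e-06", "2.793001e-05"),
    (4, "1.076543e-06", "1.668824e-05"),
]


def first_divergence(run_a, run_b):
    """Return the first iteration whose printed columns differ, or None."""
    for row_a, row_b in zip(run_a, run_b):
        if row_a != row_b:
            return row_a[0]
    return None


print(first_divergence(failing, passing))  # prints 3
```

Note that this only compares the seven significant digits that the log prints. The runs could already differ below that precision in iterations 0 through 2, so iteration 3 is where the difference becomes visible, not necessarily where it starts.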

So it seems there is some non-deterministic behavior in this code that causes it to reach a different solution randomly. Could there be different local minima, with random differences in floating-point rounding causing the algorithm to end up at one or the other?
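
To illustrate the kind of rounding difference I have in mind (this is a generic floating-point example, not a diagnosis of this test, and I have not confirmed that summation order is what varies here), a minimal sketch showing that adding the same numbers in a different order can change the result:

```python
# The same three numbers, summed with two different groupings.
values = [0.1, 0.2, 0.3]

left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# Floating-point addition is not associative, so these need not be equal.
print(left_to_right == right_to_left)  # prints False
print(abs(left_to_right - right_to_left))
```

A difference of this size is harmless on its own, but an iterative method that makes discrete decisions (for example, accepting or rejecting a line-search step) could in principle take a different path because of it.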

Motivation and Context

This is occurring in a PR build that is blocking other developers' branch merges.

Possible Solution

Long-term, the test should be fixed to not randomly fail.

Short-term, the test should be disabled in the Intel auto PR build. It can still be