Random timeouts in auto PR builds
Created by: bartlettroscoe
@trilinos/framework
Expectations
Tests will not fail or time out unless changes in a PR branch cause the failures or timeouts.
Current Behavior
Several auto PR build iterations have been failing for a while due to random timeouts in various auto PR builds. For example, the first PR testing iteration in my PR #3104 failed last night due to random timeouts as shown here. It is impossible for that one change to have impacted these test timeouts in any way.
You can see timeouts impacting other PR testing iterations as well in just the last 12 days in this query, which shows 12 tests timing out in 3 different PRs. Looking at the numbering of the builds, it seems likely that up to 4 or 5 PR testing iterations failed due to this issue. (It is hard to know exactly how many PR testing iterations failed because of these random timeouts: the naming of the PR builds makes it hard to tell apart different PR testing iterations in the same PR, or to match up which builds with different compilers belong to the same PR testing iteration.) Also, it is possible that some of these timeouts were due to changes in the PR branch.
Motivation and Context
We want PR testing iterations to fail only if the failures are caused by changes in the specific topic branch being tested.
While 4 or 5 failed PR testing iterations over 12 days may not seem like a lot, when combined with randomly failing tests (see #3103) and random failures in the Jenkins jobs (git pull fails, etc.), these add up to make the auto PR testing pretty unstable.
Definition of Done
- Eliminate randomly failing tests.
Possible Solution
The likely cause is that the Jenkins build farm machines are being overloaded. The very setup of the Jenkins site allows for this to occur because jobs can use more cores (i.e. "executors") than they declare and therefore overload the machine. Many of these timeouts occur late at night or in the early morning when the Jenkins build farm machines are likely to be processing nightly jobs.
Steps to Reproduce
Hard to reproduce because these are random failures. I wish we had load statistics on these Jenkins build farm machines so that we could know how loaded they are when we see timeouts like this.
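To make the load-statistics idea concrete, here is a minimal sketch of a sampler that could run periodically on a build farm machine. It is not an existing Trilinos or Jenkins tool: the function names, the CSV layout, and the "1-minute load average above the core count" overload rule are all assumptions for illustration, and `os.getloadavg()` is only available on Unix-like systems.

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical CSV layout; nothing in the existing PR tester defines this.
FIELDS = ["timestamp", "cores", "load1", "load5", "load15", "overloaded"]


def is_overloaded(load1, cores):
    """Assumed rule of thumb: more runnable processes than cores."""
    return load1 > cores


def sample_load():
    """Take one load sample for the current machine (Unix only)."""
    load1, load5, load15 = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "cores": cores,
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "overloaded": is_overloaded(load1, cores),
    }


def append_sample(path, sample):
    """Append one sample to a CSV file, writing the header for a new file."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(sample)
```

Calling `append_sample("load_samples.csv", sample_load())` every minute or so (for example from cron) would give a timestamped record that could be lined up against the times of the timed-out tests on CDash. The file name here is just a placeholder.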