PR testing instability
Created by: jwillenbring
@trilinos/framework
Expectations
We need PR testing for Trilinos that runs reliably, both at the current scale and as we continue to scale up the testing to include additional builds.
Current Behavior
There are two primary causes of instability in the PR testing currently:
- Communication issues
- Machine overloading issues
We have seen a number of distinct types of communication issues. Here is a partial list:
a) Clone/fetch timeouts on GitHub
b) Clone/fetch timeouts on internal gitlab-ex
c) Failed reporting of results to CDash
d) Failure to communicate PR info from GitHub issues to the autotester
e) Failure to communicate from the autotester back to GitHub issues
Machine overloading issues have shown up in the form of internal compiler errors and test timeouts.
Motivation and Context
We need to be able to run our PR testing in parallel (multiple PRs being tested at the same time, and multiple builds for each PR running at the same time). If the environment won't scale, we will not be able to test PRs fast enough to make the system practical.
Definition of Done
Communication errors occur less than once per day on average. Machine overload issues occur less than twice per week.
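To make these thresholds easy to check, here is a minimal sketch of how the two rates could be computed from incident counts over an observation window. The function name, its parameters, and the example dates are all hypothetical; nothing here reflects how the autotester or CDash actually records failures.

```python
from datetime import date


def meets_definition_of_done(comm_errors, overload_issues, start, end):
    """Check both thresholds over the inclusive window [start, end].

    comm_errors and overload_issues are incident counts collected by
    hand or by some other tool (assumed inputs, not a real API).
    """
    days = (end - start).days + 1
    if days <= 0:
        raise ValueError("end must not be before start")

    comm_per_day = comm_errors / days             # average per day
    overload_per_week = overload_issues * 7 / days  # average per week

    # Both limits are strict: "less than once per day" and
    # "less than twice per week".
    return comm_per_day < 1.0 and overload_per_week < 2.0


# Hypothetical two-week window with 10 communication errors
# and 3 overload issues.
ok = meets_definition_of_done(10, 3, date(2018, 1, 1), date(2018, 1, 14))
```

A longer window smooths out bad days, so the window length should be agreed on before the numbers are used to close this issue.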
Possible Solution
We are looking into several possible solutions.
- Reducing the -j value (number of parallel build jobs) for builds so we don't run into internal compiler errors
- Reducing the number of available executors on testing nodes so that fewer tests can run at the same time
- Speaking with networking staff about why operations are timing out
- Reducing the number and/or frequency of PR testing instances to reduce the average communication traffic
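The first two options interact: the load on a node is roughly the number of executors times the -j value used by each build. A rough sketch of picking a -j value so that the total stays within the core count is shown below. It assumes CPU oversubscription is the limiting factor; if the internal compiler errors are caused by memory exhaustion instead, a memory-based bound would be needed. The function and parameter names are hypothetical.

```python
def jobs_per_build(cores, executors, reserve=0):
    """Suggest a -j value so executors * j stays within usable cores.

    cores:     physical cores on the testing node
    executors: builds that may run on the node at the same time
    reserve:   cores held back for the OS and other services (assumed)
    """
    if cores < 1 or executors < 1:
        raise ValueError("cores and executors must be positive")

    usable = max(cores - reserve, 1)
    # Integer division rounds down so the total never exceeds usable
    # cores; each build always gets at least one job.
    return max(usable // executors, 1)


# Hypothetical node: 32 cores, 4 executors, 4 cores held in reserve.
j = jobs_per_build(32, 4, reserve=4)  # 7
```

With these hypothetical numbers, cutting the executors from 4 to 2 would allow -j 14 per build at the same total load, which is the trade-off between the first two options.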