Change from individual CDash error emails to daily summary emails for the ATDM Trilinos builds (and perhaps other efforts)
Created by: bartlettroscoe
CC: @fryeguy52, @trilinos/framework, @dridzal
Description
After having to triage the promoted ATDM Trilinos builds for a couple of months now, and from extensive experience on other projects like CASL VERA, I have come to the realization that relying on CDash error emails is not a very effective notification and monitoring scheme in many of these situations. The reasons that CDash error emails are not effective for keeping on top of a lot of builds are:
- It is hard to tell if a failing test is new that day, has been failing for multiple days, or is failing across several builds. (All you get is a single email telling you that there is a failure for that one build.)
- When a failure does occur that results in a CDash error email, there is an urgency to address the problem ASAP (by either fixing, disabling, or reverting commits) in order to make the CDash error email go away. Otherwise, repeated CDash error emails day after day make people accustomed to seeing them, and then new failures get ignored (and many people will create email filters and just ignore them from that point on).
- Catastrophic failures due to system issues can result in a huge number of CDash error emails that spam people (sometimes a Trilinos developer can get a dozen or more emails, since they are on several different package regression lists). This can happen for many reasons, like the disk filling up, the Intel license server going down, or a module failing to load correctly. The glut of CDash error emails in these cases can obscure new real failures and can cause some people to add email filters (which then makes the CDash error emails worthless).
Instead of relying on individual CDash error emails, we could move to a notification scheme that created a single email each day that summarized the builds and tests and gave some information about the history of failing tests. Such a system could solve all of the problems listed above and make top-level triaging and monitoring of a bunch of related builds much easier.
(NOTE: Really CDash error notification emails are the best solution for a small number of post-push CI builds that you expect to fail only very rarely and you need a notification ASAP. For nightly builds, they are not effective for the reasons described above.)
Possible Solution
It seems that a straightforward solution would be to write a Python script that extracts data from the CDash site using multiple queries through the API interface that provides data as JSON data structures. The Python script would analyze the data and create an HTML-formatted email with useful summary information and CDash URL links.
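For example, a minimal sketch of such a query (assuming the CDash instance serves JSON through an `api/v1/` path, which depends on the CDash version) might look like:

```python
import json
import urllib.request

def get_cdash_json(cdash_site_url, page, query_fields):
    # e.g. page="api/v1/queryTests.php", query_fields="project=Trilinos&date=2018-07-14"
    url = cdash_site_url.rstrip('/') + '/' + page + '?' + query_fields
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))

# Pull the test results for one testing day:
tests_json = get_cdash_json(
    "https://testing-vm.sandia.gov/cdash",
    "api/v1/queryTests.php",
    "project=Trilinos&date=2018-07-14")
```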
The full specification is given at:
The input that one would provide to the Python script would be (see the dictionary sketch after this list):
- Name of the set of builds being analyzed (e.g. "ATDM Trilinos Builds")
- Base CDash site (e.g. "https://testing-vm.sandia.gov/cdash/")
- CDash project name (e.g. `project=Trilinos`)
- Current testing day (e.g. "YYYY-MM-DD")
- CDash query URL fields (minus `date`, `project`, etc.) for queryTests.php to determine the tests to be examined
- CDash query URL fields (minus `date`, `project`, etc.) for index.php to determine the list of builds to be examined
- List of expected builds as ('site', 'build-name', 'group') triplets
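To make the list above concrete, the inputs might be bundled into a single Python dictionary like the following (all key names and filter fields here are hypothetical, just to fix ideas):

```python
cdash_report_inputs = {
    'buildSetName'     : "ATDM Trilinos Builds",
    'cdashSiteUrl'     : "https://testing-vm.sandia.gov/cdash/",
    'cdashProjectName' : "Trilinos",
    'date'             : "2018-07-14",  # current testing day (YYYY-MM-DD)
    # Extra URL fields (minus date, project, etc.) for the two queries;
    # the filter syntax shown is illustrative, not verified against CDash:
    'queryTestsFields' : "group=ATDM",
    'indexPhpFields'   : "group=ATDM",
    # Expected builds as ('site', 'build-name', 'group') triplets:
    'expectedBuilds'   : [
        ('some-site', 'some-atdm-build-name', 'ATDM'),  # placeholder entry
    ],
}
```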
Given this data, the Python script would run queries and extract data from the queryTests.php page for the current day and the previous two testing days (using the `date=YYYY-MM-DD` URL field) and then display that data as described below (sorted into various lists).
The Python script would then run the query on the index.php page, note the builds that had any configure, build, or test failures (including "not run" tests), compare the list of builds extracted against the input list of expected builds, and note the expected builds that did not show up.
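A sketch of that expected-builds comparison, assuming the index.php JSON groups builds under a `buildgroups` key (the field names are guesses at the schema, not verified):

```python
def find_missing_expected_builds(index_json, expected_builds):
    # Collect a ('site', 'build-name', 'group') triplet for every build found:
    builds_found = set()
    for buildgroup in index_json.get('buildgroups', []):
        for build in buildgroup.get('builds', []):
            builds_found.add((build['site'], build['buildname'], buildgroup['name']))
    # Any expected triplet not found is a missing expected build:
    return [eb for eb in expected_builds if tuple(eb) not in builds_found]
```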
Then the Python script would construct an HTML-formatted email with the body having the following data (a sketch of the day-over-day bucketing follows this list):
- (limited) List of tests that failed today but not the previous day (`t1=???` in summary line)
- (limited) List of tests that failed today and the previous day but not the day before that (`t2=???` in summary line)
- (limited) List of tests that failed today and the previous two consecutive days (`t3+=???` in summary line)
- Total number of "not run" (non-disabled) tests for the current testing day and a CDash URL to that list (`tnr=???` in summary line)
- List of current-day builds that had any configure, build, or test failures (including "not run" tests) (`b=???` in the summary line is the sum of the build failures in those builds)
- List of missing expected builds or builds that exist and pass the configure but don't have test results (`meb=???` in summary line) (NOTE: The current CDash implementation will only alert about missing expected builds; it will not alert about builds with missing tests.)
- Total number of builds run and a URL to the list of builds
- Total number of failing tests for the current testing day and the CDash URL
- URL(s) to the list of all failing tests for the current day (but excluding "not run" tests)
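Under the hood, the `t1`/`t2`/`t3+` buckets fall out of simple set operations over the three days of failing-test data. A minimal sketch, assuming each day's failures are collected as a set of (site, build-name, test-name) keys:

```python
def categorize_failing_tests(failed_today, failed_yesterday, failed_day_before):
    # Each argument is a set of (site, build-name, test-name) keys for the
    # tests that failed on that testing day.
    t1 = failed_today - failed_yesterday                          # new failures today
    t2 = (failed_today & failed_yesterday) - failed_day_before    # failed 2 consecutive days
    t3plus = failed_today & failed_yesterday & failed_day_before  # failed 3+ consecutive days
    return t1, t2, t3plus
```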
The summary line for the email could be something like:
`FAILED (t1=2, t2=1, t3+=5, tnr=18, b=3, meb=1): ATDM Trilinos Builds`
That email summary message would look similar to the ones that CDash sends out, and one could see just in the summary line how many tests newly failed in the current testing day (i.e. `t1=2`), how many tests failed in the last two consecutive days (i.e. `t2=1`), and how many tests failed in the last three or more consecutive days (i.e. `t3+=5`). It would also show if there were any build failures (i.e. `b=3`) and how many tests were not run (`tnr=18`). Lastly, it would show if there were any missing expected builds (`meb=1`).
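Assembling that summary line is then just string formatting over the computed counts. A sketch (the `PASSED` variant with `tnp`/`tnm` shown later would need its own branch):

```python
def build_summary_line(build_set_name, t1, t2, t3plus, tnr, b, meb):
    # Fail overall if any bucket is non-empty or any count is non-zero:
    status = "FAILED" if (len(t1) + len(t2) + len(t3plus) + tnr + b + meb) else "PASSED"
    return "%s (t1=%d, t2=%d, t3+=%d, tnr=%d, b=%d, meb=%d): %s" % (
        status, len(t1), len(t2), len(t3plus), tnr, b, meb, build_set_name)
```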
For the ATDM Trilinos builds, we could run this script from a cron job or a Jenkins job after 12 midnight MT or so (or wait until 5 am to allow all of the jobs to finish).
Other data we might consider reporting on and showing are:
- Number of, URL to, and (limited) list of newly passing tests for the current testing day that failed the previous day (or the last day that the matching builds had any test results) (`tnp=???` in summary line)
- Number of, URL to, and (limited) list of newly missing tests compared to yesterday (but only if the build ran the current day and the tests ran for that build, and likewise for the previous day) (`tnm=???` in summary line)
The above two bits of data would really help in determining that failing tests got resolved (either by fixing them or temporarily disabling them).
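These could be computed with the same set machinery as above. A rough sketch, assuming we also track which tests produced any result each day (a real implementation would also apply the "build ran both days" caveat noted above):

```python
def newly_passing_and_missing(failed_today, failed_yesterday, ran_today, ran_yesterday):
    # ran_* are sets of (site, build-name, test-name) keys for tests that
    # produced any result that day; failed_* as in the earlier sketch.
    tnp = failed_yesterday & (ran_today - failed_today)  # failed yesterday, ran and passed today
    tnm = ran_yesterday - ran_today                      # ran yesterday, no result today
    return tnp, tnm
```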
And since you would only get one email, I think it would be good to send out the email with the summary line:
`PASSED (tnp=2, tnm=1): ATDM Trilinos Builds`
and that email would contain links to the set of 100% passing builds!
That is an email that even a manager might want to get :-)
This script could also allow you to specify a set of "expected may fail" tests, which would be provided in an array with the four fields `[<test-name>, <build-name>, <site-name>, <github-issue-link>]`. Any failing tests that matched these criteria would be listed in their own sublist in the email and could be given `tef=???` in the summary line. These failing tests would not be counted against global pass/fail when they fail, but if they go from failing to passing, that would be listed along with the other "newly passing tests" (e.g. `tnp`).
A better way to handle this would be to have CTest/CDash mark such tests as EXPECTED_MAY_FAIL, as described in this CTest/CDash backlog item, and then this script would automatically handle these tests differently without a separate list having to be provided to this script. However, allowing someone to label a certain test as "expected may fail" specifically in this script would allow different customers to handle the same test differently. For example, one customer might consider a failing MueLu test a show stopper that affects global pass/fail, while another may not and would therefore want to handle it as an "expected may fail" test that does not affect global pass/fail. You can't do that with a single CTest/CDash property for each test. But without direct CTest/CDash support, the email body would list out the failing test along with the `<github-issue-link>` so one could immediately go to that issue to see how that failing test is being addressed.
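A sketch of how the script might split those tests out of the failing-test list (the entry values and the failing-test dict fields below are illustrative assumptions, not real tests or a verified CDash schema):

```python
# Placeholder entry; all values are illustrative:
expected_may_fail_tests = [
    ["SomePackage_some_test", "some-atdm-build-name", "some-site",
     "https://github.com/trilinos/Trilinos/issues/NNNN"],
]

def split_out_expected_may_fail(failing_tests, expected_may_fail_tests):
    # failing_tests entries are assumed to be dicts with 'testname',
    # 'buildname', and 'site' fields pulled from the CDash JSON.
    emf_issue_links = {(e[0], e[1], e[2]): e[3] for e in expected_may_fail_tests}
    expected, unexpected = [], []
    for ft in failing_tests:
        key = (ft['testname'], ft['buildname'], ft['site'])
        if key in emf_issue_links:
            ft['issueLink'] = emf_issue_links[key]  # shown next to the test in the email
            expected.append(ft)
        else:
            unexpected.append(ft)
    return expected, unexpected
```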
Even for known failing tests that we did want to impact global pass/fail (and therefore did not mark as "expected may fail"), it would be nice to mark them with the GitHub issue ID if the failure is known and is being tracked. This could be done by passing in an array of "known failing" tests with entries `[<test-name>, <build-name>, <site-name>, <github-issue-link>]`. This would be useful when looking at the summary email to know if we needed to triage those tests or not. (That is, if one sees failing tests that have failed for more than one consecutive day and don't have a GitHub issue associated with them, then that would be a trigger to triage the failure, create a new Trilinos GitHub issue, and then add the test to the "expected may fail" or "known failing" tests lists.)
The script could also allow you to specify some "flaky" or "unstable" builds as an array of `[<build-name>, <site-name>]` entries where we expect random test failures. If a test failed in one of these "flaky" or "unstable" builds, then it would be reported in a separate section of the email and would not count toward the global pass/fail. Currently (as of 7/14/2018) we would categorize all of the ATDM Trilinos builds on 'ride' (see #2511 (closed)) and the builds on 'mutrino' (see TRIL-214) in this category. That way, we could keep track of these builds in case something big went wrong, but they would not count toward global pass/fail (and therefore would not disrupt automated processes that update Trilinos between branches and application customers). But if more than a small number of test failures occurred (e.g. 4 tests per build), then this could impact global pass/fail. That would keep a new catastrophic failure on one of these platforms from going unnoticed and allowing an update of Trilinos to be pulled into an ATDM APP, for example.
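A sketch of that rule, again assuming dict fields for the failing tests: failures in listed builds get set aside into their own email section, but count toward global pass/fail once a build exceeds the threshold:

```python
MAX_UNSTABLE_FAILURES_PER_BUILD = 4  # threshold suggested above

def filter_unstable_build_failures(failing_tests, unstable_builds):
    # unstable_builds is a list of (build-name, site-name) pairs.
    unstable_keys = set(tuple(ub) for ub in unstable_builds)
    counted, set_aside, per_build = [], [], {}
    for ft in failing_tests:
        key = (ft['buildname'], ft['site'])
        if key in unstable_keys:
            set_aside.append(ft)
            per_build[key] = per_build.get(key, 0) + 1
        else:
            counted.append(ft)
    # If an unstable build blows past the threshold, its failures count after all:
    for ft in set_aside:
        if per_build[(ft['buildname'], ft['site'])] > MAX_UNSTABLE_FAILURES_PER_BUILD:
            counted.append(ft)
    return counted, set_aside
```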
Tasks:
- Get an initial script working that keeps track of failing tests with existing GitHub issue trackers, can detect new failing tests that need to be triaged, and has basic unit tests in place (see the "TODO.txt" file in the 'atdm-email' branch of the 'TrilinosATDMStatus' repo and the 'atdm-email' branch of TriBITS) ... PROGRESS ...
- Set up a mailman list and a Jenkins job to run the script and post emails to the mailman list (and we can sign up for the mail list). (The mail list will also provide an archive of past results.) (There should be a different mail list for different types of results; e.g. one for the main "Promoted ATDM Trilinos Builds", a different one for "Specialized ATDM Trilinos Builds", etc.)
- Create documentation about the script somewhere and put in links to this documentation in the generated HTML emails somehow.
- Flesh out the script to cover all of the types of failures we need to keep track of.
- ???