Add new labels to manage ATDM Trilinos GitHub Issues
Created by: bartlettroscoe
CC List:
- @trilinos/framework
- @maherou (Trilinos overall Lead)
- @jwillenbring (Trilinos Framework Product Area Lead)
- @kddevin (Trilinos Data Services Product Area Lead)
- @srajama1 (Trilinos Linear Solvers Product Area Lead)
- @rppawlo (Trilinos Nonlinear Solvers Product Area Lead)
- @mperego (Trilinos Discretizations Product Area Lead)
- @fryeguy52 (ATDM Tools Member)
Description
This Issue is to document the need for and to add some new labels to improve the management of ATDM Trilinos GitHub issues as part of the ATDM Trilinos triaging and issue management process currently documented at:
There are a few problems with that process as currently documented:
- Problem-1: There is currently no way to quickly identify how serious a failure on the ATDM Trilinos CDash dashboard is w.r.t. the ATDM APPs. Is the failure really bad for the APP, or does it not impact that APP's usage of Trilinos at all? (We can say it in words in the GitHub Issue text, but that is hard to query and hard to find in the issue text.)
- Problem-2: There is currently no convenient way to find all of the ATDM Trilinos GitHub issues that belong to a given Trilinos Product Area (which each Trilinos Product Area Lead has the responsibility to oversee). (One can search based on the package labels, but GitHub does not provide a way to OR a bunch of labels in an Issue query, and such a query would be very long and hard to maintain even if GitHub supported this.)
- Problem-3: Some Trilinos failures on ATDM platforms are due to problems with the env and not really Trilinos code or test issues. It would be useful to segregate the issues that are likely Trilinos code/test bugs from those that are caused by defects in the env outside of Trilinos.
Proposed Solution
For Problem-1, the proposed solution is to define three new Trilinos GitHub labels that represent the severity of the problem as it relates to the ATDM APPs' usage of Trilinos:
- ATDM Critical: Problems that critically damage the ability to even run the ATDM Trilinos builds.
  - Examples:
    - A library build failure on an important platform (e.g. CUDA on 'waterman') that takes out hundreds of tests.
    - A runtime defect in an upstream package that takes out hundreds of tests.
  - Related practices:
    - The issue is fixed with STOP THE LINE urgency:
      - The problem is fixed ASAP, or
      - The PR that introduced the bug is reverted
    - ATDM APPs will not get an update of Trilinos until this problem is resolved
    - Failures in any of the associated tests result in a global FAILED status for the ATDM Trilinos status CDash analysis tool (see #2933)
- ATDM Blocker: Problems that make Trilinos unfit to be adopted by one or more ATDM APPs but do not seriously damage the ability to run automated builds.
  - Examples:
    - A runtime issue that breaks the functioning of an important Trilinos capability used by an ATDM APP but only impacts a small number of Trilinos tests
  - Related practices:
    - The issue does not need to be resolved with STOP THE LINE urgency but needs to be fixed before the next Trilinos update for the affected APP. (For example, a problem with Phalanx might block EMPIRE from getting an updated version of Trilinos but should not block SPARC from getting an updated version of Trilinos.)
    - The impacted APP should not get an updated version of Trilinos until the issue is resolved
    - Failures in any of the associated tests result in a global FAILED status for the ATDM Trilinos status CDash analysis tool (see #2933)
- ATDM Nonblocker: Problems with Trilinos that don't impact ATDM APPs.
  - Examples:
    - Failures in some tests for solvers that are not used by any of the ATDM APPs
    - A defect in a Trilinos test on a small number of platforms that has been confirmed not to be a bug in Trilinos library code
  - Related practices:
    - The associated tests are marked as okay_to_fail=1 in the tests with issue tracker files for the ATDM Trilinos CDash analysis tool (see #2933), or
    - The associated code and/or tests are disabled in ATDM Trilinos builds going forward
    - Failures in any of the non-disabled associated tests will NOT result in a global FAILED status for the ATDM Trilinos status CDash analysis tool (see #2933). (That is, the tests will be allowed to fail, but we will still be able to track whether they pass or fail.)
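The related practices for the three severity labels amount to a small decision rule. The following is a minimal Python sketch of that rule, not the actual CDash analysis tool from #2933. The names `TrackedIssue`, `global_status`, and `apps_blocked_from_update` are hypothetical, and the sketch assumes that every issue passed in is still open and that an issue only affects the global status while it has failing tests.

```python
from dataclasses import dataclass

# Label names as proposed in this Issue.
CRITICAL = "ATDM Critical"
BLOCKER = "ATDM Blocker"
NONBLOCKER = "ATDM Nonblocker"


@dataclass
class TrackedIssue:
    """Hypothetical record for one open ATDM Trilinos GitHub issue."""
    number: int                            # GitHub issue number
    severity: str                          # one of the three labels above
    affected_apps: frozenset = frozenset() # APPs impacted (used for Blocker)
    failing_tests: tuple = ()              # currently failing associated tests


def global_status(issues):
    """Return 'FAILED' if any Critical or Blocker issue has failing tests.

    Tests tied to Nonblocker issues are treated as okay_to_fail=1, so they
    are still tracked but never flip the global status.
    """
    for issue in issues:
        if issue.severity in (CRITICAL, BLOCKER) and issue.failing_tests:
            return "FAILED"
    return "PASSED"


def apps_blocked_from_update(issues, all_apps):
    """Return the set of APPs that should not get a Trilinos update."""
    blocked = set()
    for issue in issues:
        if issue.severity == CRITICAL:
            # A Critical issue holds back the update for every APP.
            return set(all_apps)
        if issue.severity == BLOCKER:
            # A Blocker issue holds back only the APPs it affects.
            blocked |= set(issue.affected_apps)
    return blocked
```

For instance, with the Phalanx example above modeled as a `TrackedIssue` with severity `BLOCKER` and `affected_apps=frozenset({"EMPIRE"})`, `apps_blocked_from_update` would hold back EMPIRE but not SPARC.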
For Problem-2 the proposed labels for the Trilinos Product Areas are:
- PA: Framework: All GitHub issues that fall under the Trilinos Framework Product Area
- PA: Data Services: All GitHub issues that fall under the Trilinos Data Services Product Area
- PA: Linear Solvers: All GitHub issues that fall under the Trilinos Linear Solvers Product Area
- PA: Nonlinear Solvers: All GitHub issues that fall under the Trilinos Nonlinear Solvers Product Area
- PA: Discretizations: All GitHub issues that fall under the Trilinos Discretizations Product Area
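With one label per Product Area, the query a Product Area Lead needs becomes short, because multiple `label:` qualifiers in a GitHub issue search are combined with AND. A minimal Python sketch that builds such a search string is shown below; the helper names `label_qualifier` and `issue_query` are hypothetical and are only meant to illustrate the query shape.

```python
def label_qualifier(label):
    """Format one label for a GitHub issue search.

    Labels containing spaces or colons must be wrapped in double quotes.
    """
    return f'label:"{label}"'


def issue_query(labels, state="open"):
    """Build a search string matching issues that carry ALL given labels."""
    parts = ["is:issue", f"is:{state}"]
    parts += [label_qualifier(label) for label in labels]
    return " ".join(parts)


# Example: open Linear Solvers issues that are also ATDM Critical.
query = issue_query(["PA: Linear Solvers", "ATDM Critical"])
print(query)
# prints: is:issue is:open label:"PA: Linear Solvers" label:"ATDM Critical"
```

The resulting string can be pasted into the issue search box of the repository.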
For Problem-3, there is the label:
- ATDM Env Issue: Problem mostly caused by the env and not a defect in Trilinos code or tests.
  - Examples:
    - A bug in a LAPACK function or in calling that function (e.g. #2410, #3497 (closed))
    - A bug in MPI (e.g. #3331 (closed), #3290 (closed))
  - Related practices:
    - Provide separate statistics for ATDM Trilinos issues based on those with the label ATDM Env Issue and those without it (i.e. those that are actual Trilinos code or test issues)
    - Would not generally block the update of Trilinos for an ATDM APP unless changes in Trilinos exposed the env problem but the old Trilinos version avoided it. (In that latter case, ATDM APPs should not get an update of Trilinos, since the current version of Trilinos they are using does not trigger this problem with the env.)
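The separate statistics mentioned above could be computed by partitioning issues on the presence of the ATDM Env Issue label. The following Python sketch assumes the issues have already been fetched and are represented as `(issue_number, labels)` pairs; the function name `split_env_stats` and that input shape are hypothetical.

```python
ENV_LABEL = "ATDM Env Issue"


def split_env_stats(issues):
    """Count issues with and without the env label.

    `issues` is an iterable of (issue_number, labels) pairs, where
    `labels` is a collection of label-name strings.
    """
    counts = {"env": 0, "trilinos": 0}
    for _number, labels in issues:
        # Issues without the env label are treated as actual
        # Trilinos code or test issues.
        key = "env" if ENV_LABEL in labels else "trilinos"
        counts[key] += 1
    return counts
```

Given, for example, #2410 and #3331 labeled with ATDM Env Issue and one hypothetical issue without it, the function would return `{"env": 2, "trilinos": 1}`.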