Set up robust portable pre-push and post-push CI tools and process based on the SEMS Dev Env
Created by: bartlettroscoe
Next Action Status:
New CI build is pushed to 'develop', new post-push CI server is running, and new checkin-test-sems.sh script is ready for more testing and review ... Not going to pursue other extensions (e.g. Mac OS X, tcsh, etc.). See https://github.com/trilinos/Trilinos/issues/482#issuecomment-266124179. Next: Leave in review until 1/1/2017, then close.
Blocked By: #158 (closed), #410 (closed), #362 (closed)
Blocking: #380
Related To: #370 (closed), #475 (closed), #476 (closed)
CC: @trilinos/framework
Description:
Trilinos has not had an effective pre-push CI development process for many years. When the checkin-test.py script was first created (back in 2008 or so), the primary stack of packages was based on Epetra and the main external dependencies were C/C++/Fortran compilers and BLAS and LAPACK. Those dependencies and the major Trilinos customers at the time were used to select the initial set of Primary Tested (initially called Primary Stable) packages that is being used to this day. However, since that time, many new Trilinos packages have been added and important Trilinos customers are relying on many of these newer packages (e.g. SEACAS, STK, Tpetra, Phalanx, Panzer, etc.). In addition, these new Trilinos packages require more dependencies than just BLAS and LAPACK and now TPLs like Boost, HDF5, NetCDF, ParMETIS, SuperLU and others used by Trilinos are also very important to many Trilinos customers.
Another problem with the current pre-push CI testing process for Trilinos is that Trilinos developers use a variety of machines, OSs, compiler versions, TPL implementations, etc. to develop and push changes to Trilinos. As a result, people who tried to use the checkin-test.py script have suffered failed pushes due to tests failing on their machine that were not triggered by their changes. In contrast, projects that have a uniform pre-push CI testing env don't experience these types of problems. One example of such a project is CASL VERA, which uses TriBITS and the checkin-test.py script and has a set of uniform development machines, where developers almost never see tests fail in their build of the code that passed in another developer's build. Therefore, the only failed builds and tests are due to their own local changes. In that project, there is no trepidation about running the checkin-test.py script and everyone uses it uniformly for nearly every push.
Another problem with the current CI testing process for Trilinos is that the post-push CI server that posts to CDash enables a different set of packages and TPLs from what the pre-push CI build does (and of course uses different compilers, MPI, etc.). Therefore, a CI build/test failure seen on CDash may not be seen with the checkin-test.py script locally and vice versa. This makes it difficult for developers to determine whether the failures they are seeing on their own machine are due to their local changes, to differences between the env on their machine and the env on the machine running the CI build posting to CDash, to a different set of enabled packages and TPLs, or to something else.
As a result, the stability of the main Trilinos development branch (now the 'develop' branch, see #370 (closed)) has degraded from what it was 5+ years ago. This is a problem because Trilinos needs to have a more stable 'develop' branch in order to more frequently update from the 'develop' branch to the 'master' branch (see #370 (closed)).
This story is to address all of these shortcomings of the current Trilinos CI testing process. The new SEMS Dev Env (#158 (closed)) provides an opportunity to create a fairly portable (at least for SNL staff members) uniform pre-push and post-push CI testing environment for the first time.
Here is the plan for setting up a more effective CI process based on the SEMS Dev Env, the checkin-test.py script, and CTest/CDash:
- Select a standard pre-push CI build env based on the SEMS Dev Env: Currently, GCC 4.7.2 and OpenMPI 1.6.5 are being used for the post-push CI build that posts to CDash. These selections should be reexamined and potentially changed. This will be used to create a standard load_ci_sems_dev_env.sh script, which just calls the local_sems_dev_env.sh script with the selections.
- Select an expanded/revised set of Primary Tested (PT) packages and TPLs: This revised set should be based on the most important packages and TPLs to current Trilinos customers. Any important TPL not already supported by the SEMS Dev Env may need to be added (i.e. to the Trilinos space under the /projects/ NFS mount). Revising the set of PT packages and TPLs is being addressed in #410 (closed).
- Set up a standard checkin-test-sems.sh script that all Trilinos developers can use to push changes to the Trilinos 'develop' branch: This should automatically load the correct standard SEMS Dev Env by sourcing load_ci_sems_dev_env.sh. This should likely only run a single build of Trilinos to speed up the testing/push process. (If there is a single build, it would likely include -DTPL_ENABLE_MPI=ON -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON -DBUILD_SHARED_LIBS=ON -DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON -DTrilinos_ENABLE_FLOAT=OFF -DTrilinos_ENABLE_COMPLEX=OFF. See #362 (closed) about turning off float and complex by default.)
- Change the main post-push CI server that posts to CDash to use the exact same build as the default builds for the checkin-test-sems.sh script: This is needed to catch violations of the additive test assumption of branches. This can also be used to alert Trilinos developers when there are failures in the standard CI build or to verify that failures they are seeing are not their doing. If other post-push CI builds are desired, like non-MPI serial and full release builds, then those can be added as extra CI builds (we just need extra machines for that).
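As a rough illustration of how the two scripts in the plan could fit together, here is a minimal shell sketch. It is not the actual Trilinos implementation. The function names (load_ci_env, ci_cmake_options, ci_checkin_command), the module-style spelling of the compiler and MPI selections, and the idea that the generic env loader takes its selections as positional arguments are all assumptions made for the example. The CMake options are the ones listed in the plan, written with the standard spellings CMAKE_BUILD_TYPE and Trilinos_ENABLE_DEBUG. The --extra-cmake-options flag is recalled from the checkin-test.py option list and should be confirmed with checkin-test.py --help.

```shell
#!/bin/sh
# Sketch only: names and argument conventions here are assumptions,
# not the real load_ci_sems_dev_env.sh / checkin-test-sems.sh.

# Selections pinned for the CI build (versions quoted in the plan, which
# also says they may be changed). The "name/version" spelling is hypothetical.
CI_COMPILER="gcc/4.7.2"
CI_MPI="openmpi/1.6.5"

# Roughly what load_ci_sems_dev_env.sh might do: delegate to a generic env
# loader with the pinned selections. The loader path is a parameter so the
# sketch can be exercised on a machine without the SEMS Dev Env.
load_ci_env() {
  loader="$1"
  if [ ! -r "$loader" ]; then
    echo "error: cannot read env loader '$loader'" >&2
    return 1
  fi
  # Assumption: the loader reads its selections from $1 and $2.
  set -- "$CI_COMPILER" "$CI_MPI"
  # Sourced (not executed) so the env changes persist in the caller.
  . "$loader"
}

# CMake cache options for the single default CI build, one per line.
ci_cmake_options() {
  printf '%s\n' \
    "-DTPL_ENABLE_MPI=ON" \
    "-DCMAKE_BUILD_TYPE=RELEASE" \
    "-DTrilinos_ENABLE_DEBUG=ON" \
    "-DBUILD_SHARED_LIBS=ON" \
    "-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON" \
    "-DTrilinos_ENABLE_FLOAT=OFF" \
    "-DTrilinos_ENABLE_COMPLEX=OFF"
}

# Compose (but do not run) the checkin-test.py command line, appending any
# extra arguments given by the developer, e.g. --local-do-all.
ci_checkin_command() {
  opts="$(ci_cmake_options | tr '\n' ' ')"
  opts="${opts% }"   # drop the trailing space left by tr
  printf '%s\n' "checkin-test.py --extra-cmake-options=\"$opts\" $*"
}
```

Because the post-push CI server and the pre-push script would both draw on the same option list and the same env loader, a failure seen on CDash should be reproducible locally. For example, `ci_checkin_command --local-do-all` prints the command a developer would run after `load_ci_env` has set up the environment.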
After this Story is complete, we can create new Stories to get Trilinos developers to use the checkin-test-sems.sh script and to commit to keeping the CI build(s) 100% passing all the time, with "Stop the Line" urgency to fix any failures.
Definition of Done:
- An initial implementation for load_ci_sems_dev_env.sh and checkin-test-sems.sh that provides a viable CI build based on the SEMS Dev Env.
- Documentation for load_ci_sems_dev_env.sh and checkin-test-sems.sh has been written and has been reviewed by a few Trilinos developers.
- The post-push CI build of Trilinos uses the same load_ci_sems_dev_env.sh env and the same default build(s) as defined in the checkin-test-sems.sh script.
- Review the setup and documentation for the checkin-test.py script itself to determine what improvements might help with usability and adoption.
Decisions that need to be made:
- What default timeout should be selected for pre-push tests (e.g. 3 minutes)?
- Should there just be an MPI_RELEASE_DEBUG_SHARED default build or also some serial build listed in --default-builds?
- What version of GCC and OpenMPI should be used?
- What other TPLs really should be added beyond what is provided in the SEMS Dev Env (e.g. a 64-bit build of Scotch without pthreads enabled, see this comment in #476)?
- ???
Tasks:
- Create drafts for load_ci_sems_dev_env.sh and checkin-test-sems.sh [Done]
- Discuss this Story at a Trilinos Leaders Meeting [Done]
- Work #410 (closed) to select the updated set of PT packages and TPLs [Done]
- Work #362 (closed) to disable float and complex by default [Done]
- Select the new set (or just one) of --default-builds for the checkin-test.py and therefore the checkin-test-sems.sh script [Done]
- Make updates to Trilinos and checkin-test.py script on branch better-ci-build-482 ... IN PROGRESS ...
- Get proposed changes reviewed (quickly) [Done]
- Create wiki documentation for usage of checkin-test-sems.sh [Done]
- Commit changes to 'develop' branch [Done]
- Create a new post-push CI build on crf450 that uses the identical CI build as checkin-test-sems.sh --local-do-all [Done]
- Set up cron job or Jenkins job to run the build [Done]
- Run the CI build for several days and have people review it [Done]
- Have updated CI process and documentation reviewed ... In Progress ...
- Update the existing Jenkins CI build to use the new CI build and then remove the CI build on crf450 ...