Define better strategy for managing threaded testing with Trilinos
Created by: bartlettroscoe
CC: @trilinos/framework, @trilinos/kokkos, @trilinos/tpetra, @nmhamster, @rppawlo
Next Action Status
Set up a meeting to discuss the current status of threaded testing in Trilinos and some steps to try to address the issues ...
Description
It seems the testing strategy in Trilinos for threading is to build threaded code, fix the number of threads for every test (e.g., `export OMP_NUM_THREADS=2` when using OpenMP), and then run the test suite with `ctest -j<N>`. But this approach, and testing with threads enabled in general, has several issues.
First, with some configurations and systems, running with any `<N>` in `ctest -j<N>` will result in all of the test executables binding to the same threads on the same cores, making things run very slowly, as described in https://github.com/trilinos/Trilinos/issues/2398#issuecomment-374379614. A similar problem occurs with CUDA builds, where the concurrently running test processes do not spread their load across the available GPUs (see https://github.com/trilinos/Trilinos/issues/2446#issuecomment-375804502).
Second, even when one does not hit the above core-binding problem (which is not always a problem), this approach does not make very good use of the test machine because it assumes that every MPI process is multi-threaded with Kokkos, which is not true. Even when `OMP_NUM_THREADS` > 1, many Trilinos tests have no threaded code at all, so even if ctest allocates room for 2 threads per MPI process, only one thread will be used. The result is that many cores sit idle and the tests take longer to complete.
The impact of the two problems above is that many developers and many automated builds have to run with a small `ctest -j<N>` (e.g., `ctest -j8` is used in many of the ATDM Trilinos builds) and therefore leave many of the available cores unused. This significantly increases the time to run the full test suite. That hurts developer productivity (developers have to wait longer to get feedback from running tests locally), and it wastes existing testing hardware and/or limits the number of builds and tests that can be run in a given testing day (which reduces the number of defects we can catch and therefore costs Trilinos developers and users time and $$).
Third, having to run the entire Trilinos test suite with a fixed number of threads, such as `export OMP_NUM_THREADS=2` or `export OMP_NUM_THREADS=4`, does not result in very good testing, or it results in very expensive testing because the entire suite must be run multiple times. It has been observed that some defects only show up at particular thread counts, such as `export OMP_NUM_THREADS=5`, for example. A fixed thread count would be like running every MPI test in Trilinos with exactly the same number of MPI processes, which would not result in very good testing (and is not the case in Trilinos, as several tests are run with different numbers of MPI processes).
Ideas for Possible Solutions
First, ctest needs to be extended so that it can be informed of the architecture of the system where it will be running tests. CTest needs to know the number of nodes, the number of sockets per node, the number of cores per socket, and the number of threads per core. We will also need to inform CTest about the number of MPI ranks vs. threads per MPI rank for each test (i.e., add a `THREADS_PER_PROCESS` property in addition to the existing `PROCESSORS` property). With that type of information, ctest should be able to determine the binding of the different ranks in an MPI job that runs a test to specific cores on specific sockets on specific nodes. We will also need a way to communicate this information to the MPI jobs when they are run by ctest. I think this means adding the types of process affinity and process placement controls that modern MPI implementations provide (see https://github.com/open-mpi/ompi/wiki/ProcessAffinity and https://github.com/open-mpi/ompi/wiki/ProcessPlacement). See this Kitware backlog item. A per-test sketch is given below.
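For concreteness, here is a minimal sketch of what the per-test side of this might look like. `PROCESSORS` is an existing CTest test property, and `MPIEXEC`/`MPIEXEC_NUMPROC_FLAG` come from CMake's FindMPI module, but `THREADS_PER_PROCESS` is only the proposed property (ctest would ignore it today), and the test name and executable are made up:

```cmake
# Sketch only: THREADS_PER_PROCESS is a *proposed* CTest property that ctest
# ignores today; the test name and executable are hypothetical.
add_test(NAME MyPackage_MyThreadedTest_MPI_4
  COMMAND ${MPIEXEC} ${MPIEXEC_NUMPROC_FLAG} 4 $<TARGET_FILE:MyThreadedTest>)
set_tests_properties(MyPackage_MyThreadedTest_MPI_4 PROPERTIES
  PROCESSORS 8            # existing property: 4 ranks x 2 threads = 8 cores
  THREADS_PER_PROCESS 2   # proposed property: threads per MPI rank
  )
```

With `PROCESSORS` set, `ctest -j<N>` already counts this test as occupying 8 of the `<N>` slots; the proposed extension would additionally let ctest compute an affinity/placement for the 4 ranks with 2 threads each.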
Second, we should investigate how to add a `NUM_THREADS_PER_PROC <numThreadsPerProc>` argument to the TRIBITS_ADD_TEST() and TRIBITS_ADD_ADVANCED_TEST() commands, as sketched below. It would be good if this could be added directly to these TriBITS functions, together with some type of "plugin" system that allows us to define how the number of threads gets set when running each individual test. The default TriBITS implementation could simply compute `NUM_TOTAL_CORES_USED <numTotalCoresUsed>` as `<numThreadsPerProc> * <numMpiProcs>`.
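A hypothetical usage sketch, assuming the proposed `NUM_THREADS_PER_PROC` argument were added (it does not exist in TriBITS today, and the test name is made up):

```cmake
# Hypothetical: NUM_THREADS_PER_PROC is the *proposed* argument, not a
# current TriBITS option.
TRIBITS_ADD_TEST(
  MyThreadedTest
  COMM mpi
  NUM_MPI_PROCS 4
  NUM_THREADS_PER_PROC 2  # default TriBITS implementation would compute
  )                       # <numTotalCoresUsed> = 2 * 4 = 8
```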
The specialization of this new TriBITS functionality for Kokkos/Trilinos would set the number of requested threads based on the threading model known to be enabled at configure time: for OpenMP, it would set the environment variable `OMP_NUM_THREADS=<numThreads>`; for other threading models, it would pass `--kokkos-threads=<numThreads>` on the command line. If the `NUM_THREADS_PER_PROC <numThreadsPerProc>` argument were missing, a default number of threads could be used (e.g., a global configure argument `Trilinos_DEFAULT_NUM_THREADS` with default `1`). If the computed `<numTotalCoresUsed>` were larger than `${MPI_EXEC_MAX_NUMPROCS}` (which should be set to the maximum number of threads that can be run on that machine when using threading), then the test would get excluded and a message would be printed to STDOUT. Such a CMake function should be pretty easy to write if the threading model used by Kokkos is known at configure time; a rough sketch follows.
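Below is a rough sketch of what such a function might look like, not actual TriBITS code. It assumes the test has already been added in the current directory, that `Trilinos_DEFAULT_NUM_THREADS` and `MPI_EXEC_MAX_NUMPROCS` are set at configure time, and that `Kokkos_ENABLE_OPENMP` indicates the OpenMP backend; the function name itself is made up:

```cmake
# Sketch only: hypothetical function name and logic, not actual TriBITS code.
function(trilinos_set_test_threads TEST_NAME NUM_MPI_PROCS NUM_THREADS_PER_PROC)
  if (NOT NUM_THREADS_PER_PROC)
    # No NUM_THREADS_PER_PROC given: fall back to the global default (1).
    set(NUM_THREADS_PER_PROC ${Trilinos_DEFAULT_NUM_THREADS})
  endif()
  math(EXPR numTotalCoresUsed "${NUM_MPI_PROCS} * ${NUM_THREADS_PER_PROC}")
  if (numTotalCoresUsed GREATER MPI_EXEC_MAX_NUMPROCS)
    # Test needs more cores than the machine offers: exclude it and say why.
    message(STATUS "Excluding test ${TEST_NAME}: needs ${numTotalCoresUsed}"
      " cores > MPI_EXEC_MAX_NUMPROCS=${MPI_EXEC_MAX_NUMPROCS}")
    set_property(TEST ${TEST_NAME} PROPERTY DISABLED TRUE)
    return()
  endif()
  if (Kokkos_ENABLE_OPENMP)
    # OpenMP backend: set the env var for this one test only.
    set_property(TEST ${TEST_NAME} PROPERTY ENVIRONMENT
      "OMP_NUM_THREADS=${NUM_THREADS_PER_PROC}")
  endif()
  # For other threading models, --kokkos-threads=<numThreads> would instead
  # have to be appended to the test command where the test is defined.
  # Either way, tell ctest -j<N> how many cores this test will occupy.
  set_property(TEST ${TEST_NAME} PROPERTY PROCESSORS ${numTotalCoresUsed})
endfunction()
```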
Definition of Done:
- ???
Tasks:
- ???