MueLu uncoupled aggregation not threaded
Created by: pwxy
Trilinos was built with the Kokkos OpenMP node type. I looked at thread scaling on a single KNL, running with 1 MPI process and increasing the number of threads from 1 to 64 (so one thread per core).
Although the matrix-matrix multiplication (KokkosKernels SpGEMM) thread scales well, the uncoupled aggregation does not thread scale at all. With 1 thread, uncoupled aggregation accounts for about 1/3 of the MueLu setup time, and 40% of the uncoupled aggregation time (so about 1/8 of MueLu setup time) is due to the use of std::sort, which is not threaded (line 916 or 999, or both, of muelu/src/Graph/MueLu_CoalesceDropFactory_def.hpp). As the number of threads increases, and because RAP thread scales well, std::sort becomes a larger fraction of the setup time. At 64 threads, matrix-matrix multiplication is only ~7% of MueLu setup time and uncoupled aggregation is ~90%.
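To illustrate one way the sort could be threaded, here is a minimal sketch. It assumes the std::sort calls operate on independent per-row ranges of a CSR-style graph, which I have not verified against the lines cited above. The function name and the rowPtr/colInd layout are hypothetical and are not MueLu code. Each row is still sorted with serial std::sort; only the loop over rows is split across OpenMP threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, NOT part of MueLu: sorts the column indices of every
// row of a CSR-style graph. Assumes rows are independent, so the loop over
// rows can be divided among OpenMP threads.
// Build with -fopenmp to enable threading; without it the pragma is ignored
// and the loop runs serially with the same result.
void sortRowsInParallel(const std::vector<std::size_t>& rowPtr,
                        std::vector<int>& colInd) {
  // rowPtr has numRows + 1 entries and its last entry is the entry count.
  assert(!rowPtr.empty() && rowPtr.back() == colInd.size());
  const long numRows = static_cast<long>(rowPtr.size()) - 1;

  // Dynamic scheduling helps when row lengths vary a lot.
#pragma omp parallel for schedule(dynamic, 64)
  for (long row = 0; row < numRows; ++row) {
    const auto first = static_cast<std::ptrdiff_t>(rowPtr[row]);
    const auto last = static_cast<std::ptrdiff_t>(rowPtr[row + 1]);
    std::sort(colInd.begin() + first, colInd.begin() + last);
  }
}
```

For example, with rowPtr = {0, 3, 3, 6} and colInd = {5, 1, 3, 9, 7, 8}, rows 0 and 2 are sorted independently and the empty row 1 is skipped. If the sort in question is instead one large sort over a single array, this approach does not apply and a parallel sort algorithm would be needed.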
times in seconds
threads  MLsetup  MLrap  MLagg  MLcoal  MLagg1  MLagg2a  MLagg2b
      1    259.4  174.9   78.2    58.2     8.0      3.8      6.5
      2    173.0   89.9   78.1    58.1     8.0      3.8      6.5
      4    130.1   47.4   78.2    58.2     8.0      3.8      6.5
      8    107.6   25.5   78.2    58.2     8.0      3.8      6.5
     16     96.4   14.4   78.2    58.2     8.0      3.8      6.5
     32     90.8    9.0   78.2    58.2     8.0      3.8      6.5
     64     88.0    6.2   78.2    58.2     8.0      3.8      6.5
legend:
MLsetup = "MueLu Hierarchy: Setup (total)"
MLrap = "MueLu RAPFactory: Computing Ac" (not "total")
MLagg = "UncoupledAggregationFactory: Build (total)"
MLcoal = "MueLu CoalesceDropFactory: Build" (not "total")
MLagg1 = "MueLu AggregationPhase1Algorithm: BuildAggregates"
MLagg2a = "MueLu AggregationPhase2aAlgorithm: BuildAggregates"
MLagg2b = "MueLu AggregationPhase2bAlgorithm: BuildAggregates"
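The arithmetic behind the fractions quoted above can be checked with a small Amdahl-style estimate. This is only a sketch: the struct and function names are made up for illustration, and the model assumes that RAP scales ideally with thread count while everything else in setup stays at its 1-thread time.

```cpp
#include <cassert>

// 1-thread timings in seconds, taken from the table above.
// Hypothetical type for illustration only.
struct SetupTimes {
  double setup;  // MLsetup
  double rap;    // MLrap
  double agg;    // MLagg
};

// Fraction of a total time spent in one phase.
double fraction(double part, double total) {
  assert(total > 0.0);
  return part / total;
}

// Amdahl-style projection: setup time at `threads` threads, assuming only
// RAP speeds up (ideally) and the rest of setup does not scale at all.
double projectedSetup(const SetupTimes& t1, int threads) {
  assert(threads > 0);
  return (t1.setup - t1.rap) + t1.rap / threads;
}
```

With the 1-thread row {259.4, 174.9, 78.2}, projectedSetup at 64 threads gives about 87.2 s, close to the measured 88.0 s, and fraction(78.2, 88.0) is about 0.89. Under this model no thread count brings setup below 259.4 - 174.9 = 84.5 s unless aggregation itself is threaded.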
What would be the schedule for threading uncoupled aggregation?