Tpetra: Improve thread scalability of CrsGraph::fillComplete
Created by: mhoemmen
@trilinos/tpetra
Epic: #796
Tasks (incomplete list):
- #799: Thread-parallelize CrsGraph pack and unpack (used in globalAssemble)
- #658: Thread-parallelize makeColMap
- #834 (closed) (#660 optional in FY17): Thread-parallelize sort and merge of column indices
- #835 (closed): Thread-parallelize computeGlobalConstants
Improve thread scalability of Tpetra::CrsGraph::fillComplete.
This should always refer to the first fillComplete, since that's the only time CrsGraph::fillComplete should need to do anything. (CrsGraph does not allow structure changes after the first fillComplete.)
Here's what CrsGraph::fillComplete does:
1. Get the "Sort column Map ghost GIDs" bool parameter (defaults to true)
2. Get the "No Nonlocal Changes" bool parameter (defaults to false)
3. Allocate indices, if they have not already been allocated (they have been if the graph is nonempty on the calling process)
4. Do global assembly if needed (see (2))
5. Set the domain and range Maps (just RCP assignment)
6. If the graph does not already have a column Map, make one (by calling makeColMap)
7. If column indices are stored as global indices, convert them to local
8. Sort the column indices in each row
9. Merge duplicate column indices in each row (inserts do not merge by default; this is different from Epetra)
10. Make the Import and/or Export objects, if needed
11. "Compute local and global constants" (e.g., scan the graph to check whether it is locally upper or lower triangular; do an all-reduce to compute the global number of entries, etc.)
12. "Fill local graph": convert the graph's current local data structure to an optimized local data structure
Notes:
(1) has consequences for thread parallelization. (See discussion of (6) and (10) below.) Its default value of true gives makeColMap the freedom to change the original order of the GIDs. If false, makeColMap maintains a separate array of GIDs in order to remember the original order. It would be hard to thread-parallelize this step without either an atomic push_back or a two-pass (count, then fill) approach.
(2) means that by default, fillComplete must do an extra all-reduce in order to tell whether any process has inserted into nonowned rows. It should only need to do this all-reduce at the first fillComplete, but see #833.
(3) is part of Crs{Graph,Matrix}'s lazy allocation strategy. This makes these objects use constant storage per process if the graph / matrix is empty. However, lazy allocation hinders thread parallelization of structure changes through Tpetra's interface. It's something that we will need to address at some point.
(4): CrsMatrix::globalAssemble is expensive compared with the two-matrix Export approach. However, CrsGraph does not (currently) allow structure changes after first fillComplete, so this is less of a concern. Note that the two-graph Export approach depends on #799 for thread parallelization.
(6) and (10) are opportunities for thread parallelization. There is an outstanding issue, dating back to Bugzilla, to merge makeColMap and makeImportExport. Merging them would reduce MPI communication, by not throwing away information (which process owns which column indices in the column Map) that may have required MPI communication to get. It would involve a new Import constructor that uses this process ownership information to bypass some of the normal Import setup. Merging makeColMap and makeImportExport would thus let us get away with not thread-parallelizing the normal Import constructor.
(7): If the CrsGraph is StaticProfile (argument to its constructor, governing pre-first-fillComplete storage format), makeIndicesLocal is already thread parallel. However, if the CrsGraph is DynamicProfile, this is not thread parallel. Fixing this would incur some sequential overhead or require a data structure change, due to current use of Teuchos memory management classes, which are not thread safe (see #229). It's more efficient anyway to fill a StaticProfile CrsGraph, so we recommend that applications take this approach. See Tpetra Lesson 04 on different fill strategies.
(8), (9): Inserting into a CrsGraph does not currently merge entries in the same row with the same column index. That's why sort and merge are separate steps. Regardless, a one-level thread parallelization of sort and merge is not hard. For each row, we would call a sequential sort. GPU-izing or vectorizing this is much harder; it calls for a segmented sort (and merge). See #662.
(11) reads all the graph's entries. It is not yet thread parallel. Making it thread parallel may call for doing (12) (fillLocalGraph) before (11). This would ensure that the graph's local data structure is a Kokkos::StaticCrsGraph, and thus safe and optimized for parallel kernels. We could also merge some of the local work in (11) with (12). This could mitigate some of the costs of the DynamicProfile or unpacked StaticProfile cases in fillLocalGraph.
(12): Copy data structure from DynamicProfile ("2-D," array of arrays) or unpacked StaticProfile (compressed sparse row, with extra space in some rows), to packed StaticProfile (compressed sparse row, with no extra space in any row). The StaticProfile case (unpacked to packed) is already thread parallel. Above notes on (7) explain why the DynamicProfile case would take work to make thread parallel. Best practice for applications would be to use StaticProfile graphs only.