Trilinos issues
https://gitlab.osti.gov/jmwille/Trilinos/-/issues (issues posted by James Willenbring; last updated 2017-10-26T21:12:55Z)

Issue #1770: Tpetra: Create methods to test runtime sized scalar types
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1770 (label: Tpetra-backlog, updated 2017-10-26T21:12:55Z)
*Created by: tjfulle*
This issue is to discuss and implement methods to test runtime sized scalar types as used in @trilinos/stokhos, without having to build all the way through to Stokhos.
@trilinos/tpetra
@mhoemmen
@etphipp

Issue #1021: Tpetra: Matrix Market writeSparse assumes input row Map is one to one
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1021 (label: Tpetra-backlog, updated 2017-01-25T22:29:24Z)
*Created by: mhoemmen*
@trilinos/tpetra
Tpetra::MatrixMarket::writeSparse (and writeSparseFile) assumes that the input matrix's row Map is one to one (not overlapping). This is because it does an Import from the input matrix, to a matrix with a gathered row Map (all indices on Process 0).
If the gathered row Map is always one to one, we could fix this easily by using an Export instead of an Import.
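A minimal sketch of that Export-based gather, assuming the usual Tpetra Map/Export/CrsMatrix APIs; the helper name and the choice of CombineMode are illustrative, not from the issue.

```cpp
// Sketch: gather a matrix with a possibly overlapping row Map onto Process 0
// using an Export. Export goes from an overlapping (source) Map to a
// one-to-one (target) Map, which is the direction writeSparse needs here.
// Assumes index base 0; the CombineMode for shared rows (ADD vs. INSERT) is
// part of what the fix would have to decide.
#include <Tpetra_CrsMatrix.hpp>
#include <Tpetra_Export.hpp>
#include <Tpetra_Map.hpp>

using crs_matrix_type = Tpetra::CrsMatrix<>;
using map_type = Tpetra::Map<>;
using export_type = Tpetra::Export<>;

Teuchos::RCP<crs_matrix_type>
gatherToProcZero (const crs_matrix_type& A)
{
  auto comm = A.getRowMap ()->getComm ();
  const Tpetra::global_size_t numGlobalRows =
    A.getRowMap ()->getGlobalNumElements ();
  // The gathered row Map puts every global row on Process 0.
  const size_t numLocalRows =
    (comm->getRank () == 0) ? static_cast<size_t> (numGlobalRows) : size_t (0);
  auto gatherMap =
    Teuchos::rcp (new map_type (numGlobalRows, numLocalRows, 0, comm));

  export_type exporter (A.getRowMap (), gatherMap); // source -> target
  auto A_gathered =
    Teuchos::rcp (new crs_matrix_type (gatherMap,
                                       A.getGlobalMaxNumRowEntries ()));
  A_gathered->doExport (A, exporter, Tpetra::ADD);
  A_gathered->fillComplete (A.getDomainMap (), A.getRangeMap ());
  return A_gathered;
}
```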
@vbrunini reported this issue, via CC by @kddevin.

Issue #1006: Add gatherv wrapper to Teuchos
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1006 (label: Tpetra-backlog, updated 2017-10-26T20:52:36Z)
*Created by: mhoemmen*
@trilinos/teuchos @trilinos/tpetra
@dridzal requested a gatherv wrapper for either Teuchos or Tpetra. Tpetra has an implementation of this wrapper now, but it currently lives in an anonymous namespace. @dridzal would like to see it promoted to the public interface. See commit b5b1b8f09812500d5bcbd53e9b2a99ead10823de.
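For context, here is roughly what gatherv semantics look like in terms of raw MPI; this is a sketch of the desired behavior, not the Tpetra implementation referenced in the commit above, and the wrapper name and signature are illustrative.

```cpp
// Sketch of gatherv semantics using raw MPI. A Teuchos-level wrapper would
// take a Teuchos::Comm instead of an MPI_Comm and hide the displacement setup.
#include <mpi.h>
#include <vector>

// Gather variable-length double arrays from all processes onto 'root'.
std::vector<double>
gatherv (const std::vector<double>& sendBuf, const int root, MPI_Comm comm)
{
  int myRank = 0, numProcs = 0;
  MPI_Comm_rank (comm, &myRank);
  MPI_Comm_size (comm, &numProcs);

  // First gather the per-process counts, so the root can size its buffer.
  const int mySendCount = static_cast<int> (sendBuf.size ());
  std::vector<int> recvCounts (myRank == root ? numProcs : 0);
  MPI_Gather (&mySendCount, 1, MPI_INT,
              recvCounts.data (), 1, MPI_INT, root, comm);

  // Compute displacements on the root, then do the variable-length gather.
  std::vector<int> displs (myRank == root ? numProcs : 0);
  int totalCount = 0;
  for (size_t p = 0; p < displs.size (); ++p) {
    displs[p] = totalCount;
    totalCount += recvCounts[p];
  }
  std::vector<double> recvBuf (myRank == root ? totalCount : 0);
  MPI_Gatherv (sendBuf.data (), mySendCount, MPI_DOUBLE,
               recvBuf.data (), recvCounts.data (), displs.data (),
               MPI_DOUBLE, root, comm);
  return recvBuf; // nonempty only on the root
}
```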
It would be easy for just about any Trilinos developer without much Tpetra or Teuchos knowledge to promote Tpetra's current internal implementation to a public interface, and add a unit test. I'll be happy to review and accept pull requests.

Issue #964: Kokkos::StaticCrsGraph does not use memory space template parameter passed to KokkosSparse::CrsMatrix
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/964 (label: Tpetra-backlog, updated 2017-01-10T20:54:12Z)
*Created by: mndevec*
I am providing a device to CrsMatrix in the constructor. This differs from the default, where the device is either Kokkos::Cuda with host-pinned space, or Kokkos::OpenMP with Kokkos::HostSpace when HBM is enabled.
I was expecting CrsMatrix memory to be allocated in the memory space I provide. However, this holds only for the values view, while row pointers and entries are still allocated in the default memory space of the provided execution space.
It seems that StaticCrsGraph does not take the device as a template argument; instead it is given only the execution space. It then creates a default device, and as a result the memory allocations diverge between the values view and the row-pointer and entries views.
Shouldn't StaticCrsGraph take the device as a template argument instead of the execution space?
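To illustrate the difference (my own sketch, not from the issue): a Kokkos::View templated only on an execution space is allocated in that space's default memory space, while Kokkos::Device bundles the execution and memory spaces so all allocations agree.

```cpp
// Sketch: a View templated only on an execution space lands in that space's
// *default* memory space, whereas Kokkos::Device<Exec, Mem> carries both.
// The device_type choices below are illustrative.
#include <Kokkos_Core.hpp>

void allocation_example ()
{
  using exec_space = Kokkos::DefaultExecutionSpace;

  // Allocated in exec_space's default memory space (e.g., CudaSpace for Cuda),
  // regardless of which memory space the caller actually wanted.
  Kokkos::View<double*, exec_space> a ("a", 100);

  // Bundling execution and memory space keeps the two together, so views for
  // values, row offsets, and entries would all end up in the intended space.
#ifdef KOKKOS_ENABLE_CUDA
  using device_type = Kokkos::Device<Kokkos::Cuda, Kokkos::CudaHostPinnedSpace>;
#else
  using device_type =
    Kokkos::Device<Kokkos::DefaultHostExecutionSpace, Kokkos::HostSpace>;
#endif
  Kokkos::View<double*, device_type> b ("b", 100);
  (void) a; (void) b;
}
```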
@srajama1 @crtrott @mhoemmen
Issue #962: Tpetra: Avoid atomic ops for unpack if no duplicate LIDs
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/962 (label: Tpetra-backlog, updated 2019-02-17T23:45:14Z)
*Created by: mhoemmen*
@trilinos/tpetra
Epic: #796
This story matters to other epics besides #796.
Thread-parallel implementations of Tpetra::DistObject::unpackAndCombine(New) do not need to do atomic updates if the result LIDs have no duplicates (meaning that at most one thread will ever modify any one value of the destination DistObject at a time). "Result LIDs have no duplicates" is a property of the Import / Export object, so the Import / Export object should remember this at construction time for reuse.
This matters because atomic updates have a run-time cost, even if there is no contention between threads. Sparse matrix-vector multiply does an Import on the input MultiVector. In the common case, the result LIDs in this Import should have no duplicates. (This is because the column Map is constructed that way, if users let CrsGraph or CrsMatrix construct the column Map.) Thus, atomic updates add unnecessary run time to this important kernel.
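A minimal Kokkos sketch of the requested run-time dispatch (illustrative names, not Tpetra's actual unpack kernels); the hasDuplicateLIDs flag is the property the Import/Export object would remember at construction.

```cpp
// Sketch: skip atomic updates when the Import/Export guarantees that each
// destination LID appears at most once in the unpack list.
#include <Kokkos_Core.hpp>

using exec_space = Kokkos::DefaultExecutionSpace;

void unpackAndCombineAdd (const Kokkos::View<double*, exec_space>& dst,
                          const Kokkos::View<const double*, exec_space>& imports,
                          const Kokkos::View<const int*, exec_space>& importLIDs,
                          const bool hasDuplicateLIDs)
{
  const int numImports = static_cast<int> (importLIDs.extent (0));
  if (hasDuplicateLIDs) {
    // Two threads might target the same LID, so updates must be atomic.
    Kokkos::parallel_for ("unpack (atomic)", numImports,
      KOKKOS_LAMBDA (const int k) {
        Kokkos::atomic_add (&dst(importLIDs(k)), imports(k));
      });
  } else {
    // Each LID is touched by at most one thread; plain updates suffice and
    // avoid the per-update atomic overhead, even with no contention.
    Kokkos::parallel_for ("unpack (non-atomic)", numImports,
      KOKKOS_LAMBDA (const int k) {
        dst(importLIDs(k)) += imports(k);
      });
  }
}
```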
@jjellio and @tjfulle might be interested in this. I would like to fix this for MultiVector in a way that neatly encapsulates its unpack kernels (and ideally also its pack kernels). Fixing this issue should have the side effect of improving MPI-only performance, which is an important use case for many customers (who do not have hardware that requires MPI + threads for good performance).

Issue #960: Tpetra: Feature request to add "min" CombineMode
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/960 (label: Tpetra-backlog, updated 2017-10-26T20:50:03Z)
*Created by: ikalash*
@mhoemmen
@jrobbin
@trilinos/tpetra
I would like to request the addition of a method equivalent to Epetra_Min to the Tpetra exporter class. It is needed in the Albany code.

Issue #958: Tpetra::CrsMatrix: Possible bug in transposed apply
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/958 (label: Tpetra-backlog, updated 2018-01-23T17:11:16Z)
*Created by: mhoemmen*
@trilinos/tpetra
@ikalash reported the following:
> I am trying to use Tpetra::CrsMatrix apply method with a Teuchos::TRANS combine mode, and the method does not appear to be working correctly. I end up with a vector of zeros even though the operator is nonzero, nor is the input vector. If I set the combine mode to Teuchos::NO_TRANS, things work correctly. I assume Teuchos::TRANS has been tested, so perhaps I am doing something wrong, although I am not sure what it could be. Is there some caveat about the method’s usage with the TRANS combine mode? I printed the Boolean returned when calling hasTransposeApply() and it prints true.
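For reference, a minimal usage sketch (assuming the matrix is fill complete and the vectors have the appropriate Maps). Note that Teuchos::TRANS is an ETransp transpose mode passed to apply, not a Tpetra CombineMode.

```cpp
// Sketch: y = A^T * x with Tpetra::CrsMatrix::apply. The third argument is a
// Teuchos::ETransp transpose mode (NO_TRANS, TRANS, CONJ_TRANS).
#include <Tpetra_CrsMatrix.hpp>
#include <Tpetra_MultiVector.hpp>
#include <Teuchos_BLAS_types.hpp>

using crs_matrix_type = Tpetra::CrsMatrix<>;
using mv_type = Tpetra::MultiVector<>;

void applyTranspose (const crs_matrix_type& A, const mv_type& x, mv_type& y)
{
  // For the transposed apply, x must be in A's *range* Map and y in A's
  // *domain* Map (the reverse of the NO_TRANS case).
  A.apply (x, y, Teuchos::TRANS);
}
```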
@ikalash also sent data, which I'll post here.

Issue #953: LinearSolverFactory: Add a way to add new solvers at run time
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/953 (label: Tpetra-backlog, updated 2017-09-06T23:25:48Z)
*Created by: mhoemmen*
@trilinos/belos
Story: #748
For a given package's LinearSolverFactory, and for a particular template parameter combination, add a way to add new solvers at run time.
For example, if we implement Pipelined CG _just_ for Tpetra objects, we want a way to add this solver to the list of solvers that Belos' factory knows how to create.
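A generic sketch of the kind of run-time registration mechanism being requested (not the actual Trilinos LinearSolverFactory interface); SolverBase, the registry class, and the solver in the usage comment are illustrative stand-ins.

```cpp
// Sketch: a per-factory registry mapping solver names to creation functions,
// so new solvers (e.g., a Tpetra-only pipelined CG) can be added at run time.
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

struct SolverBase {              // stand-in for the solver interface
  virtual ~SolverBase () = default;
};

class SolverRegistry {
public:
  using creator_type = std::function<std::unique_ptr<SolverBase> ()>;

  // Called by a package (or a user library) at run time to add a solver.
  void registerSolver (const std::string& name, creator_type creator) {
    creators_[name] = std::move (creator);
  }

  std::unique_ptr<SolverBase> create (const std::string& name) const {
    auto it = creators_.find (name);
    if (it == creators_.end ()) {
      throw std::invalid_argument ("Unknown solver: " + name);
    }
    return it->second ();
  }

private:
  std::map<std::string, creator_type> creators_;
};

// Usage (MyPipelinedCG is a hypothetical solver implementing SolverBase):
//   registry.registerSolver ("Pipelined CG",
//     [] { return std::make_unique<MyPipelinedCG> (); });
```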
Issue #939: Tpetra::MultiVector: Isolate & do separate ETI for pack & unpack kernels
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/939 (label: Tpetra-backlog, updated 2016-12-17T06:54:09Z)
*Created by: mhoemmen*
@trilinos/tpetra
Pack and unpack kernels for Tpetra::MultiVector currently get built on the fly. It would make sense to isolate those kernels and use the ETI system to build them separately from Tpetra_MultiVector_def.hpp.

Issue #833: Tpetra::CrsGraph::globalAssemble only needs to be called at first fillComplete, if at all
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/833 (label: Tpetra-backlog, updated 2017-10-26T20:42:13Z)
*Created by: mhoemmen*
@trilinos/tpetra
CrsGraph does not (currently) allow structure changes after first fillComplete, so there is no point in calling globalAssemble (with default all-reduce check) at subsequent fillComplete calls.

Issue #832: Tpetra: Improve thread scalability of CrsGraph::fillComplete
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/832 (label: Tpetra-backlog, updated 2017-10-26T20:41:39Z)
*Created by: mhoemmen*
@trilinos/tpetra
Epic: #796
Tasks (incomplete list):
- #799: Thread-parallelize CrsGraph pack and unpack (used in globalAssemble)
- #658: Thread-parallelize makeColMap
- #834 (#660 optional in FY17): Thread-parallelize sort and merge of column indices
- #835: Thread-parallelize computeGlobalConstants
Improve thread scalability of Tpetra::CrsGraph::fillComplete.
This should always refer to first fillComplete, since that's the only time when CrsGraph::fillComplete should need to do anything. (CrsGraph does not allow changes after first fillComplete.)
Here's what CrsGraph::fillComplete does:
1. Get "Sort column Map ghost GIDs" bool parameter (defaults to true)
2. Get "No Nonlocal Changes" bool parameter (defaults to false)
3. Allocate indices, if they have not already been allocated (they have if the graph is nonempty on the calling process)
4. Do global assembly if needed (see (2))
5. Set domain and range Maps (just RCP assignment)
6. If the graph does not already have a column Map, make one (by calling makeColMap)
7. If column indices are stored as global indices, convert them to local
8. Sort the column indices in each row
9. Merge the column indices in each row (inserts do not merge by default; this is different than Epetra)
10. Make the Import and/or Export objects, if needed
11. "Compute local and global constants" (e.g., scan the graph to check if locally upper or lower triangular; do an all-reduce to compute global number of entries, etc.)
12. "Fill local graph": convert from the graph's current local data structure, to an optimized local data structure
Notes:
(1) has consequences for thread parallelization. (See discussion of (6) and (10) below.) Its default state of true gives makeColMap the freedom to relax the original order of GIDs. If false, makeColMap maintains a separate array of GIDs in order to remember the original order. It would be hard to thread-parallelize this step without atomic push_back or a two-pass (count, then fill) approach.
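As an illustration of that two-pass (count, then fill) pattern, a Kokkos sketch (my own, not Tpetra code; the predicate stands in for "is this GID a ghost", and is assumed to be device-callable): first count per-item contributions and exclusive-scan them into offsets, then write into disjoint slots, with no atomic push_back.

```cpp
// Sketch: two-pass "count, then fill" in place of a thread-shared push_back.
#include <Kokkos_Core.hpp>

using exec_space = Kokkos::DefaultExecutionSpace;

// Keep only the entries of 'input' selected by 'pred', without atomics.
template <class Pred>
Kokkos::View<int*, exec_space>
countThenFill (const Kokkos::View<const int*, exec_space>& input, Pred pred)
{
  const int n = static_cast<int> (input.extent (0));

  // Pass 1: count each index's contribution (0 or 1 here) and turn the
  // counts into offsets with an exclusive scan.
  Kokkos::View<int*, exec_space> offsets ("offsets", n + 1);
  Kokkos::parallel_scan ("count+scan", n,
    KOKKOS_LAMBDA (const int i, int& update, const bool finalPass) {
      const int count = pred (input(i)) ? 1 : 0;
      if (finalPass) {
        offsets(i) = update;
        if (i + 1 == n) { offsets(n) = update + count; }
      }
      update += count;
    });

  // Fetch the total on the host to size the output.
  auto offsets_h = Kokkos::create_mirror_view (offsets);
  Kokkos::deep_copy (offsets_h, offsets);
  Kokkos::View<int*, exec_space> output ("output", offsets_h(n));

  // Pass 2: each index writes into its own slot; no atomics needed.
  Kokkos::parallel_for ("fill", n, KOKKOS_LAMBDA (const int i) {
    if (pred (input(i))) { output(offsets(i)) = input(i); }
  });
  return output;
}
```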
(2) means that by default, fillComplete must do an extra all-reduce in order to tell whether any processes have inserted into nonowned rows on any process. It should only need to do this all-reduce at first fillComplete, but see #833.
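The check itself amounts to a single max all-reduce over a 0/1 flag; a sketch (function and variable names are illustrative, not Tpetra's):

```cpp
// Sketch: "did any process insert into nonowned rows?" as one max all-reduce.
// Issue #833 is about skipping this after the first fillComplete.
#include <Teuchos_Comm.hpp>
#include <Teuchos_CommHelpers.hpp>

bool anyProcessHasNonlocalChanges (const Teuchos::Comm<int>& comm,
                                   const bool myProcessHasNonlocalChanges)
{
  const int lclFlag = myProcessHasNonlocalChanges ? 1 : 0;
  int gblFlag = 0;
  Teuchos::reduceAll<int, int> (comm, Teuchos::REDUCE_MAX,
                                lclFlag, Teuchos::outArg (gblFlag));
  return gblFlag != 0;
}
```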
(3) is part of Crs{Graph,Matrix}'s lazy allocation strategy. This makes these objects use constant storage per process if the graph / matrix is empty. However, lazy allocation hinders thread parallelization of structure changes through Tpetra's interface. It's something that we will need to address at some point.
(4): CrsMatrix::globalAssemble is expensive compared with the two-matrix Export approach. However, CrsGraph does not (currently) allow structure changes after first fillComplete, so this is less of a concern. Note that the two-graph Export approach depends on #799 for thread parallelization.
(6) and (10) are opportunities for thread parallelization. There is an outstanding issue even in Bugzilla to merge makeColMap and makeImportExport. This has benefits for reducing MPI communication, by not throwing away information (what process owns what column indices in the column Map) that may have required MPI communication to get. Merging the two methods would involve a new Import constructor that uses this process ownership information to bypass some of the normal Import setup process. Merging makeColMap and makeImportExport would thus let us get away with not thread-parallelizing the normal Import constructor.
(7): If the CrsGraph is StaticProfile (argument to its constructor, governing pre-first-fillComplete storage format), makeIndicesLocal is already thread parallel. However, if the CrsGraph is DynamicProfile, this is not thread parallel. Fixing this would incur some sequential overhead or require a data structure change, due to current use of Teuchos memory management classes, which are not thread safe (see #229). It's more efficient anyway to fill a StaticProfile CrsGraph, so we recommend that applications take this approach. See Tpetra Lesson 04 on different fill strategies.
(8), (9): Inserting into a CrsGraph does not currently merge entries in the same row with the same column index. That's why sort and merge are separate steps. Regardless, a one-level thread parallelization of sort and merge is not hard. For each row, we would call a sequential sort. GPU-izing or vectorizing this is much harder; it calls for a segmented sort (and merge). See #662.
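A sketch of that one-level parallelization (not the Tpetra implementation): a parallel_for over rows, with a sequential sort within each row. An in-place insertion sort is used here because std::sort is not device-callable.

```cpp
// Sketch: sort column indices row by row, one row per thread / work item.
// A segmented sort (and merge) would be needed to GPU-ize or vectorize this
// further (see #662).
#include <Kokkos_Core.hpp>

using exec_space = Kokkos::DefaultExecutionSpace;

void sortColumnIndicesPerRow (const Kokkos::View<const int*, exec_space>& rowOffsets,
                              const Kokkos::View<int*, exec_space>& columnIndices)
{
  const int numRows = static_cast<int> (rowOffsets.extent (0)) - 1;
  Kokkos::parallel_for ("sort rows", numRows, KOKKOS_LAMBDA (const int row) {
    const int beg = rowOffsets(row);
    const int end = rowOffsets(row + 1);
    // Sequential insertion sort of this row's column indices.
    for (int i = beg + 1; i < end; ++i) {
      const int key = columnIndices(i);
      int j = i - 1;
      while (j >= beg && columnIndices(j) > key) {
        columnIndices(j + 1) = columnIndices(j);
        --j;
      }
      columnIndices(j + 1) = key;
    }
  });
}
```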
(11) reads all the graph's entries. It is not yet thread parallel. Making this thread parallel may call for doing (12) (fillLocalGraph) before (11). This would ensure that the graph's local data structure is a Kokkos::StaticCrsGraph, and thus safe and optimized for parallel kernels. We could also merge some of the local work in (11) with (12). This could mitigate some of the costs of the DynamicProfile or of the unpacked StaticProfile cases in fillLocalGraph.
(12): Copy data structure from DynamicProfile ("2-D," array of arrays) or unpacked StaticProfile (compressed sparse row, with extra space in some rows), to packed StaticProfile (compressed sparse row, with no extra space in any row). The StaticProfile case (unpacked to packed) is already thread parallel. Above notes on (7) explain why the DynamicProfile case would take work to make thread parallel. Best practice for applications would be to use StaticProfile graphs only.

Issue #801: Tpetra: Make BlockCrsMatrix do thread-parallel pack & unpack
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/801 (label: Tpetra-backlog, updated 2017-10-26T20:41:07Z)
*Created by: mhoemmen*
@trilinos/tpetra
Story: #797

Issue #799: Tpetra: Make CrsGraph do thread-parallel pack & unpack
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/799 (label: Tpetra-backlog, updated 2018-05-16T04:14:09Z)
*Created by: mhoemmen*
@trilinos/tpetra
Story: #797

Issue #797: Tpetra: Improve thread scalability of Import/Export
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/797 (label: Tpetra-backlog, updated 2017-10-26T20:39:02Z)
*Created by: mhoemmen*
@trilinos/tpetra
Epic: #796
[mfh edit 13 Jul 2017: promote the transferAndFillComplete task, #802, into its own story]
This involves several tasks, not all fully identified:
- [x] Change Tpetra::Details::PackTraits to support thread-parallel pack & unpack (#798)
- [ ] Make CrsGraph use the new PackTraits interface to do thread-parallel pack & unpack (#799)
- [x] Ditto for CrsMatrix (#800)
- [ ] Ditto for BlockCrsMatrix (#801)
- [x] Ditto for CrsMatrix::transferAndFillComplete (#802)
For #800, #801, and #802, we need to do performance tests to make sure that the changes thread-scale without sacrificing performance in the MPI-only case. It may make sense to have a non-threaded implementation if the number of threads is 1. (Some users may use Tpetra's OpenMP back-end without realizing it, but run with 1 thread per MPI process. That's why this should be a run-time decision rather than a decision based on the back-end type.)
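A sketch of that run-time decision (names are illustrative; Tpetra's actual pack/unpack code is more involved). The point is to query the thread count at run time rather than branch on the back-end type.

```cpp
// Sketch: choose a sequential pack path when only one thread is available,
// even if a threaded back-end (e.g., OpenMP) is enabled at configure time.
#include <Kokkos_Core.hpp>

template <class PackSequential, class PackThreaded>
void packWithRuntimeDispatch (PackSequential packSequential,
                              PackThreaded packThreaded)
{
  // concurrency() reports how many threads the execution space can actually
  // use at run time, which is what matters here -- not the back-end type.
  const int numThreads = Kokkos::DefaultExecutionSpace ().concurrency ();
  if (numThreads == 1) {
    packSequential ();  // avoid threading overhead in the MPI-only case
  } else {
    packThreaded ();
  }
}
```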
Issue #768: Tpetra: Implement "dual view" semantics
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/768 (label: Tpetra-backlog, updated 2017-10-26T20:38:32Z)
*Created by: mhoemmen*
@trilinos/tpetra
Epic for collecting all stories relating to "dual view" semantics in Tpetra.

Issue #767: Tpetra: Overlap communication and computation
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/767 (label: Tpetra-backlog, updated 2017-10-26T20:37:53Z)
*Created by: mhoemmen*
@trilinos/tpetra
Epic for tracking all stories relating to overlapping communication and computation in Tpetra.

Issue #769: Tpetra: Memory-scalable I/O
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/769 (label: Tpetra-backlog, updated 2016-11-02T20:57:52Z)
*Created by: mhoemmen*
@trilinos/tpetra
Epic for collecting all stories relating to memory-scalable I/O in Tpetra.

Issue #697: Tpetra::RowMatrixTransposer: Generalize test that depends unnecessarily on GO=int
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/697 (label: Tpetra-backlog, updated 2016-11-02T20:14:11Z)
*Created by: mhoemmen*
@trilinos/tpetra
The Tpetra::RowMatrixTransposer test currently only builds and runs when GO=int is enabled. First, fix the test so that it is templated on GO (and LO and Node). Then, fix the CMakeLists.txt file (tpetra/core/test/RowMatrixTransposer/CMakeLists.txt), so that it builds and runs the test unconditionally, whether or not GO=int is enabled (see #74). Use #695 and #696 as guides.
Issue #689: Tpetra: Is there a way to copy examples' .cpp files into user's build dir, then have Makefiles there?
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/689 (label: Tpetra-backlog, updated 2016-11-02T20:12:14Z)
*Created by: mhoemmen*
@trilinos/tpetra
@tjfulle asks this. Maybe this could be like Kokkos' tutorial, with a simple Makefile in each directory.
Issue #682: Tpetra::Crs{Graph,Matrix}: Add local_offset_type typedef
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/682 (label: Tpetra-backlog, updated 2016-11-02T20:11:36Z)
*Created by: mhoemmen*
@trilinos/tpetra Per request by @kddevin (see #674 discussion), add a `local_offset_type` typedef to Tpetra::CrsGraph and Tpetra::CrsMatrix. This type tells users the type that Tpetra uses to store row offsets, in the local sparse graph / matrix. The `local_` prefix makes it clear that this refers to the _local_ data structure.
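A sketch of what the requested typedef might alias and how it could be used; `local_offset_type` does not exist yet, so the name and the exact definition below are hypothetical.

```cpp
// Hypothetical sketch (not in Tpetra): a trait exposing the row-offset type
// of the local sparse structure, which the proposed typedef would provide
// directly on CrsGraph / CrsMatrix.
#include <Tpetra_CrsMatrix.hpp>

template <class CrsMatrixType = Tpetra::CrsMatrix<>>
struct LocalOffsetType {
  // The type the local sparse graph / matrix uses to store row offsets.
  using type = typename CrsMatrixType::local_matrix_type::size_type;
};

// Once the typedef exists, users could instead write generic code such as:
//   using offset_type = typename Tpetra::CrsMatrix<>::local_offset_type;
// without hard-coding size_t or digging into the Kokkos local matrix type.
```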