Trilinos issues (https://gitlab.osti.gov/jmwille/Trilinos/-/issues)

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/115
Add an option for switching between storing block inverse or block factors storage in block Jacobi and Gauss-Seidel (2016-03-03T17:36:36Z, James Willenbring)
*Created by: aprokop*
@trilinos/ifpack2 @mhoemmen
Ifpack2 now stores and applies the inverse of the matrix's block diagonal explicitly (#96). Theoretically, this is a tradeoff: if the block size is large enough, you can get better performance out of block linear solves (applying stored factors) than out of full inversion when doing a limited number of iterations. So a user should be able to choose whichever approach is appropriate.
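For concreteness, here is a rough sketch of what the two options look like for a single dense diagonal block (illustrative only; the function name and the dense Teuchos LAPACK/BLAS approach are my assumptions, not Ifpack2's actual interface): either keep the LU factors and apply them with triangular solves, or form the explicit inverse once and apply it with a matrix-vector product.

```c++
#include <Teuchos_BLAS.hpp>
#include <Teuchos_LAPACK.hpp>
#include <vector>

// Apply the inverse of one dense n-by-n diagonal block to x, either via the
// stored LU factors (GETRS) or via an explicitly formed inverse (GETRI + GEMV).
void applyBlock(int n, std::vector<double>& block, std::vector<double>& x,
                bool storeInverse)
{
  Teuchos::LAPACK<int, double> lapack;
  Teuchos::BLAS<int, double>   blas;
  std::vector<int> ipiv(n);
  int info = 0;

  lapack.GETRF(n, n, block.data(), n, ipiv.data(), &info);  // LU factors either way

  if (storeInverse) {
    // Option B: explicit inverse (what Ifpack2 does after #96); apply = GEMV.
    std::vector<double> work(n * n);
    lapack.GETRI(n, block.data(), n, ipiv.data(), work.data(), n * n, &info);
    std::vector<double> y(n, 0.0);
    blas.GEMV(Teuchos::NO_TRANS, n, n, 1.0, block.data(), n,
              x.data(), 1, 0.0, y.data(), 1);
    x = y;
  } else {
    // Option A: keep the LU factors; apply = forward/back substitution.
    lapack.GETRS('N', n, 1, block.data(), n, ipiv.data(), x.data(), n, &info);
  }
}
```

In a block relaxation sweep the factorization or inversion happens once at setup, while the apply happens every iteration, which is where the tradeoff described above shows up.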
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/725
Nalu diffs due to latest Trilinos update (2016-10-27T15:57:29Z, James Willenbring)
*Created by: spdomin*
Bad test suite as of Trilinos SHA1:
commit 5320963fbe2ec9aeba1dc62871e4be4f44961970
Author: Tobias Wiesner tawiesn@sandia.gov
Date: Wed Oct 19 14:43:41 2016 -0600
Last good:
NaluCFD/Nalu SHA1: d2adbe8
Trilinos/master SHA1: 12fd285b28abeb7336a8d4f17913819b9ce7015b
This bisect is complicated in that the builds of Trilinos fail either due to configure errors or flat out build issues.
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/813
const of subview model in intrepid2 (2016-11-11T17:07:40Z, James Willenbring)
*Created by: bathmatt*
@eric-c-cyr @kyungjoo-kim @rppawlo
I tested some of the kokkos refactor branch in intrepid2. I was looking at the serial getValues inside of a kokkos parallel_for. Here is a simple code example:

```c++
struct foo{
DynRankView<double> in, out;
foo(DynRankView<double> in_,DynRankView<double> out_):in(in_),out(out_){}
KOKKOS_INLINE_FUNCTION
void operator ()(int i)const {
typedef Basis_HGRAD_TRI_C1_FEM Basis;
Basis::Serial<Intrepid2::OPERATOR_VALUE>::getValues(subview(out,i,ALL()), subview(in,i,ALL()));
}
};
```
and a hard-coded version:

```c++
void operator ()(int i)const {
out(i,0) = 1. - in(i,0) - in(i,1);
out(i,1) = in(i,0);
out(i,2) = in(i,1);
}
```
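For reference, a hypothetical harness for the timing described next might look like the following (assumptions on my part: `DynRankView` here is `Kokkos::DynRankView`, the extents 2 and 3 match the C1 triangle basis, and `foo` is the subview-based functor above). It is only a sketch, not the code that produced the numbers below.

```c++
#include <Kokkos_Core.hpp>
#include <Kokkos_DynRankView.hpp>
#include <chrono>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 10000000;                       // 10M work items, as below
    Kokkos::DynRankView<double> in("in", N, 2);   // reference coordinates
    Kokkos::DynRankView<double> out("out", N, 3); // basis values

    auto t0 = std::chrono::steady_clock::now();
    Kokkos::parallel_for(N, foo(in, out));        // the subview-based functor above
    Kokkos::fence();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    std::printf("subview version: %.3f s\n", dt.count());
  }
  Kokkos::finalize();
  return 0;
}
```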
Now when I call this with a 10M-length view on a single core it takes 0.09s; if I skip the subview step and hard-code the response it takes 0.03s.
From that I'm assuming that the subview creation accounts for roughly 0.06s of the run time.
This is lower than I assumed it would be, but it is still 70% of the cost. I'm not suggesting we fix this today, but we may want to in the future.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1329
Stop using MPI_Reduce_scatter (in Epetra and elsewhere) (2017-05-26T00:12:49Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/xpetra @trilinos/muelu @trilinos/zoltan @trilinos/zoltan2
At least two platforms report performance and/or memory issues with `MPI_Reduce_scatter` at scale. (See Bugzilla Bug 6336, and my Tpetra comments from 09 Jan 2012.) The easiest work-around is to replace calls to that function with `MPI_Reduce` followed by `MPI_Scatter`. Tpetra has done this for a few years already, for performance reasons. It looks like there used to be an MPICH (correctness) bug in `MPI_Reduce_scatter`, since Epetra and Zoltan both have work-arounds. Zoltan always just calls `MPI_Reduce` followed by `MPI_Scatter`:
https://github.com/trilinos/Trilinos/blob/f0350316239aaf41e8e6b82378612aaedf9cc72c/packages/zoltan/src/Utilities/Communication/comm_invert_map.c
Epetra has an option to do the right thing, depending on the `REDUCE_SCATTER_BUG` macro (see line 570 of epetra/src/Epetra_MpiDistributor.cpp):
https://github.com/trilinos/Trilinos/blob/06f5eea36d0b40751717bc9ee995b9f5f1184a23/packages/epetra/src/Epetra_MpiDistributor.cpp
It would make sense to elevate that macro to a CMake option, if it is not one already.
This is relevant to the Xpetra and MueLu developers, in case they are using the Epetra back-end.
@pwxy @spdomin
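For illustration, a minimal sketch of the work-around described above (a hypothetical helper of my own, assuming `int` data and `MPI_SUM`; it is not Epetra's or Zoltan's actual code):

```c++
#include <mpi.h>
#include <numeric>
#include <vector>

// Emulate MPI_Reduce_scatter: element-wise sum everyone's full-length buffer
// on rank 0, then scatter each rank's segment back to it.
void reduce_scatter_workaround(const int* sendbuf, int* recvbuf,
                               const std::vector<int>& recvcounts, MPI_Comm comm)
{
  int rank = 0, nprocs = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);

  const int total = std::accumulate(recvcounts.begin(), recvcounts.end(), 0);

  // Step 1: reduce the full-length buffers onto rank 0.
  std::vector<int> reduced(rank == 0 ? total : 0);
  MPI_Reduce(sendbuf, rank == 0 ? reduced.data() : nullptr, total,
             MPI_INT, MPI_SUM, 0, comm);

  // Step 2: scatter the per-process segments of the reduced buffer.
  std::vector<int> displs(nprocs, 0);
  for (int p = 1; p < nprocs; ++p) displs[p] = displs[p - 1] + recvcounts[p - 1];
  MPI_Scatterv(rank == 0 ? reduced.data() : nullptr, recvcounts.data(),
               displs.data(), MPI_INT, recvbuf, recvcounts[rank], MPI_INT, 0, comm);
}
```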
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1455
Tpetra::MultiVector::Dot() slower than Epetra for short vectors, faster for long vectors (2017-10-30T19:52:03Z, James Willenbring)
*Created by: CamelliaDPG*
The driver pasted below allows CLI arguments to control the length of two vectors that are dotted together, as well as the number of such dot products that should be done. It does this with both `Tpetra::MultiVector` and `Epetra_Vector` objects, with a serial distribution, and reports the cumulative time for each of Epetra and Tpetra. Here are the timing results from a run on a shared Intel Xeon E7-4880 machine (times in seconds):
numDots | globalRowCount | Epetra | Tpetra | Slowdown
--------- | -----------------| -------| --------| --------
1000000 | 50 | 0.43 | 8.85 | 21x
100000 | 500 | 0.11 | 2.86 | 26x
10000 | 5000 | 0.07 | 2.25 | 32x
1000 | 50000 | 0.08 | 2.20 | 28x
100 | 500000 | 0.07 | 2.17 | 31x
10 | 5000000 | 0.15 | 2.29 | 15x
1 | 50000000 | 0.15 | 2.29 | 15x
Can this be addressed? From some profiling, it does appear that part of the story involves calls to `getVector()`, which result in new Tpetra Vector objects being created in each dot product (a shallow copy). That is probably why the slowdown is 15x as the vectors get larger, but as high as 32x for smaller vectors: with the smaller vectors you see the cost of the object creation and destruction. It appears, though, that the calls to `getVector()` are only about half the story: there is still a 15x slowdown to account for.
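One way to test that hypothesis (a hypothetical fragment of my own, not part of the driver below; it reuses the driver's `serialMapTpetra`, `a_localView`, `b_localView`, and `numDots`, and assumes a host-only build) is to time the same dot product as a plain Kokkos reduction over the raw local views, which sidesteps the per-call Vector construction inside `MultiVector::dot()`:

```c++
// Hypothetical addition to the driver's timing loop: a raw reduction over the
// local views, with a timer mirroring the existing Epetra/Tpetra counters.
RCP<Time> rawTimer = TimeMonitor::getNewCounter("Raw Kokkos dot");
for (int i = 0; i < numDots; i++)
{
  double rawResult = 0.0;
  rawTimer->start();
  rawTimer->incrementNumCalls();
  Kokkos::parallel_reduce("raw dot", serialMapTpetra->getNodeNumElements(),
    KOKKOS_LAMBDA (const int localOrdinal, double& sum) {
      sum += a_localView(localOrdinal, 0) * b_localView(localOrdinal, 0);
    }, rawResult);
  rawTimer->stop();
}
```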
```C++
// Epetra
#include <Epetra_Map.h>
#include <Epetra_SerialComm.h>
#include <Epetra_Vector.h>
// Tpetra
#include <Tpetra_MultiVector.hpp>
// Teuchos
#include <Teuchos_Comm.hpp>
#include <Teuchos_RCP.hpp>
#include <Teuchos_TimeMonitor.hpp>
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
using namespace Teuchos;
typedef long long GlobalOrdinal; // signed types preferred
typedef int LocalOrdinal; // signed types preferred
typedef double Scalar;
typedef Tpetra::Map<LocalOrdinal, GlobalOrdinal>::node_type NodeType;
typedef Tpetra::Map<LocalOrdinal,GlobalOrdinal,NodeType> TpetraMap;
typedef Tpetra::MultiVector<Scalar,LocalOrdinal,GlobalOrdinal, NodeType> TpetraMultiVector;
int main(int argc, char *argv[])
{
int numDots = 10000;
int globalRowCount = 1500;
if (argc == 2)
{
numDots = atoi(argv[1]);
cout << "setting numDots to " << numDots << endl;
}
if (argc == 3)
{
numDots = atoi(argv[1]);
globalRowCount = atoi(argv[2]);
}
cout << "numDots is " << numDots << endl;
cout << "globalRowCount is " << globalRowCount << endl;
RCP<Time> epetraTimer = TimeMonitor::getNewCounter("Epetra dot");
RCP<Time> tpetraTimer = TimeMonitor::getNewCounter("Tpetra dot");
/*** Epetra setup ***/
Epetra_SerialComm serialComm;
Epetra_Map serialMap(globalRowCount, 0, serialComm);
RCP<Epetra_Vector> a_epetra = rcp( new Epetra_Vector(serialMap) );
RCP<Epetra_Vector> b_epetra = rcp( new Epetra_Vector(serialMap) );
for (int localOrdinal=0; localOrdinal<serialMap.NumMyElements(); localOrdinal++)
{
(*a_epetra)[localOrdinal] = 2.0;
(*b_epetra)[localOrdinal] = 0.5;
}
/*** Tpetra setup ***/
RCP<Teuchos::Comm<int>> serialCommTeuchos = Teuchos::rcp( new Teuchos::SerialComm<int>() );
RCP<const TpetraMap> serialMapTpetra = rcp( new TpetraMap(globalRowCount, 0, serialCommTeuchos));
RCP<TpetraMultiVector> a_tpetra = rcp( new TpetraMultiVector(serialMapTpetra, 1)); // numVectors = 1
RCP<TpetraMultiVector> b_tpetra = rcp( new TpetraMultiVector(serialMapTpetra, 1)); // numVectors = 1
auto a_localView = a_tpetra->getLocalView<Kokkos::HostSpace>();
auto b_localView = b_tpetra->getLocalView<Kokkos::HostSpace>();
for (int localOrdinal=serialMapTpetra->getMinLocalIndex(); localOrdinal<=serialMapTpetra->getMaxLocalIndex(); localOrdinal++)
{
a_localView(localOrdinal,0) = 2.0;
b_localView(localOrdinal,0) = 0.5;
}
for (int i=0; i<numDots; i++)
{
double epetraResult;
int err;
epetraTimer->start();
epetraTimer->incrementNumCalls();
err = a_epetra->Dot(*b_epetra, &epetraResult);
epetraTimer->stop();
if (err != 0)
{
cout << "Epetra dot returned err code " << err << ". Halting execution.\n";
exit(err);
}
std::vector<Scalar> tpetraResult( 1 );
tpetraTimer->start();
tpetraTimer->incrementNumCalls();
a_tpetra->dot(*b_tpetra, tpetraResult);
tpetraTimer->stop();
}
TimeMonitor::summarize(serialCommTeuchos.ptr());
return 0;
}
```

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1613
Issues with TentativePFactory for "use kokkos refactor"=true on KNL? (2017-09-11T05:46:30Z, James Willenbring)
*Created by: pwxy*
I built drekar for develop Trilinos (cloned repo early Mon Aug 14 morning).
I built on ellis for KNL with the Intel 17 compiler, with the following CMake options:
```
-D Tpetra_ENABLE_MMM_Timings=ON \
-D MueLu_ENABLE_Experimental:BOOL=ON \
-D MueLu_ENABLE_Kokkos_Refactor:BOOL=ON \
-D Xpetra_ENABLE_Experimental:BOOL=ON \
-D Xpetra_ENABLE_Kokkos_Refactor:BOOL=ON \
```
The MueLu setup time for the drekar run was 14.6 sec for 1 MPI rank with 16 OMP threads.
I then added
`<Parameter name="use kokkos refactor" type="bool" value="true"/>`
to the MueLu parameter list, and the MueLu setup time increased to 3557s, roughly 244x slower.
Here are all the MueLu setup timers that are over 100s:
```
3557.00000 MueLu: N5MueLu9HierarchyIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Setup (total)
3557.00000 MueLu: N5MueLu9HierarchyIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Setup (total, level=1)
3557.00000 MueLu: N5MueLu10RAPFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Computing Ac (total)
3557.00000 MueLu: N5MueLu18RepartitionFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total)
3557.00000 MueLu: N5MueLu24RebalanceTransferFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total, level=1)
3557.00000 MueLu: N5MueLu18RepartitionFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total, level=1)
3557.00000 MueLu: N5MueLu24RebalanceTransferFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total)
3556.00000 MueLu: N5MueLu10RAPFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Computing Ac (total, level=1)
3554.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total)
3554.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total, level=1)
3391.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Stage 1 (LocalQR) (sub, total)
3391.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (level=1)
3391.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build
3391.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Stage 1 (LocalQR) (sub, total, level=1)
163.10000 MueLu: N5MueLu34UncoupledAggregationFactory_kokkosIixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total)
163.00000 MueLu: N5MueLu34UncoupledAggregationFactory_kokkosIixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total, level=0)
161.10000 MueLu: N5MueLu26CoalesceDropFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total)
161.10000 MueLu: N5MueLu26CoalesceDropFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total, level=0)
160.90000 MueLu: N5MueLu26CoalesceDropFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (level=0)
160.90000 MueLu: N5MueLu26CoalesceDropFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build
```
Edit (@aprokop): added quoting
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1646
MueLu::ParameterListInterpreter::ParameterListInterpreter seems to take a lot more time than expected (2017-12-22T15:59:12Z, James Willenbring)
*Created by: pwxy*
MueLu::CreateXpetraPreconditioner calls both MueLu::HierarchyManager::SetupHierarchy and MueLu::ParameterListInterpreter::ParameterListInterpreter.
I'm running a Chebyshev smoother, and I thought that the smoothers were being constructed by MueLu::HierarchyManager::SetupHierarchy rather than MueLu::ParameterListInterpreter::ParameterListInterpreter. However, I saw that, several levels down the stack trace, MueLu::ParameterListInterpreter::ParameterListInterpreter calls MueLu::ParameterListInterpreter::UpdateFactoryManager_Smoothers, which, for presmoothing, calls:
`preSmoother = rcp(new SmootherFactory(rcp(new TrilinosSmoother(preSmootherType, preSmootherParams, overlap))));`
So it appears that MueLu::ParameterListInterpreter::ParameterListInterpreter is constructing the smoothers?
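One way to confirm where the time goes is to put separate timers around the two phases. A hedged sketch follows (a hypothetical snippet of my own, not MueLu's code; `paramList`, `comm`, `A`, and the `SC,LO,GO,NO` template parameters are assumed to exist in the calling application, and the structure only roughly mirrors what MueLu::CreateXpetraPreconditioner does):

```c++
#include <MueLu_Hierarchy.hpp>
#include <MueLu_ParameterListInterpreter.hpp>
#include <Teuchos_RCP.hpp>
#include <Teuchos_TimeMonitor.hpp>

Teuchos::RCP<MueLu::HierarchyManager<SC,LO,GO,NO> > mueLuFactory;
Teuchos::RCP<MueLu::Hierarchy<SC,LO,GO,NO> >        H;
{
  // Time the parameter list interpretation (which builds the smoother factories).
  Teuchos::TimeMonitor tm(*Teuchos::TimeMonitor::getNewTimer("ParameterListInterpreter ctor"));
  mueLuFactory = Teuchos::rcp(new MueLu::ParameterListInterpreter<SC,LO,GO,NO>(paramList, comm));
}
{
  // Time the hierarchy setup separately.
  Teuchos::TimeMonitor tm(*Teuchos::TimeMonitor::getNewTimer("SetupHierarchy"));
  H = mueLuFactory->CreateHierarchy();
  H->GetLevel(0)->Set("A", A);
  mueLuFactory->SetupHierarchy(*H);
}
Teuchos::TimeMonitor::summarize();
```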
I've noticed that the time spent in MueLu::ParameterListInterpreter::ParameterListInterpreter is rather large, and most of this time is due to the construction of the presmoother (I have both pre- and post-smoothing).

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1676
MueLu: Proposal: merge CoalesceDrop and FilteredA factories (2017-09-29T23:47:29Z, James Willenbring)
*Created by: aprokop*
@trilinos/muelu @jhux2 @tawiesn
**Summary**
Right now both factories do very similar things. Each one has a loop that goes through a matrix and filters some entries. The proposal combines those loops into one, thus speeding up execution.
**Description**
The current `CoalesceDropFactory` is responsible for a single thing: constructing an (amalgamated) filtered graph that is later used in the aggregation. For the filtered scenario (when the drop tolerance is not 0) it has a loop that goes through the rows one by one and, for each row entry, determines whether to drop it (based on the original matrix or the distance Laplacian). It simultaneously constructs the compressed graph (LWGraph).
The construction of the filtered matrix in the `FilteredAFactory` has two variants: 1) reusing the graph of the original matrix and zeroing out entries; and 2) constructing a brand new matrix with the compressed graph. In both cases, the looping is done through rows and uses a `filter` array that helps to determine which matrix values to drop/zero out.
The loop in the `FilteredAFactory` is remarkably similar to the loop in the `CoalesceDropFactory`. In fact, the only difference is that the latter constructs only rows and columns, while the former also constructs values. The work is duplicated, and in fact it's even more expensive to go through the matrix the second time, as we have to determine *again* which entries to filter.
**Proposed solution**
In the proposed solution, the `CoalesceDropFactory` (or its renamed version) will construct both the `LWGraph` that is used in aggregation **and**, if desired, the filtered matrix. The `FilteredAFactory` goes away. The filtering loop in the factory will construct rows, columns, and values. For the block variant it will also construct coalesced rows and columns.
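A rough sketch of the merged single pass (illustrative only: hypothetical names and plain CRS arrays instead of MueLu's Xpetra/Kokkos types):

```c++
#include <cmath>
#include <cstddef>
#include <vector>

// One sweep over A that builds both the graph used for aggregation and the
// filtered matrix, instead of repeating the drop decision in two factories.
void filterAndBuildGraph(int numRows,
                         const std::vector<std::size_t>& rowPtr,
                         const std::vector<int>& colInd,
                         const std::vector<double>& values,
                         double dropTol,
                         std::vector<std::size_t>& graphRowPtr,
                         std::vector<int>& graphColInd,
                         std::vector<std::size_t>& filteredRowPtr,
                         std::vector<int>& filteredColInd,
                         std::vector<double>& filteredValues)
{
  graphRowPtr.assign(numRows + 1, 0);
  filteredRowPtr.assign(numRows + 1, 0);
  graphColInd.clear();
  filteredColInd.clear();
  filteredValues.clear();

  for (int row = 0; row < numRows; ++row) {
    graphRowPtr[row] = graphColInd.size();
    filteredRowPtr[row] = filteredColInd.size();
    for (std::size_t k = rowPtr[row]; k < rowPtr[row + 1]; ++k) {
      const int col = colInd[k];
      const double val = values[k];
      // Stand-in for the classical or distance-Laplacian drop criterion.
      const bool keep = (row == col) || (std::abs(val) > dropTol);
      if (!keep) continue;
      graphColInd.push_back(col);     // feeds the LWGraph / aggregation
      filteredColInd.push_back(col);  // feeds the filtered matrix
      filteredValues.push_back(val);
    }
  }
  graphRowPtr[numRows] = graphColInd.size();
  filteredRowPtr[numRows] = filteredColInd.size();
}
```

In the actual factories the graph is amalgamated and the filtered matrix may reuse A's graph with zeroed entries, but the point is the same: one traversal, one drop decision per entry.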
**Benefits**
Looping through the level matrix is done only once, potentially achieving a significant speedup, especially when reusing the matrix graph.
This would benefit applications that use MueLu with filtering.
**Q&A**
*Are there issues with block systems?*
So far, I don't see any issues. The block systems are treated by first filtering a block row, and then coalescing it. Constructing the filtered matrix is independent of coalescing, though they both depend on the filtering.
*What about a special non-lightweight graph branch in `CoalesceDropFactory`*?
Tricky question. I don't have a good answer as I don't know what that does and what is special about it. @tawiesn ?
*Does it benefit the Kokkos version?*
It should benefit the Kokkos version in the same way. In fact, it could be merged with another idea where we do not even construct the compressed graph for `LWGraph_kokkos`, but rather provide a wrapper around the graph of the original matrix. This would eliminate the second loop in the current `CoalesceDropFactory_kokkos`.
*What are the steps to implement this?*
I will start implementing it in the kokkos-refactor branch of MueLu for the scalar case. This can be done independently from the default branch by modifying the dependency tree in the `ParameterListInterpreter`. If the results demonstrate the feasibility and speedup of this approach, we could discuss backporting it to the non-Kokkos version.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1727
Cori KNL: best practices (2017-09-27T08:27:47Z, James Willenbring)
*Created by: jhux2*
I am opening this issue to record best practices for achieving good performance on the Cori KNL partition. @jjellio Could I ask you to document your recommendations in this ticket? I've had a number of questions regarding this. Thanks!
@sayerhs @aprokop @alanw0 @spdomin

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1743
MueLu: Proposal: use special form of the tentative matrix for smoothed prolongator construction (2017-09-27T20:12:55Z, James Willenbring)
*Created by: aprokop*
@trilinos/muelu @jhux2
## Summary
The tentative factory produces a matrix of a special form. This special form can be used to speed up the matrix-matrix multiply procedure. For example, when the nullspace is a constant vector, multiplying by the tentative matrix is equivalent to averaging columns of the matrix A. In addition, the tentative matrix can be stored in its "multivector" form, with columns compacted into multivectors, as the aggregates do not overlap.
## Description
Let us first start with the situation where we *do* construct the QR factorization.
In this case, let us store the constructed Q factors in a single multivector. The multivector is based on the same map that the current Ptent matrix is on, i.e. the row map of A. The multiplication of A by P could be seen as similar to matrix-vector multiplication, with the main difference being that instead of summing products into a single value, we put product pairs into the bins corresponding to aggregates. To do that, we need to be able to translate local column ids to global aggregate ids (ids in the `coarseMap`). This may require an import of an integer vector.
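A very rough serial sketch of that "binned" matvec (hypothetical names, plain CRS storage, a single nullspace vector; a real implementation would compress the per-row bins instead of storing them densely, and this is not MueLu code):

```c++
#include <cstddef>
#include <vector>

// Compute A * Ptent, where Ptent is stored as the vector q (the local Q
// column) plus col2agg, which maps a local column id of A to the aggregate
// (coarse) id owning it.  Products are accumulated into per-aggregate bins
// instead of a single scalar per row.
void applyAtoPtent(int numRows, int numAggregates,
                   const std::vector<std::size_t>& rowPtr,
                   const std::vector<int>& colInd,
                   const std::vector<double>& values,
                   const std::vector<double>& q,
                   const std::vector<int>& col2agg,
                   std::vector<std::vector<double> >& AP)
{
  AP.assign(numRows, std::vector<double>(numAggregates, 0.0));
  for (int row = 0; row < numRows; ++row)
    for (std::size_t k = rowPtr[row]; k < rowPtr[row + 1]; ++k) {
      const int col = colInd[k];
      // Bin the product by the aggregate that owns this column of A.
      AP[row][col2agg[col]] += values[k] * q[col];
    }
}
```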
## Q&A
*What changes if we do not want to do QR decomposition?*
In this case, the TentativePFactory will set the "P" multivector to be the nullspace.
*What is the expected performance improvement?*
Ideally, we would skip the global tentative matrix construction, and instead just construct a local multivector and then do an import (and, if the domain map of A is the same as the row map, we already have the Import object for it). In the SaPFactory, the hope is that the matrix-vector multiply-like procedure is about 2-3 times as expensive as a regular matvec and is significantly cheaper than the full matrix-matrix multiplication. We should be able to do it using just local indices and parallelize over threads. There will be an additional cost of compression, as we don't know the number of nonzeros per row in the final matrix.
*How to optimally do the matrix-vector like procedure?*
An open question. There are some similarities with the `sortAndMerge` procedure in Tpetra, so it may be possible to learn from that.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1904
Zoltan2: Use MPI_Alltoallv for all-to-all, instead of hand-rolling it from sends and receives (2017-10-25T21:59:39Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/zoltan2
See `Trilinos/packages/zoltan2/src/util/Zoltan2_AlltoAll.[ch]pp`. If necessary, add a Teuchos wrapper for `MPI_Alltoallv`.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1914
Teuchos::TimeMonitor: Use Kokkos profiling region hooks if Kokkos is available (2017-10-30T20:18:21Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/teuchos @nmhamster @jjellio
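Roughly, the idea is that the timer start/stop could also push/pop a Kokkos profiling region, so Kokkos profiling tools see the same labels as the Teuchos timers. A hedged sketch of such a guard (illustrative only, with a made-up class name; not the actual Teuchos::TimeMonitor implementation):

```c++
#include <Kokkos_Core.hpp>
#include <Teuchos_Time.hpp>

// RAII guard in the spirit of Teuchos::TimeMonitor that also opens and closes
// a Kokkos profiling region named after the timer.
class TimeMonitorWithKokkosRegion {
public:
  explicit TimeMonitorWithKokkosRegion(Teuchos::Time& timer) : timer_(timer) {
    Kokkos::Profiling::pushRegion(timer_.name());
    timer_.start();
  }
  ~TimeMonitorWithKokkosRegion() {
    timer_.stop();
    Kokkos::Profiling::popRegion();
  }
private:
  Teuchos::Time& timer_;
};
```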
See discussion in #874.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1931
Improve MueLu setup scaling for repartitioning by reducing comm_split calls (2017-11-21T21:41:16Z, James Willenbring)
*Created by: pwxy*
This was actually Bug 6346 from https://software.sandia.gov//bugzilla that was originally filed over 2.5 years ago (June 4, 2015) but conveniently failed to get transferred to github: "Bug 6346 - Improve MueLu scaling for repartitioning by reducing comm_split calls."
Unfortunately I completely forgot about this issue from over 2.5 years ago, and had to spend a lot of time tracking it down again.
Currently, when repartitioning on coarser levels occurs, there are 3 MPI_Comm_split calls: for Ac, for the coordinates, and for the null space. It seems that two of these Comm_split calls could be removed, which would improve scaling for large numbers of MPI processes (> 100,000). The Comm_split calls are the majority of the time spent rebalancing the coordinates and rebalancing the null space.
For the drekar steady-state Poisson solve for a 4.1 billion row matrix on 524,288 MPI processes on BG/Q, the times to rebalance the coordinates and the null space are the second and third most expensive items in MueLu setup (since Chebyshev smoother setup time is cheap; it is a different story for RILUK setup time for MHD problems).
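A hedged sketch of the reuse idea (a hypothetical helper using Tpetra map types; it assumes `removeEmptyProcesses()` and `replaceCommWithSubset()` behave as documented and is not MueLu's actual code): split once when restricting the rebalanced Ac map, then rebuild the coordinate and null-space maps on the already-split sub-communicator instead of splitting twice more.

```c++
#include <Teuchos_Comm.hpp>
#include <Teuchos_RCP.hpp>
#include <Tpetra_Map.hpp>

typedef Tpetra::Map<> map_type;

// Do the MPI_Comm_split only once (inside removeEmptyProcesses on the
// rebalanced Ac row map), then reuse the resulting sub-communicator for the
// coordinate and null-space maps.
void rebalanceMapsOnce(const Teuchos::RCP<const map_type>& rebalancedAcRowMap,
                       Teuchos::RCP<const map_type>& coordMap,
                       Teuchos::RCP<const map_type>& nullspaceMap)
{
  // One MPI_Comm_split happens here; the result is null on ranks with no rows.
  Teuchos::RCP<const map_type> newRowMap = rebalancedAcRowMap->removeEmptyProcesses();
  Teuchos::RCP<const Teuchos::Comm<int> > subComm =
      newRowMap.is_null() ? Teuchos::null : newRowMap->getComm();

  // No further MPI_Comm_split calls: just restrict the other maps to subComm.
  if (!coordMap.is_null())     coordMap     = coordMap->replaceCommWithSubset(subComm);
  if (!nullspaceMap.is_null()) nullspaceMap = nullspaceMap->replaceCommWithSubset(subComm);
}
```

Whether MueLu's RebalanceTransferFactory can be restructured to pass the sub-communicator around this way is exactly the open question here.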
Stack trace for Comm_split call to rebalance coordinates (line numbers are from trilinos source over 2.5 years ago):
Teuchos_DefaultMpiComm.hpp 1729 (current dev Trilinos line 1663 Teuchos::MpiComm::split)
Tpetra_Map_def.hpp 1151 (current dev Trilinos line 1787 Tpetra::Map::removeEmptyProcesses)
Xpetra_TpetraMap.hpp 226 (current dev Trilinos line 226 Xpetra::TpetraMap::removeEmptyProcesses)
MueLu_RebalanceTransferFactory_def.hpp 256 (current dev Trilinos line 260 MueLu::RebalanceTransferFactory::Build)
MueLu_TwoLevelFactoryBase.hpp 153
MueLu_Level.hpp 203
MueLu_HierarchyHelpers_def.hpp 89
MueLu_Hierarchy_def.hpp 305
MueLu_HierarchyManager.hpp 197
MueLu_ParameterListInterpreter_def.hpp 1203
MueLu_CreateTpetraPreconditioner.hpp 146
Thyra_MueLuTpetraPreconditionerFactory_def.hpp 194
NOX_Thyra_Group.C 776
NOX_Thyra_Group.C 647
NOX_Thyra_Group.C 544
Stack trace for Comm_split call to rebalance null space (line numbers are from trilinos source over 2.5 years ago):
Teuchos_DefaultMpiComm.hpp 1729
Tpetra_Map_def.hpp 1151
Xpetra_TpetraMap.hpp 226
MueLu_RebalanceTransferFactory_def.hpp 275
MueLu_TwoLevelFactoryBase.hpp 153
MueLu_Level.hpp 203
MueLu_HierarchyHelpers_def.hpp 89
MueLu_Hierarchy_def.hpp 305
MueLu_HierarchyManager.hpp 197
MueLu_ParameterListInterpreter_def.hpp 1203
MueLu_CreateTpetraPreconditioner.hpp 146
Thyra_MueLuTpetraPreconditionerFactory_def.hpp 194
NOX_Thyra_Group.C 776
NOX_Thyra_Group.C 647
NOX_Thyra_Group.C 544
Stack trace for Comm_split call to rebalance Ac (line numbers are from trilinos source over 2.5 years ago):
Teuchos_DefaultMpiComm.hpp 1729
Tpetra_Map_def.hpp 1151
Tpetra_KokkosRefactor_CrsMatrix_def.hpp 7232
Tpetra_KokkosRefactor_CrsMatrix_def.hpp 7632
Tpetra_CrsMatrix_decl.hpp 2665
Xpetra_CrsMatrixFactory.hpp 447
Xpetra_MatrixFactory.hpp 125
MueLu_RebalanceAcFactory_def.hpp 105
MueLu_TwoLevelFactoryBase.hpp 153
MueLu_Level.hpp 203
MueLu_HierarchyHelpers_def.hpp 108
MueLu_Hierarchy_def.hpp 305
MueLu_HierarchyManager.hpp 197
MueLu_ParameterListInterpreter_def.hpp 1203
MueLu_CreateTpetraPreconditioner.hpp 146