Trilinos issues — https://gitlab.osti.gov/jmwille/Trilinos/-/issues

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/4836
Tempus: Improve Interval Output Setup (2019-04-11, James Willenbring)
*Created by: ccober6*

In the output interval and the screen output interval, all the intervals are placed in a vector along with the specified output times or indices, and then sorted. This allows all the output times and indices to live in one vector and be searched. However, when the number of time steps reaches millions, the sort is far too expensive. This issue will rework the interval specification so that it is not part of the vector or the sorting.
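One way the rework could look — a minimal sketch (hypothetical names, not Tempus's API) where the interval is checked in O(1) and only the explicitly specified indices remain in a small sorted vector:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch: decide whether step `index` produces output without
// merging every interval-generated index into the sorted vector of explicit
// indices. The interval check is arithmetic; only explicit indices are stored.
bool isOutputIndex(int index, int outputIndexInterval,
                   const std::vector<int>& explicitIndices) {
  // O(1) modulus check replaces inserting every multiple into a vector.
  if (outputIndexInterval > 0 && index % outputIndexInterval == 0) return true;
  // Explicit indices stay in their own (already sorted) small vector.
  return std::binary_search(explicitIndices.begin(), explicitIndices.end(),
                            index);
}
```

The same idea applies to output times, with a floating-point tolerance on the interval check.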
@trilinos/tempus
## Expectations
Work the same as previously, but without the costly sorting.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1997
MueLu: Move PerfUtils::PrintMatrixInfo stuff to Extreme output (rather than high) (2017-11-21, James Willenbring)
*Created by: csiefer2*

MPI-P results suggest that these calls represent around 20% of the total number of call locations in the MueLu driver (setup only) when run with high verbosity --- we get load balance and comm info for P and Ac.
For example:

```
P size = 1700 x 204
P Load balancing info
P   # active processes: 2/2
P   # rows per proc   : avg = 8.50e+02, dev = 0.0%, min = +0.0%, max = +0.0%
P   # nnz per proc    : avg = 2.40e+03, dev = 0.0%, min = +0.0%, max = +0.0%
P Communication info
P   # num export send : avg = 0.00, dev = 0.00, min = 0.0 , max = 0.0
P   # num import send : avg = 1.70e+01, dev = 0.0%, min = +0.0%, max = +0.0%
P   # num msgs        : avg = 1.00, dev = 0.00, min = 1.0 , max = 1.0
P   # min msg size    : avg = 1.70e+01, dev = 0.0%, min = +0.0%, max = +0.0%
P   # max msg size    : avg = 1.70e+01, dev = 0.0%, min = +0.0%, max = +0.0%
```
Can we just limit this output to verbosity `extreme` by default?

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1931
Improve MueLu setup scaling for repartitioning by reducing comm_split calls (2017-11-21, James Willenbring)
*Created by: pwxy*

This was actually Bug 6346 from https://software.sandia.gov//bugzilla, originally filed over 2.5 years ago (June 4, 2015), which conveniently failed to get transferred to GitHub: "Bug 6346 - Improve MueLu scaling for repartitioning by reducing comm_split calls."
Unfortunately I completely forgot about this issue from over 2.5 years ago, and had to spend a lot of time to track down this issue again.
Currently, when repartitioning occurs on coarser levels, there are 3 MPI_Comm_split calls: one for Ac, one for the coordinates, and one for the null space. It seems that two of these Comm_split calls could be removed, which would improve scaling for large numbers of MPI processes (> 100,000). The Comm_split calls account for the majority of the time spent rebalancing the coordinates and the null space.
For the drekar steady-state Poisson solve of a 4.1 billion row matrix on 524,288 MPI processes on BG/Q, the times to rebalance the coordinates and the null space are the second and third most expensive items in MueLu setup (since Chebyshev smoother setup time is cheap; it is a different story for RILUK setup time for MHD problems).
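The proposed fix amounts to computing the expensive split once per set of active ranks and reusing it for Ac, the coordinates, and the null space. A non-MPI sketch of that caching pattern (all names illustrative; an `int` id stands in for the sub-communicator):

```cpp
#include <map>
#include <set>

// Hypothetical sketch: memoize the result of an expensive "split" (modeling
// MPI_Comm_split / removeEmptyProcesses) keyed by the set of active ranks, so
// the three rebalance factories share one split instead of issuing three.
using RankSet = std::set<int>;

struct SplitCache {
  std::map<RankSet, int> cache;  // active-rank set -> sub-communicator id
  int nextId = 0;
  int splits = 0;                // counts how often the expensive call runs

  int getSubComm(const RankSet& activeRanks) {
    auto it = cache.find(activeRanks);
    if (it != cache.end()) return it->second;  // reuse: no new split
    ++splits;                                  // model one MPI_Comm_split
    return cache[activeRanks] = nextId++;
  }
};
```

In MueLu terms, the cached object would live on the level so that RebalanceAcFactory and both RebalanceTransferFactory invocations see the same sub-communicator.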
Stack trace for the Comm_split call to rebalance coordinates (line numbers are from Trilinos source over 2.5 years ago):

```
Teuchos_DefaultMpiComm.hpp 1729 (current dev Trilinos line 1663 Teuchos::MpiComm::split)
Tpetra_Map_def.hpp 1151 (current dev Trilinos line 1787 Tpetra::Map::removeEmptyProcesses)
Xpetra_TpetraMap.hpp 226 (current dev Trilinos line 226 Xpetra::TpetraMap::removeEmptyProcesses)
MueLu_RebalanceTransferFactory_def.hpp 256 (current dev Trilinos line 260 MueLu::RebalanceTransferFactory::Build)
MueLu_TwoLevelFactoryBase.hpp 153
MueLu_Level.hpp 203
MueLu_HierarchyHelpers_def.hpp 89
MueLu_Hierarchy_def.hpp 305
MueLu_HierarchyManager.hpp 197
MueLu_ParameterListInterpreter_def.hpp 1203
MueLu_CreateTpetraPreconditioner.hpp 146
Thyra_MueLuTpetraPreconditionerFactory_def.hpp 194
NOX_Thyra_Group.C 776
NOX_Thyra_Group.C 647
NOX_Thyra_Group.C 544
```
Stack trace for the Comm_split call to rebalance the null space (line numbers are from Trilinos source over 2.5 years ago):

```
Teuchos_DefaultMpiComm.hpp 1729
Tpetra_Map_def.hpp 1151
Xpetra_TpetraMap.hpp 226
MueLu_RebalanceTransferFactory_def.hpp 275
MueLu_TwoLevelFactoryBase.hpp 153
MueLu_Level.hpp 203
MueLu_HierarchyHelpers_def.hpp 89
MueLu_Hierarchy_def.hpp 305
MueLu_HierarchyManager.hpp 197
MueLu_ParameterListInterpreter_def.hpp 1203
MueLu_CreateTpetraPreconditioner.hpp 146
Thyra_MueLuTpetraPreconditionerFactory_def.hpp 194
NOX_Thyra_Group.C 776
NOX_Thyra_Group.C 647
NOX_Thyra_Group.C 544
```
Stack trace for the Comm_split call to rebalance Ac (line numbers are from Trilinos source over 2.5 years ago):

```
Teuchos_DefaultMpiComm.hpp 1729
Tpetra_Map_def.hpp 1151
Tpetra_KokkosRefactor_CrsMatrix_def.hpp 7232
Tpetra_KokkosRefactor_CrsMatrix_def.hpp 7632
Tpetra_CrsMatrix_decl.hpp 2665
Xpetra_CrsMatrixFactory.hpp 447
Xpetra_MatrixFactory.hpp 125
MueLu_RebalanceAcFactory_def.hpp 105
MueLu_TwoLevelFactoryBase.hpp 153
MueLu_Level.hpp 203
MueLu_HierarchyHelpers_def.hpp 108
MueLu_Hierarchy_def.hpp 305
MueLu_HierarchyManager.hpp 197
MueLu_ParameterListInterpreter_def.hpp 1203
MueLu_CreateTpetraPreconditioner.hpp 146
```
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1914
Teuchos::TimeMonitor: Use Kokkos profiling region hooks if Kokkos is available (2017-10-30, James Willenbring)
*Created by: mhoemmen*

@trilinos/teuchos @nmhamster @jjellio
See discussion in #874.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1904
Zoltan2: Use MPI_Alltoallv for all-to-all, instead of hand-rolling it from sends and receives (2017-10-25, James Willenbring)
*Created by: mhoemmen*
@trilinos/zoltan2
See `Trilinos/packages/zoltan2/src/util/Zoltan2_AlltoAll.[ch]pp`. If necessary, add a Teuchos wrapper for `MPI_Alltoallv`.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1743
MueLu: Proposal: use special form of the tentative matrix for smoothed prolongator construction (2017-09-27, James Willenbring)
*Created by: aprokop*
@trilinos/muelu @jhux2
## Summary
The tentative factory produces a matrix of a special form. This special form can be used to speed up the matrix-matrix multiply. For example, when the nullspace is a constant vector, multiplying by the tentative matrix is equivalent to averaging columns of the matrix A. In addition, the tentative matrix can be stored in its "multivector" form, with columns compacted into multivectors, as the aggregates do not overlap.
## Description
Let us first start with the situation where we *do* construct the QR factorization.
In this case, let us store the constructed Q factors in a single multivector. The multivector is based on the same map that the current Ptent matrix uses, i.e. the row map of A. The multiplication of A by P is similar to a matrix-vector multiplication, with the main difference that instead of summing products into a single value, we put product pairs into bins corresponding to aggregates. To do that, we need to be able to translate local column ids to global aggregate ids (ids in the `coarseMap`). This may require an import of an integer vector.
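A serial sketch of the binning procedure described above (CSR inputs; all names hypothetical, and the Q factor is taken as one value per fine row for simplicity):

```cpp
#include <vector>

// Hypothetical sketch of the "bin into aggregates" product A * Ptent, where
// Ptent is stored as one value per fine row (qvals) plus a map from local
// column id to aggregate (coarse) id. A is in CSR form (rowptr/colind/vals).
std::vector<std::vector<double>>
applyAtimesPtent(const std::vector<int>& rowptr, const std::vector<int>& colind,
                 const std::vector<double>& vals,
                 const std::vector<double>& qvals, const std::vector<int>& aggId,
                 int numAggregates) {
  int numRows = (int)rowptr.size() - 1;
  std::vector<std::vector<double>> result(
      numRows, std::vector<double>(numAggregates, 0.0));
  for (int i = 0; i < numRows; ++i)
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
      int j = colind[k];
      // Instead of summing into a single scalar (as in a matvec), put the
      // product into the bin of the aggregate that owns column j.
      result[i][aggId[j]] += vals[k] * qvals[j];
    }
  return result;
}
```

The dense per-row bins would of course be compressed afterwards; this only illustrates the binning step.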
## Q&A
*What changes if we do not want to do QR decomposition?*
In this case, the TentativePFactory will set the "P" multivector to be the nullspace.
*What is the expected performance improvement?*
Ideally, we would skip the global tentative matrix construction, and instead just construct a local multivector and then do an import (and, if the domain map of A is the same as the row map, we already have the Import object for it). In the SaPFactory, the hope is that the matrix-vector-multiply-like procedure is about 2-3 times as expensive as a regular matvec and significantly cheaper than the full matrix-matrix multiplication. We should be able to do it using just local indices and parallelize over threads. There will be an additional cost for compression, as we don't know the number of nonzeros per row in the final matrix ahead of time.
*How to optimally do the matrix-vector like procedure?*
An open question. There are some similarities with the `sortAndMerge` procedure in Tpetra, so it may be possible to learn from that.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1739
Tpetra MV putScalar does not thread scale and is less performant than scaling a MV by a constant (2017-09-15, James Willenbring)
*Created by: jjellio*

This is an issue to track a performance problem identified on a Cray/KNL machine.
TL;DR: putScalar on a block of multivectors takes over twice as long as computing an inner product or vector update on the same block of vectors. Furthermore, putScalar takes substantially longer than scale, yet they should scale similarly.
I've constructed a small slide deck with details (email me if you would like to see it).
To perform the test, I use Belos' MultiVectorTraits and profile MvInit, MvScale, Update, and Inner Product. I gather data for block sizes of 1 to 100. These are effectively the building blocks of a GMRES solver.
Timers have the format: "MVT::Operation:<block_size>"
There seem to be two issues:
1) putScalar performance is much slower than scale.
2) putScalar scales more poorly when hardware threads are enabled.
Under a profiler, the time in putScalar is attributed to Kokkos::deep_copy(); the timings below are from production runs.
For example:
Running 64 processes per KNL node, with 1 HT per core enabled
(64x1x1):
Operation | MinOverProcs
------------------------|-----------
MVT::InnerProduct::100 | 4.042
MVT::MVScale::100 | 0.000998
MVT::MvInit::100 | 9.146
MVT::Update::100 | 4.101
Running 64 processes per KNL node, with 4 HT per core enabled
(64x1x4)
Operation | MinOverProcs
------------------------|-----------
MVT::InnerProduct::100 | 4.122
MVT::MVScale::100 | 0.001268
MVT::MvInit::100 | 55.53
MVT::Update::100 | 4.138
(full 64x1x1 data)
Operation | MinOverProcs
------------------------|-----------
MVT::InnerProduct::1 | 0.09228
MVT::InnerProduct::10 | 0.4413
MVT::InnerProduct::100 | 4.042
MVT::InnerProduct::15 | 0.6721
MVT::InnerProduct::2 | 0.1694
MVT::InnerProduct::20 | 0.8192
MVT::InnerProduct::25 | 1.028
MVT::InnerProduct::3 | 0.1989
MVT::InnerProduct::30 | 1.244
MVT::InnerProduct::35 | 1.445
MVT::InnerProduct::4 | 0.1983
MVT::InnerProduct::40 | 1.593
MVT::InnerProduct::45 | 1.823
MVT::InnerProduct::5 | 0.2641
MVT::InnerProduct::50 | 2
MVT::InnerProduct::6 | 0.2976
MVT::InnerProduct::60 | 2.394
MVT::InnerProduct::7 | 0.3759
MVT::InnerProduct::70 | 2.813
MVT::InnerProduct::8 | 0.3371
MVT::InnerProduct::80 | 3.189
MVT::InnerProduct::9 | 0.4084
MVT::InnerProduct::90 | 3.625
------------------------|-----------
MVT::MVScale::1 | 0.0008667
MVT::MVScale::10 | 0.0009456
MVT::MVScale::100 | 0.000998
MVT::MVScale::15 | 0.000947
MVT::MVScale::2 | 0.0009308
MVT::MVScale::20 | 0.0009654
MVT::MVScale::25 | 0.0009418
MVT::MVScale::3 | 0.0009527
MVT::MVScale::30 | 0.001003
MVT::MVScale::35 | 0.000972
MVT::MVScale::4 | 0.0009472
MVT::MVScale::40 | 0.000983
MVT::MVScale::45 | 0.0009711
MVT::MVScale::5 | 0.000968
MVT::MVScale::50 | 0.0009766
MVT::MVScale::6 | 0.0009365
MVT::MVScale::60 | 0.0009604
MVT::MVScale::7 | 0.0009186
MVT::MVScale::70 | 0.0009918
MVT::MVScale::8 | 0.0009317
MVT::MVScale::80 | 0.001002
MVT::MVScale::9 | 0.0009496
MVT::MVScale::90 | 0.001032
------------------------|-----------
MVT::MvInit::1 | 0.06442
MVT::MvInit::10 | 0.6943
MVT::MvInit::100 | 9.146
MVT::MvInit::15 | 0.9989
MVT::MvInit::2 | 0.2652
MVT::MvInit::20 | 1.32
MVT::MvInit::25 | 1.631
MVT::MvInit::3 | 0.3155
MVT::MvInit::30 | 1.923
MVT::MvInit::35 | 2.285
MVT::MvInit::4 | 0.4194
MVT::MvInit::40 | 2.645
MVT::MvInit::45 | 3.044
MVT::MvInit::5 | 0.4686
MVT::MvInit::50 | 4.089
MVT::MvInit::6 | 0.5292
MVT::MvInit::60 | 5.047
MVT::MvInit::7 | 0.5619
MVT::MvInit::70 | 6.44
MVT::MvInit::8 | 0.612
MVT::MvInit::80 | 7.344
MVT::MvInit::9 | 0.645
MVT::MvInit::90 | 8.386
------------------------|-----------
MVT::Update::1 | 0.1297
MVT::Update::10 | 0.4311
MVT::Update::100 | 4.101
MVT::Update::15 | 0.618
MVT::Update::2 | 0.1138
MVT::Update::20 | 0.8033
MVT::Update::25 | 1.031
MVT::Update::3 | 0.1424
MVT::Update::30 | 1.226
MVT::Update::35 | 1.406
MVT::Update::4 | 0.1721
MVT::Update::40 | 1.608
MVT::Update::45 | 1.838
MVT::Update::5 | 0.2438
MVT::Update::50 | 2.025
MVT::Update::6 | 0.271
MVT::Update::60 | 2.401
MVT::Update::7 | 0.2982
MVT::Update::70 | 2.853
MVT::Update::8 | 0.331
MVT::Update::80 | 3.232
MVT::Update::9 | 0.4017
MVT::Update::90 | 3.692
(full 64x1x4 data)
Operation | MinOverProcs
------------------------|-----------
MVT::InnerProduct::1 | 0.101
MVT::InnerProduct::10 | 0.4407
MVT::InnerProduct::100 | 4.122
MVT::InnerProduct::15 | 0.6858
MVT::InnerProduct::2 | 0.2098
MVT::InnerProduct::20 | 0.8349
MVT::InnerProduct::25 | 1.042
MVT::InnerProduct::3 | 0.2041
MVT::InnerProduct::30 | 1.246
MVT::InnerProduct::35 | 1.467
MVT::InnerProduct::4 | 0.2042
MVT::InnerProduct::40 | 1.618
MVT::InnerProduct::45 | 1.865
MVT::InnerProduct::5 | 0.265
MVT::InnerProduct::50 | 2.048
MVT::InnerProduct::6 | 0.2972
MVT::InnerProduct::60 | 2.446
MVT::InnerProduct::7 | 0.3803
MVT::InnerProduct::70 | 2.871
MVT::InnerProduct::8 | 0.3409
MVT::InnerProduct::80 | 3.243
MVT::InnerProduct::9 | 0.4084
MVT::InnerProduct::90 | 3.687
------------------------|-----------
MVT::MVScale::1 | 0.001116
MVT::MVScale::10 | 0.001141
MVT::MVScale::100 | 0.001268
MVT::MVScale::15 | 0.001144
MVT::MVScale::2 | 0.001169
MVT::MVScale::20 | 0.001157
MVT::MVScale::25 | 0.001182
MVT::MVScale::3 | 0.00119
MVT::MVScale::30 | 0.001184
MVT::MVScale::35 | 0.001155
MVT::MVScale::4 | 0.001174
MVT::MVScale::40 | 0.001208
MVT::MVScale::45 | 0.001172
MVT::MVScale::5 | 0.00122
MVT::MVScale::50 | 0.00117
MVT::MVScale::6 | 0.00115
MVT::MVScale::60 | 0.001222
MVT::MVScale::7 | 0.001136
MVT::MVScale::70 | 0.001207
MVT::MVScale::8 | 0.00117
MVT::MVScale::80 | 0.001201
MVT::MVScale::9 | 0.001175
MVT::MVScale::90 | 0.001189
------------------------|-----------
MVT::MvInit::1 | 0.06317
MVT::MvInit::10 | 0.6489
MVT::MvInit::100 | 55.53
MVT::MvInit::15 | 0.9798
MVT::MvInit::2 | 0.2715
MVT::MvInit::20 | 1.291
MVT::MvInit::25 | 1.629
MVT::MvInit::3 | 0.3176
MVT::MvInit::30 | 1.975
MVT::MvInit::35 | 2.268
MVT::MvInit::4 | 0.3524
MVT::MvInit::40 | 4.683
MVT::MvInit::45 | 13.34
MVT::MvInit::5 | 0.4013
MVT::MvInit::50 | 22.06
MVT::MvInit::6 | 0.4454
MVT::MvInit::60 | 32.64
MVT::MvInit::7 | 0.4939
MVT::MvInit::70 | 38.67
MVT::MvInit::8 | 0.5283
MVT::MvInit::80 | 44.46
MVT::MvInit::9 | 0.5911
MVT::MvInit::90 | 50.02
------------------------|-----------
MVT::Update::1 | 0.0894
MVT::Update::10 | 0.4455
MVT::Update::100 | 4.138
MVT::Update::15 | 0.6346
MVT::Update::2 | 0.1286
MVT::Update::20 | 0.8219
MVT::Update::25 | 1.045
MVT::Update::3 | 0.1589
MVT::Update::30 | 1.236
MVT::Update::35 | 1.425
MVT::Update::4 | 0.1867
MVT::Update::40 | 1.615
MVT::Update::45 | 1.846
MVT::Update::5 | 0.2577
MVT::Update::50 | 2.049
MVT::Update::6 | 0.2854
MVT::Update::60 | 2.449
MVT::Update::7 | 0.3129
MVT::Update::70 | 2.868
MVT::Update::8 | 0.3438
MVT::Update::80 | 3.27
MVT::Update::9 | 0.4171
MVT::Update::90 | 3.713
@trilinos/tpetra @trilinos/kokkos-kernels
@mhoemmen @crtrott @tjfulle
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1727
Cori KNL: best practices (2017-09-27, James Willenbring)
*Created by: jhux2*

I am opening this issue to record best practices for achieving good performance on the Cori KNL partition. @jjellio Could I ask you to document your recommendations in this ticket? I've had a number of questions regarding this. Thanks!
@sayerhs @aprokop @alanw0 @spdomin

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1676
MueLu: Proposal: merge CoalesceDrop and FilteredA factories (2017-09-29, James Willenbring)
*Created by: aprokop*
@trilinos/muelu @jhux2 @tawiesn
**Summary**
Right now both factories do very similar things. Each has a loop that goes through a matrix and filters some entries. The proposal combines those loops into one, thus speeding up execution.
**Description**
The current `CoalesceDropFactory` is responsible for a single thing: constructing an (amalgamated) filtered graph that is later used in the aggregation. For the filtered scenario (when the drop tolerance is not 0) it has a loop that goes through the rows one by one, and for each row entry determines whether to drop it (based on the original matrix or the distance laplacian). It simultaneously constructs the compressed graph (LWGraph).
The construction of the filtered matrix in the `FilteredAFactory` has two variants: 1) reusing the graph of the original matrix and zeroing out entries; and 2) constructing a brand-new matrix with the compressed graph. In both cases, the loop goes through the rows and uses a `filter` array that helps determine which matrix values to drop/zero out.
The loop in the `FilteredAFactory` is remarkably similar to the loop in the `CoalesceDropFactory`. In fact, the only difference is that the latter constructs only rows and columns, while the former also constructs values. The work is duplicated, and it is in fact even more expensive to go through the matrix a second time, as we have to determine *again* which entries to filter.
**Proposed solution**
In the proposed solution, the `CoalesceDropFactory` (or its renamed version) will construct both `LWGraph` that is used in aggregation **and** the filtered matrix if desired. The `FilteredAFactory` goes away. The filtering loop in the factory will construct rows, columns, and values. For the block variant it will also construct coalesced rows and columns.
**Benefits**
Looping through the level matrix is done once potentially achieving significant speedup, especially when reusing matrix graph.
This would benefit applications that use MueLu with filtering.
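A serial, single-pass sketch of the merged loop (hypothetical names; the real dropping criterion, classical or distance-laplacian, is more involved than the threshold used here):

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch of the merged factory: one loop over the rows of A
// decides which entries to drop and emits both the filtered graph (for
// aggregation) and, optionally, the filtered values in the same pass.
struct FilteredResult {
  std::vector<int> rowptr, colind;  // graph used for aggregation
  std::vector<double> vals;         // filtered matrix values (if requested)
};

FilteredResult filterOnePass(const std::vector<int>& Arowptr,
                             const std::vector<int>& Acolind,
                             const std::vector<double>& Avals, double dropTol,
                             bool buildValues) {
  FilteredResult out;
  out.rowptr.push_back(0);
  int numRows = (int)Arowptr.size() - 1;
  for (int i = 0; i < numRows; ++i) {
    for (int k = Arowptr[i]; k < Arowptr[i + 1]; ++k) {
      // Keep diagonal entries and entries above the drop tolerance.
      bool keep = (Acolind[k] == i) || (std::abs(Avals[k]) > dropTol);
      if (!keep) continue;
      out.colind.push_back(Acolind[k]);
      if (buildValues) out.vals.push_back(Avals[k]);  // same pass, no re-scan
    }
    out.rowptr.push_back((int)out.colind.size());
  }
  return out;
}
```

The point of the merge is that the `keep` decision is made exactly once per entry, instead of once in `CoalesceDropFactory` and again in `FilteredAFactory`.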
**Q&A**
*Are there issues with block systems?*
So far, I don't see any issues. Block systems are treated by first filtering a block row and then coalescing it. Constructing the filtered matrix is independent of the coalescing, though both depend on the filtering.
*What about a special non-lightweight graph branch in `CoalesceDropFactory`*?
Tricky question. I don't have a good answer as I don't know what that does and what is special about it. @tawiesn ?
*Does it benefit Kokkos version?*
It should benefit the Kokkos version in the same way. In fact, it could be merged with another idea where we do not even construct the compressed graph for `LWGraph_kokkos` but rather provide a wrapper around the graph of the original matrix. This would eliminate the second loop in the current `CoalesceDropFactory_kokkos`.
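A minimal sketch of that wrapper idea (illustrative names, not MueLu's API; CSR storage assumed): keep A's graph plus a per-entry "kept" mask, and let aggregation iterate over surviving neighbors through the view instead of building a compressed copy.

```cpp
#include <vector>

// Hypothetical wrapper around the original matrix graph: no second,
// compressed graph is ever constructed; dropped entries are masked out.
struct FilteredGraphView {
  const std::vector<int>* rowptr;
  const std::vector<int>* colind;
  std::vector<char> kept;  // one flag per entry of A's graph

  template <class F>
  void forEachNeighbor(int row, F f) const {
    for (int k = (*rowptr)[row]; k < (*rowptr)[row + 1]; ++k)
      if (kept[k]) f((*colind)[k]);  // skip dropped entries on the fly
  }
};
```

The trade-off is an extra branch per entry during aggregation in exchange for skipping the compression pass entirely.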
*What are the steps to implement this?*
I will start implementing it in the kokkos-refactor branch of MueLu for the scalar case. This can be done independently from the default branch by modifying the dependency tree in the `ParameterListInterpreter`. If the results demonstrate the feasibility and speedup of this approach, we could discuss backporting it to the non-kokkos version.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1646
MueLu::ParameterListInterpreter::ParameterListInterpreter seems to take a lot more time than expected (2017-12-22, James Willenbring)
*Created by: pwxy*

MueLu::CreateXpetraPreconditioner calls both MueLu::HierarchyManager::SetupHierarchy and MueLu::ParameterListInterpreter::ParameterListInterpreter.
I'm running a Chebyshev smoother, and I thought that the smoothers were being constructed by MueLu::HierarchyManager::SetupHierarchy rather than MueLu::ParameterListInterpreter::ParameterListInterpreter. However, I saw that several levels down the stack trace, MueLu::ParameterListInterpreter::ParameterListInterpreter calls MueLu::ParameterListInterpreter::UpdateFactoryManager_Smoothers, which for presmoothing calls:
```C++
preSmoother = rcp(new SmootherFactory(rcp(new TrilinosSmoother(preSmootherType, preSmootherParams, overlap))));
```
So it appears that MueLu::ParameterListInterpreter::ParameterListInterpreter is constructing the smoothers?
I've noticed that the time spent in MueLu::ParameterListInterpreter::ParameterListInterpreter is rather large, and most of this time is due to the construction of the presmoother (I have both pre- and post-smoothing).

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1613
Issues with TentativePFactory for "use kokkos refactor"=true on KNL? (2017-09-11, James Willenbring)
*Created by: pwxy*
I built drekar for develop Trilinos (cloned repo early Mon Aug 14 morning).
I built on ellis for KNL with the Intel 17 compiler, with the following cmake options:
```
-D Tpetra_ENABLE_MMM_Timings=ON \
-D MueLu_ENABLE_Experimental:BOOL=ON \
-D MueLu_ENABLE_Kokkos_Refactor:BOOL=ON \
-D Xpetra_ENABLE_Experimental:BOOL=ON \
-D Xpetra_ENABLE_Kokkos_Refactor:BOOL=ON \
```
The MueLu setup time for the drekar run was 14.6 sec for 1 MPI rank with 16 OMP threads.
I then added:
`<Parameter name="use kokkos refactor" type="bool" value="true"/>`
to the MueLu parameter list, and the MueLu setup time increased to 3557 s, i.e. 244x slower.
Here are all the MueLu setup timers that are over 100s:
```
3557.00000 MueLu: N5MueLu9HierarchyIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Setup (total)
3557.00000 MueLu: N5MueLu9HierarchyIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Setup (total, level=1)
3557.00000 MueLu: N5MueLu10RAPFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Computing Ac (total)
3557.00000 MueLu: N5MueLu18RepartitionFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total)
3557.00000 MueLu: N5MueLu24RebalanceTransferFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total, level=1)
3557.00000 MueLu: N5MueLu18RepartitionFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total, level=1)
3557.00000 MueLu: N5MueLu24RebalanceTransferFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total)
3556.00000 MueLu: N5MueLu10RAPFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Computing Ac (total, level=1)
3554.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total)
3554.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total, level=1)
3391.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Stage 1 (LocalQR) (sub, total)
3391.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (level=1)
3391.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build
3391.00000 MueLu: N5MueLu24TentativePFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Stage 1 (LocalQR) (sub, total, level=1)
163.10000 MueLu: N5MueLu34UncoupledAggregationFactory_kokkosIixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total)
163.00000 MueLu: N5MueLu34UncoupledAggregationFactory_kokkosIixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total, level=0)
161.10000 MueLu: N5MueLu26CoalesceDropFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total)
161.10000 MueLu: N5MueLu26CoalesceDropFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (total, level=0)
160.90000 MueLu: N5MueLu26CoalesceDropFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build (level=0)
160.90000 MueLu: N5MueLu26CoalesceDropFactory_kokkosIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6OpenMPENS1_9HostSpaceEEEEE: Build
```
Edit (@aprokop): added quoting
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1455
Tpetra::MultiVector::Dot() slower than Epetra for short vectors, faster for long vectors (2017-10-30, James Willenbring)
*Created by: CamelliaDPG*
The driver pasted below allows CLI arguments to control the length of two vectors that are dotted together, as well as the number of such dot products that should be done. It does this with both `Tpetra::MultiVector` and `Epetra_Vector` objects, with a serial distribution, and reports the cumulative time for each of Epetra and Tpetra. Here are the timing results from a run on a shared Intel Xeon E7-4880 machine (times in seconds):
numDots | globalRowCount | Epetra | Tpetra | Slowdown
--------- | -----------------| -------| --------| --------
1000000 | 50 | 0.43 | 8.85 | 21x
100000 | 500 | 0.11 | 2.86 | 26x
10000 | 5000 | 0.07 | 2.25 | 32x
1000 | 50000 | 0.08 | 2.20 | 28x
100 | 500000 | 0.07 | 2.17 | 31x
10 | 5000000 | 0.15 | 2.29 | 15x
1 | 50000000 | 0.15 | 2.29 | 15x
Can this be addressed? From some profiling, it appears that part of the story involves calls to `getVector()`, which result in new Tpetra Vector objects being created in each dot product (a shallow copy). That is probably why the slowdown is 15x for larger vectors but as high as 32x for smaller vectors: with smaller vectors you see the cost of the object creation and destruction. It appears, though, that the calls to `getVector()` are only about half the story: there is still a 15x slowdown to account for.
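The per-call object-creation pattern the profiling points at can be illustrated generically (plain C++, not Tpetra's API; the wrapper here is trivial, so this shows the shape of the fix rather than the measured cost):

```cpp
#include <numeric>
#include <vector>

// Hypothetical stand-in for the shallow-copy Vector that getVector() builds.
struct VectorView {
  const std::vector<double>* data;
  double dot(const VectorView& other) const {
    return std::inner_product(data->begin(), data->end(), other.data->begin(),
                              0.0);
  }
};

// Pattern observed in the profile: a view is constructed (and destroyed)
// inside every dot product.
double dotsWithPerCallViews(const std::vector<double>& a,
                            const std::vector<double>& b, int numDots) {
  double last = 0.0;
  for (int i = 0; i < numDots; ++i) {
    VectorView av{&a}, bv{&b};  // per-call construction, the suspected cost
    last = av.dot(bv);
  }
  return last;
}

// Candidate fix: hoist the view construction out of the hot loop.
double dotsWithHoistedViews(const std::vector<double>& a,
                            const std::vector<double>& b, int numDots) {
  VectorView av{&a}, bv{&b};  // built once, reused for every dot
  double last = 0.0;
  for (int i = 0; i < numDots; ++i) last = av.dot(bv);
  return last;
}
```

Both variants return identical results; only the allocation behavior inside the loop differs.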
```C++
// Epetra
#include <Epetra_Map.h>
#include <Epetra_SerialComm.h>
#include <Epetra_Vector.h>
// Tpetra
#include <Tpetra_MultiVector.hpp>
// Teuchos
#include <Teuchos_Comm.hpp>
#include <Teuchos_RCP.hpp>
#include <Teuchos_TimeMonitor.hpp>
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
using namespace Teuchos;
typedef long long GlobalOrdinal; // signed types preferred
typedef int LocalOrdinal; // signed types preferred
typedef double Scalar;
typedef Tpetra::Map<LocalOrdinal, GlobalOrdinal>::node_type NodeType;
typedef Tpetra::Map<LocalOrdinal,GlobalOrdinal,NodeType> TpetraMap;
typedef Tpetra::MultiVector<Scalar,LocalOrdinal,GlobalOrdinal, NodeType> TpetraMultiVector;
int main(int argc, char *argv[])
{
int numDots = 10000;
int globalRowCount = 1500;
if (argc == 2)
{
numDots = atoi(argv[1]);
cout << "setting numDots to " << numDots << endl;
}
if (argc == 3)
{
numDots = atoi(argv[1]);
globalRowCount = atoi(argv[2]);
}
cout << "numDots is " << numDots << endl;
cout << "globalRowCount is " << globalRowCount << endl;
RCP<Time> epetraTimer = TimeMonitor::getNewCounter("Epetra dot");
RCP<Time> tpetraTimer = TimeMonitor::getNewCounter("Tpetra dot");
/*** Epetra setup ***/
Epetra_SerialComm serialComm;
Epetra_Map serialMap(globalRowCount, 0, serialComm);
RCP<Epetra_Vector> a_epetra = rcp( new Epetra_Vector(serialMap) );
RCP<Epetra_Vector> b_epetra = rcp( new Epetra_Vector(serialMap) );
for (int localOrdinal=0; localOrdinal<serialMap.NumMyElements(); localOrdinal++)
{
(*a_epetra)[localOrdinal] = 2.0;
(*b_epetra)[localOrdinal] = 0.5;
}
/*** Tpetra setup ***/
RCP<Teuchos::Comm<int>> serialCommTeuchos = Teuchos::rcp( new Teuchos::SerialComm<int>() );
RCP<const TpetraMap> serialMapTpetra = rcp( new TpetraMap(globalRowCount, 0, serialCommTeuchos));
RCP<TpetraMultiVector> a_tpetra = rcp( new TpetraMultiVector(serialMapTpetra, 1)); // numVectors = 1
RCP<TpetraMultiVector> b_tpetra = rcp( new TpetraMultiVector(serialMapTpetra, 1)); // numVectors = 1
auto a_localView = a_tpetra->getLocalView<Kokkos::HostSpace>();
auto b_localView = b_tpetra->getLocalView<Kokkos::HostSpace>();
for (int localOrdinal=serialMapTpetra->getMinLocalIndex(); localOrdinal<=serialMapTpetra->getMaxLocalIndex(); localOrdinal++)
{
a_localView(localOrdinal,0) = 2.0;
b_localView(localOrdinal,0) = 0.5;
}
for (int i=0; i<numDots; i++)
{
double epetraResult;
int err;
epetraTimer->start();
epetraTimer->incrementNumCalls();
err = a_epetra->Dot(*b_epetra, &epetraResult);
epetraTimer->stop();
if (err != 0)
{
cout << "Epetra dot returned err code " << err << ". Halting execution.\n";
exit(err);
}
std::vector<Scalar> tpetraResult( 1 );
tpetraTimer->start();
tpetraTimer->incrementNumCalls();
a_tpetra->dot(*b_tpetra, tpetraResult);
tpetraTimer->stop();
}
TimeMonitor::summarize(serialCommTeuchos.ptr());
return 0;
}
```

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1340
MueLu timer summary at end of app does not agree with total preconditioner construction time (2017-05-30, James Willenbring)
*Created by: pwxy*
The MueLu timer summary (which uses Teuchos timers) at the end of drekar is wrong. Neither the Setup nor Ac times include the matrix-matrix multiply time. This is likely due to OpenMP threading and related to issue #253, if not the same issue.
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1329
Stop using MPI_Reduce_scatter (in Epetra and elsewhere) (2017-05-26T00:12:49Z, James Willenbring)
*Created by: mhoemmen*
@trilinos/xpetra @trilinos/muelu @trilinos/zoltan @trilinos/zoltan2
At least two platforms report performance and/or memory issues with `MPI_Reduce_scatter` at scale. (See Bugzilla Bug 6336, and my Tpetra comments from 09 Jan 2012.) The easiest work-around is to replace calls to that function with `MPI_Reduce` followed by `MPI_Scatter`. Tpetra has done this for a few years already, for performance reasons. It also looks like there used to be an MPICH correctness bug in `MPI_Reduce_scatter`, since Epetra and Zoltan both carry work-arounds. Zoltan always just calls `MPI_Reduce` followed by `MPI_Scatter`:
https://github.com/trilinos/Trilinos/blob/f0350316239aaf41e8e6b82378612aaedf9cc72c/packages/zoltan/src/Utilities/Communication/comm_invert_map.c
Epetra has an option to do the right thing, depending on the `REDUCE_SCATTER_BUG` macro (see line 570 of epetra/src/Epetra_MpiDistributor.cpp):
https://github.com/trilinos/Trilinos/blob/06f5eea36d0b40751717bc9ee995b9f5f1184a23/packages/epetra/src/Epetra_MpiDistributor.cpp
It would make sense to elevate that macro to a CMake option, if it is not one already.
This is relevant to the Xpetra and MueLu developers, in case they are using the Epetra back-end.
@pwxy @spdomin
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1015
InsertGlobalValues and fill complete in TpetraCrsMatrix (2017-01-24T05:30:36Z, James Willenbring)
*Created by: kvmkrao*
I used Tpetra to create a contiguous row map. The map is used to create objects for a sparse matrix of bandwidth 7 (A) and vectors (x and b). The non-zero entries in the matrix are inserted via insertGlobalValues():
```cpp
for (int i = 0; i < numMyElements; i++) {
  A->insertGlobalValues(gblRow, NumEntries, Values, Indices);
}
A->fillComplete();
```
The CrsMatrix (A) and MultiVectors are used to create the linear problem (Ax = b), which is then solved with BiCGStab (BiCGStabSolMgr) in Belos.
I looked at the time spent filling the matrix, "Belos: BiCGStabSolMgr total solve time", "Belos: Operation Op*x", and "Belos: Operation Prec*x", and observed that most of the time is spent filling the sparse matrix.
I am using Trilinos to solve the systems of equations arising from a finite volume algorithm. At each inner iteration the matrix and vector are refilled and the system is solved with Belos:
```
time step 1
inner iteration 1 //calculate the non-zero entries in the matrix and vector
call trilinos // Trilinos solves the system of equations
inner iterations 2 //calculate the non-zero entries in the matrix and vector
call trilinos // Trilinos solves the system of equations
.
time step 2
inner iteration 1 //calculate the non-zero entries in the matrix and vector
call trilinos // Trilinos solves the system of equations
```
Please let me know the procedure to reduce the time spent filling the matrix.
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/813
const of subview model in intrepid2 (2016-11-11T17:07:40Z, James Willenbring)
*Created by: bathmatt*
@eric-c-cyr @kyungjoo-kim @rppawlo
I tested some of the Kokkos refactor branch in Intrepid2. I was looking at the serial getValues inside a Kokkos parallel_for. Here is a simple code:

```cpp
struct foo {
  DynRankView<double> in, out;
  foo(DynRankView<double> in_, DynRankView<double> out_) : in(in_), out(out_) {}
  KOKKOS_INLINE_FUNCTION
  void operator()(int i) const {
    typedef Basis_HGRAD_TRI_C1_FEM Basis;
    Basis::Serial<Intrepid2::OPERATOR_VALUE>::getValues(subview(out,i,ALL()), subview(in,i,ALL()));
  }
};
```
and a hard-coded version:

```cpp
void operator()(int i) const {
  out(i,0) = 1. - in(i,0) - in(i,1);
  out(i,1) = in(i,0);
  out(i,2) = in(i,1);
}
```
Now when I call this with a 10M-length view on a single core it takes 0.09s; if I skip the subview step and hard-code the response it takes 0.03s.
From that I'm assuming that the subview creation accounts for roughly 0.06s of the run time.
This is lower than I assumed it would be, but still 70% of the cost. Not suggesting we fix this today, but we may want to in the future.
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/725
Nalu diffs due to latest Trilinos update (2016-10-27T15:57:29Z, James Willenbring)
*Created by: spdomin*
Bad test suite as of Trilinos SHA1:

```
commit 5320963fbe2ec9aeba1dc62871e4be4f44961970
Author: Tobias Wiesner <tawiesn@sandia.gov>
Date:   Wed Oct 19 14:43:41 2016 -0600
```

Last good:

- NaluCFD/Nalu SHA1: d2adbe8
- Trilinos/master SHA1: 12fd285b28abeb7336a8d4f17913819b9ce7015b
This bisect is complicated in that the builds of Trilinos fail either due to configure errors or flat-out build issues.
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/118
Performance of looping over Tpetra CrsMatrix rows (2016-06-13T22:56:17Z, James Willenbring)
*Created by: aprokop*
@trilinos/tpetra @jhux2 @mhoemmen @crtrott
Let me first admit that I am very likely doing something wrong.
I wrote a simple driver (located at muelu/test/perf_test_kokkos) which essentially finds the number of nonzeros in a CrsMatrix by looping through rows and adding up their lengths. It considers three scenarios:
- Looping through Xpetra layer abstraction (something MueLu is very interested in)
- Looping directly through Tpetra/Epetra
- Looping through the local Kokkos CrsMatrix
The results were somewhat unexpected for me. I was running with a single MPI rank with OpenMP (OMP_NUM_THREADS=1) and HWLOC disabled (so that Kokkos respects this). Here are some results:
For Tpetra
```
Loop #1: Xpetra/Tpetra 0.05980 (1)
Loop #2: Tpetra 0.05867 (1)
Loop #3: Kokkos-1 0.00274 (1)
Loop #4: Kokkos-2 0.00214 (1)
```
For Epetra
```
Loop #1: Xpetra/Epetra 0.01933 (1)
Loop #2: Epetra 0.01385 (1)
Loop #3: Kokkos-1 0.00427 (1)
Loop #4: Kokkos-2 0.00213 (1)
```
So it seems to me that using the local Kokkos matrix is absolutely the way to go, as it is ~30 times faster than going through Tpetra, and ~6 times faster than through Epetra.
I would like to know if anybody has done any performance studies like this, or what the reason could be.
If I am doing something that is completely wrong, I would also like to know that.
https://gitlab.osti.gov/jmwille/Trilinos/-/issues/115
Add an option for switching between storing block inverse or block factors storage in block Jacobi and Gauss-Seidel (2016-03-03T17:36:36Z, James Willenbring)
*Created by: aprokop*
@trilinos/ifpack2 @mhoemmen
Ifpack2 now stores and applies the inverse of the matrix's block diagonal explicitly (#96). Theoretically, this is a tradeoff: if the block size is large enough, solving with stored factors can outperform applying a full explicit inverse when doing a limited number of iterations. So the user should be able to decide which approach is appropriate.