MueLu: Performance bug in non-threaded sparse matrix-matrix multiply in Tpetra (and likely also Epetra)
Created by: mhoemmen
@trilinos/muelu @trilinos/tpetra
@sthomas61 (thanks!) discovered the following issue, and asked me to report it to GitHub:
The issue is that the estimate for the number of entries per row in Tpetra_MatrixMatrix_def.hpp is incorrect. Detailed output showed that the nnz-per-row estimates coming from Cestimate_nnz were too low. Typical values were on the order of 100 for a 50^3 Laplacian.
Comparing against ML's ML_matmat_mult function, he found what is most likely a cut-and-paste error:
Aest = (A.getNodeNumRows() > 0)? A.getNodeNumEntries()/A.getNodeNumEntries()
which is always equal to 1.
It should instead be
Aest = (A.getNodeNumRows() > 0)? A.getNodeNumEntries()/A.getNodeNumRows()
The same applies to B.
With this change, the nnz-per-row estimate is on the order of 10000, which is more reasonable.
The overall effect is to reduce the number of new memory allocations for the rows of the new matrix C = A*B. Most importantly, it results in a 25% reduction in run time for the MueLu set-up:
[sthomas1@n0292 performance]$ fgrep "Driver: 2 - MueLu Setup" MueLuNNZ.log*
MueLuNNZ.log1: Driver: 2 - MueLu Setup ==> 9.555 seconds
MueLuNNZ.log2: Driver: 2 - MueLu Setup ==> 9.530 seconds
MueLuNNZ.log3: Driver: 2 - MueLu Setup ==> 9.583 seconds
MueLuNNZ.log4: Driver: 2 - MueLu Setup ==> 9.482 seconds
[sthomas1@n0292 performance]$ fgrep "Setup (total)" MueLuNNZ.log*
MueLuNNZ.log1: MueLu: Hierarchy: Setup (total) ==> 6.872 seconds
MueLuNNZ.log2: MueLu: Hierarchy: Setup (total) ==> 6.877 seconds
MueLuNNZ.log3: MueLu: Hierarchy: Setup (total) ==> 6.902 seconds
MueLuNNZ.log4: MueLu: Hierarchy: Setup (total) ==> 6.756 seconds
We are getting a completely solid 25% performance improvement.
Original (before the change):
[sthomas1@n0292 performance]$ fgrep "Driver: 2 - MueLu Setup" MueLu200.log*
MueLu200.log1: Driver: 2 - MueLu Setup ==> 12.390 seconds
MueLu200.log2: Driver: 2 - MueLu Setup ==> 12.200 seconds
MueLu200.log3: Driver: 2 - MueLu Setup ==> 12.400 seconds
[sthomas1@n0292 performance]$ fgrep "Setup (total)" MueLuReg.log*
MueLuReg.log1: MueLu: Hierarchy: Setup (total) ==> 8.851 seconds
MueLuReg.log10: MueLu: Hierarchy: Setup (total) ==> 8.728 seconds
MueLuReg.log11: MueLu: Hierarchy: Setup (total) ==> 8.674 seconds
MueLuReg.log2: MueLu: Hierarchy: Setup (total) ==> 8.765 seconds
MueLuReg.log3: MueLu: Hierarchy: Setup (total) ==> 8.783 seconds
MueLuReg.log4: MueLu: Hierarchy: Setup (total) ==> 8.762 seconds
MueLuReg.log5: MueLu: Hierarchy: Setup (total) ==> 8.599 seconds
MueLuReg.log6: MueLu: Hierarchy: Setup (total) ==> 8.490 seconds
MueLuReg.log7: MueLu: Hierarchy: Setup (total) ==> 8.569 seconds
MueLuReg.log8: MueLu: Hierarchy: Setup (total) ==> 8.799 seconds
MueLuReg.log9: MueLu: Hierarchy: Setup (total) ==> 8.763 seconds
These timings are based on 200 calls to MueLu set-up.