Performance of looping over Tpetra CrsMatrix rows
Created by: aprokop
@trilinos/tpetra @jhux2 @mhoemmen @crtrott
Let me first admit that I am very likely doing something wrong.
I wrote a simple driver (located at muelu/test/perf_test_kokkos) which essentially finds the number of nonzeros in a CrsMatrix by looping through the rows and adding up their lengths. It considers three scenarios (sketched below):
- Looping through Xpetra layer abstraction (something MueLu is very interested in)
- Looping directly through Tpetra/Epetra
- Looping through the local Kokkos CrsMatrix
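For concreteness, here is roughly what the two extremes of the comparison look like. This is a minimal sketch rather than the actual driver (which may differ in details); the function names `countViaTpetra`/`countViaKokkos` are made up for illustration, and `SC`/`LO`/`GO`/`NO` are the usual Tpetra Scalar/LocalOrdinal/GlobalOrdinal/Node template parameters.

```cpp
#include <Tpetra_CrsMatrix.hpp>

// Loop #2 style: per-row queries through the Tpetra interface.
template <class SC, class LO, class GO, class NO>
size_t countViaTpetra (const Tpetra::CrsMatrix<SC, LO, GO, NO>& A) {
  size_t nnz = 0;
  const LO numRows = static_cast<LO> (A.getNodeNumRows ());
  for (LO row = 0; row < numRows; ++row)
    nnz += A.getNumEntriesInLocalRow (row); // one interface call per row
  return nnz;
}

// Loop #3/#4 style: read the local Kokkos matrix's row offsets directly.
template <class SC, class LO, class GO, class NO>
size_t countViaKokkos (const Tpetra::CrsMatrix<SC, LO, GO, NO>& A) {
  auto lclA = A.getLocalMatrix (); // local Kokkos CrsMatrix
  size_t nnz = 0;
  for (LO row = 0; row < static_cast<LO> (lclA.numRows ()); ++row)
    nnz += lclA.graph.row_map(row + 1) - lclA.graph.row_map(row);
  return nnz;
}
```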
The results were somewhat unexpected to me. I ran with a single MPI rank, using OpenMP with OMP_NUM_THREADS=1 and HWLOC disabled (so that Kokkos respects the thread count). Here are some results.

For Tpetra:

```
Loop #1: Xpetra/Tpetra 0.05980 (1)
Loop #2: Tpetra        0.05867 (1)
Loop #3: Kokkos-1      0.00274 (1)
Loop #4: Kokkos-2      0.00214 (1)
```

For Epetra:

```
Loop #1: Xpetra/Epetra 0.01933 (1)
Loop #2: Epetra        0.01385 (1)
Loop #3: Kokkos-1      0.00427 (1)
Loop #4: Kokkos-2      0.00213 (1)
```
So it seems to me that using the local Kokkos matrix is absolutely the way to go, as it is ~30 times faster than going through Tpetra, and ~6 times faster than going through Epetra.
I would like to know whether anybody has done performance studies like this, or what the reason for this difference could be. If I am doing something completely wrong, I would also like to know that.