Performance of looping over Tpetra CrsMatrix rows
Created by: aprokop
@trilinos/tpetra @jhux2 @mhoemmen @crtrott
Let me first admit that I am very likely doing something wrong.
I wrote a simple driver (located at muelu/test/perf_test_kokkos) which essentially finds the number of nonzeros in a CrsMatrix by looping through the rows and adding up their lengths. It considers three scenarios (sketched below):
- Looping through Xpetra layer abstraction (something MueLu is very interested in)
- Looping directly through Tpetra/Epetra
- Looping through the local Kokkos CrsMatrix
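For concreteness, here is roughly what the two extremes of the comparison look like. This is a minimal sketch rather than the actual driver (which may differ in details); the function names `countViaTpetra`/`countViaKokkos` are made up for illustration, and `SC`/`LO`/`GO`/`NO` are the usual Tpetra Scalar/LocalOrdinal/GlobalOrdinal/Node template parameters.

```cpp
#include <Tpetra_CrsMatrix.hpp>

// Loop #2 style: per-row queries through the Tpetra interface.
template <class SC, class LO, class GO, class NO>
size_t countViaTpetra (const Tpetra::CrsMatrix<SC, LO, GO, NO>& A) {
  size_t nnz = 0;
  const LO numRows = static_cast<LO> (A.getNodeNumRows ());
  for (LO row = 0; row < numRows; ++row)
    nnz += A.getNumEntriesInLocalRow (row); // one interface call per row
  return nnz;
}

// Loop #3/#4 style: read the local Kokkos matrix's row offsets directly.
template <class SC, class LO, class GO, class NO>
size_t countViaKokkos (const Tpetra::CrsMatrix<SC, LO, GO, NO>& A) {
  auto lclA = A.getLocalMatrix (); // local Kokkos CrsMatrix
  size_t nnz = 0;
  for (LO row = 0; row < static_cast<LO> (lclA.numRows ()); ++row)
    nnz += lclA.graph.row_map(row + 1) - lclA.graph.row_map(row);
  return nnz;
}
```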
The results were somewhat unexpected to me. I ran with a single MPI rank, using OpenMP with OMP_NUM_THREADS=1 and HWLOC disabled (so that Kokkos respects the thread count). Here are some results.

For Tpetra:

```
Loop #1: Xpetra/Tpetra 0.05980 (1)
Loop #2: Tpetra        0.05867 (1)
Loop #3: Kokkos-1      0.00274 (1)
Loop #4: Kokkos-2      0.00214 (1)
```

For Epetra:

```
Loop #1: Xpetra/Epetra 0.01933 (1)
Loop #2: Epetra        0.01385 (1)
Loop #3: Kokkos-1      0.00427 (1)
Loop #4: Kokkos-2      0.00213 (1)
```
So it seems to me that using the local Kokkos matrix is absolutely the way to go, as it is ~30 times faster than going through Tpetra, and ~6 times faster than going through Epetra.
I would like to know whether anybody has done performance studies like this, or what the reason for this difference could be. If I am doing something completely wrong, I would also like to know that.