
Performance of looping over Tpetra CrsMatrix rows

Created by: aprokop

@trilinos/tpetra @jhux2 @mhoemmen @crtrott

Let me first admit that I am very likely doing something wrong.

I wrote a simple driver (located at muelu/test/perf_test_kokkos) that essentially finds the number of nonzeros in a CrsMatrix by looping over the rows and summing their lengths. It considers three scenarios (a rough sketch of the access patterns follows the list):

  • Looping through Xpetra layer abstraction (something MueLu is very interested in)
  • Looping directly through Tpetra/Epetra
  • Looping through the local Kokkos CrsMatrix
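For concreteness, here is a minimal sketch of what the second and third loops might look like. This is not the actual perf_test_kokkos driver: the matrix A is assumed, and member names such as getNodeNumRows, getNumEntriesInLocalRow, and getLocalMatrix may differ between Trilinos versions.

```cpp
#include <Tpetra_CrsMatrix.hpp>

// Count nonzeros through the Tpetra interface: one call per row
// through the abstraction layer.
size_t countNnzTpetra (const Tpetra::CrsMatrix<>& A)
{
  size_t nnz = 0;
  const size_t numRows = A.getNodeNumRows ();
  for (size_t i = 0; i < numRows; ++i) {
    nnz += A.getNumEntriesInLocalRow (static_cast<int> (i));
  }
  return nnz;
}

// Count nonzeros through the local Kokkos CrsMatrix: grab the local
// matrix once, then read its row map view directly.
size_t countNnzKokkos (const Tpetra::CrsMatrix<>& A)
{
  auto localA = A.getLocalMatrix ();  // name assumed; may vary by version
  size_t nnz = 0;
  for (int i = 0; i < localA.numRows (); ++i) {
    nnz += localA.graph.row_map (i + 1) - localA.graph.row_map (i);
  }
  return nnz;
}
```

The intent of the comparison is that the Tpetra loop pays the per-row cost of the abstraction layer, while the Kokkos loop only reads the raw row_map view.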

The results were somewhat unexpected to me. I was running with a single MPI rank, OpenMP with OMP_NUM_THREADS=1, and HWLOC disabled (so that Kokkos respects the thread count). Here are some results. For Tpetra:

Loop #1: Xpetra/Tpetra   0.05980 (1)
Loop #2: Tpetra          0.05867 (1)
Loop #3: Kokkos-1        0.00274 (1)
Loop #4: Kokkos-2        0.00214 (1)

For Epetra:

Loop #1: Xpetra/Epetra   0.01933 (1)
Loop #2: Epetra          0.01385 (1)
Loop #3: Kokkos-1        0.00427 (1)
Loop #4: Kokkos-2        0.00213 (1)

So it seems to me that using the local Kokkos matrix is absolutely the way to go, as it is ~30 times faster than going through Tpetra and ~6 times faster than going through Epetra.

I would like to know whether anybody has done performance studies like this, or what the reason for the difference could be. If I am doing something completely wrong, I would also like to know that.
