Belos::MultiVecTraits: Tpetra specialization is slow on GPU because it creates a local MultiVector, and thus does extra CUDA allocations

Created by: mhoemmen

@trilinos/belos @trilinos/tpetra @cgcgcg

The Tpetra specialization of Belos::MultiVecTraits has two methods that need to create temporary "local" MultiVector instances. Each creation allocates a new DualView. This makes running on the GPU slow, especially for classical Gram-Schmidt in GMRES, and more generally whenever computing X^T * Y or X^H * Y where X or Y has multiple columns.

I am working on a fix. The idea is to maintain a static DualView "pool" that the local MultiVector can view. One must be careful not to let any DualView instance persist past Kokkos::finalize.
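
For illustration only, here is a minimal sketch of the pool idea in plain C++. It is not the actual fix: `ScratchPool` and all of its members are hypothetical names, and a `std::vector<double>` stands in for the `Kokkos::DualView` so that the sketch compiles without Kokkos. It shows the two properties that matter: storage is reallocated only when a request outgrows the current capacity, and there is an explicit `release()` that drops the storage on demand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a grow-only scratch pool. A std::vector<double>
// stands in for the Kokkos::DualView that the real fix would hold.
class ScratchPool {
public:
  // Return storage for at least numRows * numCols entries.
  // Reallocates only if the request exceeds the current capacity.
  static double* get(std::size_t numRows, std::size_t numCols) {
    const std::size_t needed = numRows * numCols;
    // Guard against overflow in the size computation.
    assert(numCols == 0 || needed / numCols == numRows);
    std::vector<double>& buf = storage();
    if (buf.size() < needed) {
      buf.resize(needed);
      ++allocationCount();
    }
    return buf.data();
  }

  // Drop the pooled storage. With a real DualView, this would have to
  // run before Kokkos::finalize.
  static void release() {
    std::vector<double>().swap(storage());
  }

  // Number of times the pool had to (re)allocate; for diagnostics.
  static std::size_t numAllocations() { return allocationCount(); }

  // Current number of pooled entries.
  static std::size_t capacity() { return storage().size(); }

private:
  // Function-local statics avoid static initialization order problems.
  static std::vector<double>& storage() {
    static std::vector<double> buf;
    return buf;
  }
  static std::size_t& allocationCount() {
    static std::size_t count = 0;
    return count;
  }
};
```

With this sketch, repeated calls such as `ScratchPool::get(4, 3)` followed by `ScratchPool::get(2, 3)` reuse the same storage, so only the first call allocates. In a Kokkos build, `release()` would presumably be registered to run before `Kokkos::finalize` (for example through a finalize hook), so that no DualView outlives Kokkos.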