Belos::MultiVecTraits: Tpetra specialization is slow on GPU, because it creates local MV & thus does extra CUDA allocations
*Created by: mhoemmen*

@trilinos/belos @trilinos/tpetra @cgcgcg

The Tpetra specialization of `Belos::MultiVecTraits` has two methods that need to create temporary "local" MultiVector instances. Each creation does a new DualView allocation. This makes running on the GPU slow, especially for classical Gram-Schmidt in GMRES, and in general whenever computing `X^T * Y` or `X^H * Y` where either X or Y has multiple columns.

I am working on a fix. The idea is to maintain a static DualView "pool" that the local MultiVector can view. One must be careful not to let any DualView instance persist past `Kokkos::finalize`.
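To illustrate the pool idea, here is a minimal sketch of the reuse pattern, not the actual fix. It assumes a single scratch buffer that only grows, and it uses `std::vector<double>` as a stand-in for the DualView so that it compiles without Kokkos. `ScratchPool`, `get`, `release`, `capacity`, and `numAllocations` are hypothetical names, not part of Belos, Tpetra, or Kokkos.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a static DualView pool. std::vector<double>
// replaces the device allocation so the sketch builds without Kokkos.
class ScratchPool {
public:
  // Return storage for at least numRows * numCols entries.
  // A new allocation happens only when the request exceeds the
  // current capacity; otherwise the existing buffer is reused.
  static double* get(std::size_t numRows, std::size_t numCols) {
    const std::size_t needed = numRows * numCols;
    std::vector<double>& buf = storage();
    if (buf.size() < needed) {
      buf.resize(needed);
      ++numAllocations();
    }
    return buf.data();
  }

  // Free the pooled storage. With a real DualView, this would have to
  // run before Kokkos::finalize (assumption: the caller arranges that).
  static void release() {
    std::vector<double>().swap(storage());
  }

  // Number of entries currently held by the pool.
  static std::size_t capacity() { return storage().size(); }

  // Counts how many times the pool had to grow.
  static std::size_t& numAllocations() {
    static std::size_t count = 0;
    return count;
  }

private:
  // Function-local static avoids static initialization order issues.
  static std::vector<double>& storage() {
    static std::vector<double> buf;
    return buf;
  }
};
```

With this pattern, repeated requests for a local result of the same or smaller size (as in successive Gram-Schmidt steps) would reuse one allocation instead of allocating each time. The sketch is not thread-safe and hands out the same buffer to every caller, so it only fits the case where one temporary is live at a time.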