Created by: kyungjoo-kim
This PR improves the Ifpack2 BlockTriDiContainer. Previously,
- the block tridiagonal solve was separate from the conversion of the multivector between packed and flat formats.
- the conversion routine was fused with the norm calculation.
- this arrangement used team parallelism inefficiently.
With this PR,
- the block tridiagonal solve is fused with the multivector conversion routine.
- the norm calculation is done on a Tpetra multivector with a range policy.
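To illustrate the second point, here is a minimal sketch in plain C++ of a norm computed as a flat reduction over the rows of an already-converted (flat) multivector. It is not the Ifpack2 or Tpetra code: the function name `column_norms_squared` and the column-major layout are assumptions made for illustration, and the serial loop only stands in for what a Kokkos range-policy reduction would do in parallel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Ifpack2 or Tpetra.
// y is assumed to be a flat, column-major multivector:
// entry (row, col) is stored at y[col * num_rows + row].
// Returns the squared 2-norm of each column.
std::vector<double> column_norms_squared(const std::vector<double>& y,
                                         std::size_t num_rows,
                                         std::size_t num_vecs) {
  std::vector<double> norms(num_vecs, 0.0);
  for (std::size_t col = 0; col < num_vecs; ++col) {
    double sum = 0.0;
    // A single flat index range over rows, with no team hierarchy:
    // the serial analogue of a range-policy reduction.
    for (std::size_t row = 0; row < num_rows; ++row) {
      const double v = y[col * num_rows + row];
      sum += v * v;
    }
    norms[col] = sum;
  }
  return norms;
}
```

Because the reduction no longer shares a kernel with the packed/flat conversion, it can run over a simple one-dimensional index range instead of being tied to the team layout used by the block solve.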
Motivation and Context
This addresses the slow norm computation used in the stopping criterion of the fixed-point iteration.
How Has This Been Tested?