Tpetra: Use new cuBLAS API for dense matrix-matrix multiply

Created by: mhoemmen

@trilinos/tpetra See #1194 (closed) for justification and discussion.

Use new cuBLAS API

The cuBLAS manual says that "Starting with version 4.0, the cuBLAS Library provides a new updated API, in addition to the existing legacy API. . . . The new cuBLAS library API can be used by including the header file 'cublas_v2.h'."

http://docs.nvidia.com/cuda/cublas/

The manual also says: "In general, new applications should not use the legacy cuBLAS API, and existing applications should convert to using the new API if it requires sophisticated and optimal stream parallelism or if it calls cuBLAS routines concurrently from multiple threads."

Support concurrent tasks

One Tpetra goal is to support use of multiple Kokkos execution spaces (e.g., CUDA streams) concurrently, in order to support task parallelism. Switching to the new cuBLAS API, which takes a "context" handle on every call, will help with that, because each handle can be bound to its own CUDA stream.