Tpetra: Measure thread scaling of solver kernels, with MPI (single node)

Created by: mhoemmen

@trilinos/tpetra Epic: #820 (@jjellio, I can't assign you yet because you're not part of the Trilinos organization. See #819 (closed). After that's done, I'll assign you too :-) .)

Measure thread scaling of solver kernels (mainly OpenMP, Haswell and KNL, though other platforms are welcome), with MPI, on a single node. (See #823 (closed) for discussion of why measuring single-node performance both without and with MPI is important.)

This corresponds to the performance results that @jjellio showed at this week's Kokkos / Tpetra developers' meeting. @jjellio ran on a single node with a single problem, and kept (# MPI processes) * (# threads per process) constant.
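To make the sweep concrete, here is a minimal sketch of how one might enumerate the run configurations when (# MPI processes) * (# threads per process) is held constant. The function name `process_thread_configs` and the total of 16 cores are assumptions for illustration only; they are not taken from @jjellio's actual runs.

```python
def process_thread_configs(total_cores):
    """Return every (mpi_processes, threads_per_process) pair
    whose product equals total_cores."""
    return [
        (procs, total_cores // procs)
        for procs in range(1, total_cores + 1)
        if total_cores % procs == 0
    ]


# Hypothetical core count; substitute the core count of the node under test.
configs = process_thread_configs(16)
for procs, threads in configs:
    print(f"{procs:2d} MPI processes x {threads:2d} threads per process")
```

Each pair uses the same number of cores, so differences in run time between pairs reflect the MPI vs. threads trade-off rather than a change in resources.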

For sparse matrix-vector multiply within a single NUMA domain, in the ideal case, I would expect running with more than one thread per process to be a little bit faster than running with MPI only. This is for two reasons:

1. With MPI, Tpetra needs to pack and unpack entries in the source vector for Import. With OpenMP, the memory hardware handles communication between threads, with no need to pack or unpack in software.
2. Tpetra has some unnecessary overheads that require fixing (see, e.g., #435).
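
The pack / unpack point can be illustrated with a toy sketch. This is not Tpetra's Import implementation; the helper names `pack` and `unpack`, the index lists, and the two-"process" layout are all hypothetical, and a plain list copy stands in for the MPI send and receive.

```python
def pack(x_local, export_indices):
    # Copy the entries another process needs into a contiguous send buffer.
    return [x_local[i] for i in export_indices]


def unpack(recv_buffer, x_ghost, import_slots):
    # Scatter received values into the receiver's ghost entries.
    for slot, value in zip(import_slots, recv_buffer):
        x_ghost[slot] = value


# Two hypothetical processes each own half of a 4-entry source vector.
x_owned_by_p0 = [1.0, 2.0]
x_owned_by_p1 = [3.0, 4.0]

# MPI-style: process 1 needs entry 1 of process 0's part.
send_buffer = pack(x_owned_by_p0, [1])
recv_buffer = list(send_buffer)  # stands in for the MPI send / receive
ghost_on_p1 = [0.0]
unpack(recv_buffer, ghost_on_p1, [0])

# Thread-style: one shared vector, so a thread reads the entry directly.
x_shared = x_owned_by_p0 + x_owned_by_p1
value_read_by_thread = x_shared[1]
```

Both paths end up with the same value; the MPI-style path just takes two extra software copies (pack and unpack) to get there, which is the overhead that threads within a single NUMA domain can avoid.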