Tpetra: Clarify operating procedures for using MPI + multiple GPUs on a node

Created by: mhoemmen

@trilinos/tpetra Blocked by: #1673 (closed); without it, we can't build on the relevant testbeds.

Tpetra's documentation needs to explain how to use MPI with multiple GPUs on a node. It looks like Kokkos already knows how to deal with this, as long as MPI_Init is called before Kokkos::initialize and the --kokkos-ndevices argument is used correctly. I copied the documentation below from Kokkos' --help output:

--kokkos-ndevices=INT[,INT] : used when running MPI jobs. Specify number of
                              devices per node to be used. Process to device
                              mapping happens by obtaining the local MPI rank
                              and assigning devices round-robin. The optional
                              second argument allows for an existing device
                              to be ignored. This is most useful on workstations
                              with multiple GPUs of which one is used to drive
                              screen output.
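The ordering requirement mentioned above (MPI_Init before Kokkos::initialize) could look roughly like the following in an application's main(). This is a minimal sketch, not Tpetra's documented procedure; it assumes an MPI installation and a Kokkos build with a GPU backend, and the placement of the application code is only illustrative.

```cpp
#include <mpi.h>
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  // Initialize MPI first, so that the MPI process context exists
  // before Kokkos decides which device this process should use.
  MPI_Init(&argc, &argv);

  // Kokkos reads its --kokkos-* command-line arguments here
  // (for example --kokkos-ndevices).
  Kokkos::initialize(argc, argv);
  {
    // Application code goes here. Keeping it in its own scope ensures
    // that Kokkos objects are destroyed before Kokkos::finalize().
  }
  Kokkos::finalize();

  // Finalize in the reverse order of initialization.
  MPI_Finalize();
  return 0;
}
```

A launch line would then pass the same flag to every process, for example `mpirun -np 4 ./my_app --kokkos-ndevices=2`, where `./my_app` and the process count are hypothetical.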

Just to clarify, it looks like all the MPI processes can take the same --kokkos-ndevices argument. Kokkos will use each process's local (on-node) MPI rank to do the device mapping. See also https://github.com/kokkos/kokkos/issues/50 and https://github.com/kokkos/kokkos/issues/544.
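To illustrate why every process can receive the same argument, here is a sketch of the round-robin mapping that the help text describes. Both helpers are hypothetical and are not Kokkos' actual code. The environment variables are the ones that Open MPI, MVAPICH2, and Slurm set for the node-local rank; that Kokkos obtains the local rank this way is an assumption here.

```cpp
#include <cstdlib>

// Hypothetical helper: round-robin assignment of a node-local MPI rank
// to a device ID in the range [0, numDevices).
int roundRobinDevice(int localRank, int numDevices) {
  return localRank % numDevices;
}

// Hypothetical helper: read the node-local rank from environment
// variables set by common launchers. Returns 0 if none is set.
int localRankFromEnv() {
  const char* names[] = {"OMPI_COMM_WORLD_LOCAL_RANK",
                         "MV2_COMM_WORLD_LOCAL_RANK",
                         "SLURM_LOCALID"};
  for (const char* name : names) {
    if (const char* value = std::getenv(name)) {
      return std::atoi(value);
    }
  }
  return 0;
}
```

With four processes per node and --kokkos-ndevices=2, this mapping would put local ranks 0 and 2 on device 0 and local ranks 1 and 3 on device 1. The device count is the same for all processes; only the local rank differs.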