Use fence() instead of CUDA_LAUNCH_BLOCKING=1 in Tpetra
Created by: ambrad
My understanding is that, right now, we need to set CUDA_LAUNCH_BLOCKING=1 to use Tpetra and possibly other Trilinos packages on the GPU.
This flag makes every kernel launch synchronous: the host blocks until each kernel completes before the next one can launch.
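For concreteness, here is a minimal Kokkos sketch of the kind of race the env var is papering over. It assumes a CUDA build with a UVM view (which I believe is where Tpetra's host-side accesses of device data come in); the view name and sizes are just illustrative:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // A UVM view: the host can dereference device data directly,
    // which is exactly where a missing fence becomes a race.
    Kokkos::View<double*, Kokkos::CudaUVMSpace> x("x", 1 << 20);

    // On CUDA this launch is asynchronous: control returns to the
    // host immediately, before the kernel finishes.
    Kokkos::parallel_for("fill", x.extent(0),
        KOKKOS_LAMBDA(const int i) { x(i) = 2.0 * i; });

    // Without this fence, the host read below races with the
    // still-running kernel -- unless CUDA_LAUNCH_BLOCKING=1 has
    // silently serialized every launch for us.
    Kokkos::fence();

    std::printf("x(0) = %f\n", x(0));
  }
  Kokkos::finalize();
  return 0;
}
```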
In application development, we'd like to avoid that for two reasons.
First, we'd like to run with Tpetra while testing that our own application code has no race conditions between kernels. Since the env var forces blocking launches on the entire application, it masks any such race in our code, which means we can't test for it.
Second, we'll certainly want multiple kernels launched from one MPI rank to run simultaneously across the GPUs on a node. I'm not sure what effect CUDA_LAUNCH_BLOCKING has in the multiple-GPUs-per-node case, but serialized launches would at minimum work against that.
I believe the proper fix for this is to track down the places in Tpetra where fence() needs to be called. Once fence() is used in all the right spots, CUDA_LAUNCH_BLOCKING can be set to 0 (or left unset). A hypothetical sketch of the pattern is below.
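As a sketch of what "fence in the right spots" might look like, here is a hypothetical pack-and-exchange routine in the style of what Tpetra does internally. The function name and buffers are invented for illustration, not actual Tpetra internals; the point is only that the fence goes between the kernel that produces the data and the host-side code that consumes it:

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>

// Hypothetical: fence the device before handing kernel-produced data
// to MPI on the host, instead of relying on CUDA_LAUNCH_BLOCKING=1.
void exchange(Kokkos::View<double*, Kokkos::CudaUVMSpace> sendbuf,
              Kokkos::View<double*, Kokkos::CudaUVMSpace> recvbuf,
              int partner, MPI_Comm comm) {
  // Ensure any kernel that filled sendbuf has completed before the
  // host (or a non-CUDA-aware MPI) touches the buffer.
  Kokkos::fence();

  MPI_Sendrecv(sendbuf.data(), (int)sendbuf.extent(0), MPI_DOUBLE,
               partner, 0,
               recvbuf.data(), (int)recvbuf.extent(0), MPI_DOUBLE,
               partner, 0,
               comm, MPI_STATUS_IGNORE);
}
```

With fences placed like this at the device-to-host handoffs, launches elsewhere stay asynchronous, so the app keeps kernel overlap and can still detect its own races.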