Fixes required for Legion to run cufinufft
There are two changes here:
- pycuda stores the current context in thread-local state. Since Legion executes each task in a thread, the context must be set up again each time a task uses the GPU.
- Some precision changes to match the MPI implementation.
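
A minimal sketch of the thread-local issue behind the first change. It does not use pycuda or Legion: FakeContext, current_context, gpu_context and run_in_thread are hypothetical stand-ins, with push()/pop() only mirroring the names of the pycuda.driver.Context methods. It shows that a context made current on one thread is not visible on another, so each task has to push it itself.

```python
import threading
from contextlib import contextmanager

# Stand-in for pycuda's per-thread context stack (assumption: illustration only).
_tls = threading.local()


class FakeContext:
    """Hypothetical stand-in for a pycuda.driver.Context."""

    def push(self):
        # Make this context current on the calling thread only.
        if not hasattr(_tls, "stack"):
            _tls.stack = []
        _tls.stack.append(self)

    def pop(self):
        # Remove the current context from the calling thread's stack.
        _tls.stack.pop()


def current_context():
    """Return the context that is current on the calling thread, or None."""
    stack = getattr(_tls, "stack", [])
    return stack[-1] if stack else None


@contextmanager
def gpu_context(ctx):
    """Push ctx for the duration of a task body and pop it afterwards."""
    ctx.push()
    try:
        yield ctx
    finally:
        ctx.pop()


def run_in_thread(fn):
    """Run fn on a fresh thread, loosely imitating a runtime-launched task."""
    result = []
    worker = threading.Thread(target=lambda: result.append(fn()))
    worker.start()
    worker.join()
    return result[0]


shared_ctx = FakeContext()
shared_ctx.push()  # current on the main thread only


def task_without_setup():
    # No push here: the worker thread has an empty context stack.
    return current_context()


def task_with_setup():
    # The task makes the context current itself before "using the GPU".
    with gpu_context(shared_ctx):
        return current_context()


seen_without = run_in_thread(task_without_setup)
seen_with = run_in_thread(task_with_setup)
```

Here seen_without is None while seen_with is shared_ctx, which is why the setup is done inside each task rather than once at startup.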