Add pybind11 wrappers for CUDA <-> Python memcpy
Since #5 (closed) appears on Summit but not on Cori-GPU, one possibility is that it's a CUDA driver issue on Summit (their engineers admit that they only support the runtime API).
So my current theory is that this is related to PyCUDA's use of the CUDA driver API (rather than the CUDA runtime API). A clean workaround would be to not use PyCUDA at all -- which we probably want to do at some point anyway (e.g. when we want to experiment with pinned memory, or when we want to use non-NVIDIA GPUs).
Fortunately we only need the following CUDA runtime API functions:
- cudaMalloc
- cudaFree
- cudaMemcpy
- cudaMallocHost
(We also need versions of these that can target different devices and streams.)
I propose writing thin pybind11 wrappers for this -- see this example: https://github.com/JBlaschke/PyNVTX/blob/master/PyNVTX/PyNVTX_backend.cu
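For reference, the calls such wrappers would forward to are small enough to try out directly from Python. The sketch below uses ctypes rather than pybind11, purely to keep the example in one file; it is not the proposed implementation. The helper names (`load_cudart`, `check`, `roundtrip`) are hypothetical, and it assumes `libcudart` can be found by `ctypes.util.find_library`, which may not hold on systems that rely on environment modules.

```python
import ctypes
import ctypes.util

# cudaMemcpyKind values as defined by the CUDA runtime API.
CUDA_MEMCPY_HOST_TO_DEVICE = 1
CUDA_MEMCPY_DEVICE_TO_HOST = 2


def load_cudart():
    """Return a handle to libcudart, or None if it cannot be located or loaded."""
    name = ctypes.util.find_library("cudart")
    if name is None:
        return None
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


def check(status, call):
    """Raise if a runtime API call returned anything other than cudaSuccess (0)."""
    if status != 0:
        raise RuntimeError(f"{call} failed with cudaError_t={status}")


def roundtrip(cudart, payload):
    """Copy `payload` (bytes) host -> device -> host using only runtime API calls."""
    nbytes = len(payload)
    size = ctypes.c_size_t(nbytes)
    dev = ctypes.c_void_p()
    check(cudart.cudaMalloc(ctypes.byref(dev), size), "cudaMalloc")
    try:
        src = ctypes.create_string_buffer(payload, nbytes)
        dst = ctypes.create_string_buffer(nbytes)
        check(cudart.cudaMemcpy(dev, src, size, CUDA_MEMCPY_HOST_TO_DEVICE),
              "cudaMemcpy (host to device)")
        check(cudart.cudaMemcpy(dst, dev, size, CUDA_MEMCPY_DEVICE_TO_HOST),
              "cudaMemcpy (device to host)")
        return dst.raw
    finally:
        check(cudart.cudaFree(dev), "cudaFree")


if __name__ == "__main__":
    cudart = load_cudart()
    if cudart is None:
        print("libcudart not found; nothing to test")
    else:
        print(roundtrip(cudart, b"hello from the runtime API"))
```

cudaMallocHost (with cudaFreeHost) would follow the same pattern as cudaMalloc and cudaFree here. If this round trip works on Summit where the PyCUDA path fails, that would support the driver-API theory.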
Note: Python's setuptools needs help compiling CUDA code -- see this example: https://github.com/JBlaschke/PyNVTX/blob/master/setup.py
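The "help" mostly amounts to routing `.cu` files to nvcc, since the default compiler does not know what to do with them. Below is a minimal sketch of the pieces involved. It is not taken from the linked setup.py, all three helper names are hypothetical, and looking for nvcc under `CUDA_HOME` is an assumption about how the environment is set up.

```python
import os
import shutil
from pathlib import Path


def find_nvcc(environ=None):
    """Look for nvcc under $CUDA_HOME/bin first, then on PATH; None if absent."""
    env = os.environ if environ is None else environ
    cuda_home = env.get("CUDA_HOME")  # assumed convention, not guaranteed to be set
    if cuda_home:
        candidate = Path(cuda_home) / "bin" / "nvcc"
        if candidate.is_file():
            return str(candidate)
    return shutil.which("nvcc")


def split_sources(sources):
    """Separate CUDA sources (for nvcc) from the rest (for the default compiler)."""
    cuda = [s for s in sources if str(s).endswith(".cu")]
    other = [s for s in sources if not str(s).endswith(".cu")]
    return cuda, other


def nvcc_compile_command(source, obj, include_dirs=(), nvcc="nvcc"):
    """Build the nvcc command line that compiles one .cu file to an object file."""
    cmd = [nvcc, "-c", str(source), "-o", str(obj)]
    # Extension modules are shared libraries, so the host compiler needs -fPIC.
    cmd += ["-Xcompiler", "-fPIC"]
    cmd += [f"-I{inc}" for inc in include_dirs]
    return cmd
```

In a setup.py, a custom `build_ext` step could run each command with `subprocess.run(cmd, check=True)` and hand the resulting object files to the extension via `extra_objects`. The include directories would need to cover both the Python headers and the pybind11 headers.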