A lot of the C++ stuff (in spinfel/device) will likely move to a separate PyPI package soon. This MR makes two changes:

  1. Get the device count from the CUDA API. I checked and this respects CUDA_VISIBLE_DEVICES on Cori GPU, Perlmutter, and Summit.
  2. Remove the gpu.devices_per_node setting -- this is controlled by srun or CUDA_VISIBLE_DEVICES now.

Re point 2 above: you need to remove the

devices_per_node = ...

setting from your own TOML files.
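
If you have several TOML files lying around, a quick scan can tell you which lines still need to go. This is only a rough sketch and not part of this MR: the helper name `find_stale_lines` is made up, and the `[gpu]` table and the value `4` in the sample text are placeholders, not taken from any real config.

```python
import re

# Matches an uncommented assignment such as "devices_per_node = 4".
# Commented-out lines (starting with "#") are deliberately not matched.
_STALE_KEY = re.compile(r"^\s*devices_per_node\s*=")


def find_stale_lines(toml_text: str) -> list:
    """Return the 1-based line numbers that still assign devices_per_node.

    Hypothetical helper: a plain line scan, not a full TOML parse.
    """
    return [
        lineno
        for lineno, line in enumerate(toml_text.splitlines(), start=1)
        if _STALE_KEY.match(line)
    ]


if __name__ == "__main__":
    # Placeholder config text; the table name and value are assumptions.
    sample = "[gpu]\ndevices_per_node = 4\n"
    for lineno in find_stale_lines(sample):
        print(f"line {lineno}: remove devices_per_node")
```

To check a real file, read it with `pathlib.Path(path).read_text()` and pass the result to `find_stale_lines`. A line scan is used instead of a TOML parser so the sketch has no dependencies and reports line numbers directly.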

This also fixes a bug where the orientation matching code would use the devices_per_node setting rather than context.dev_id.

