spinifel issues
https://gitlab.osti.gov/mtip/spinifel/-/issues (updated 2023-09-08)

Issue 75: Verify that autocorrelation.py results are the same for all ranks
https://gitlab.osti.gov/mtip/spinifel/-/issues/75 (Monarin Uervirojnangkoorn, updated 2023-09-08)

uvect_ADb and ugrid_conv are the same across all ranks.
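One way to verify this across ranks is to compare a digest of each array rather than shipping the full arrays around. A minimal sketch (assuming NumPy and mpi4py are available, as in spinifel's environment; `uvect_ADb` stands in for either array):

```python
import hashlib

import numpy as np


def array_digest(a):
    # Deterministic digest of an array's dtype, shape, and contents.
    a = np.ascontiguousarray(a)
    h = hashlib.sha256()
    h.update(str(a.dtype).encode())
    h.update(str(a.shape).encode())
    h.update(a.tobytes())
    return h.hexdigest()


# With mpi4py, each rank would compare its digest against rank 0's:
#
#   from mpi4py import MPI
#   comm = MPI.COMM_WORLD
#   d = array_digest(uvect_ADb)
#   ref = comm.bcast(d, root=0)
#   assert d == ref, f"rank {comm.rank}: uvect_ADb differs from rank 0"
```

Comparing digests keeps the check cheap at scale; broadcasting a short hex string costs far less than gathering full grids on every generation.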
@Seemah @jpblaschke @jjdonatelli

Assignee: Monarin Uervirojnangkoorn

Issue 74: Running in MPI mode fails on 32 ranks
https://gitlab.osti.gov/mtip/spinifel/-/issues/74 (Monarin Uervirojnangkoorn, updated 2023-09-06)

Mode: MPI/hdf5
Error:
```
Traceback (most recent call last):
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/spinifel/__main__.py", line 29, in <module>
main()
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/PyNVTX/__init__.py", line 33, in wrapper
ret = func(*args, **kwargs)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/spinifel/mpi/main.py", line 452, in main
ac = mg.solve_ac(generation, orientations, ac_phased)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/PyNVTX/__init__.py", line 33, in wrapper
ret = func(*args, **kwargs)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/spinifel/mpi/autocorrelation.py", line 161, in solve_ac
ret, info = cg(W, d, x0=x0, maxiter=self.maxiter, callback=self.callback)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/cupyx/scipy/sparse/linalg/_iterative.py", line 76, in cg
q = matvec(p)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/cupyx/scipy/sparse/linalg/_interface.py", line 89, in matvec
y = self._matvec(x)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/cupyx/scipy/sparse/linalg/_interface.py", line 282, in _matvec
return self.__matvec_impl(x)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/spinifel/mpi/autocorrelation.py", line 105, in W_matvec
uvect_ADA = self.core_problem_convolution(uvect, F_ugrid_conv_, ac_support)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/spinifel/sequential/autocorrelation.py", line 179, in core_problem_convolution
assert xp.all(xp.isreal(uvect))
AssertionError
```
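For context, `xp.isreal` is an exact test: it returns False for any element with a nonzero imaginary part, so round-off accumulated in the FFT-based convolution can trip this assert even when the result is real up to floating-point noise. A diagnostic sketch (NumPy stand-in for `xp`; this is an assumption about the failure mode, not a confirmed fix) that tolerates such noise:

```python
import numpy as np


def assert_effectively_real(uvect, rtol=1e-10):
    # Tolerance-based variant of `assert xp.all(xp.isreal(uvect))`:
    # pass when the imaginary part is negligible relative to the real part.
    imag_norm = np.linalg.norm(np.imag(uvect))
    real_norm = np.linalg.norm(np.real(uvect)) or 1.0
    assert imag_norm <= rtol * real_norm, (
        f"relative imaginary norm {imag_norm / real_norm:.3e} exceeds {rtol:.0e}"
    )
```

Logging `imag_norm / real_norm` per rank before the assert would also show whether the 32-rank run fails from round-off growth or from genuinely complex data.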
Steps to reproduce the problem:
```
sbatch submit_frontier.sh
```
where submit_frontier.sh is:
```bash
#!/bin/bash
#SBATCH -A chm137
#SBATCH -t 0:29:59
#SBATCH -N 4
#SBATCH -c 32
#SBATCH -J RunSpinifel
#SBATCH -o RunSpinifel_o.%J
#SBATCH -e RunSpinifel_e.%J
set +x
t_start=`date +%s`
# spinifel
source setup/env.sh
# Spinifel's env vars
export test_data_dir="/lustre/orion/proj-shared/chm137/demo23/data"
export out_dir="/lustre/orion/chm137/scratch/${USER}/${CI_PIPELINE_ID}/spinifel_output"
export USE_CUPY=1
# Create the output folder if it does not already exist.
if [ ! -d "${out_dir}" ]; then
mkdir -p ${out_dir}
fi
# Running Spinifel
FRONTIER_EXTRAS="runtime.use_pygpu=true"
N_IMAGES_PER_RANK=250
N_IMAGES_MAX=250
srun -N4 -n32 --gpus-per-task 1 python $DEBUG_FLAG -m spinifel --default-settings=test_mpi.toml --mode=mpi $FRONTIER_EXTRAS data.in_dir=${test_data_dir} data.name=3iyf_128x128pixels_2m.h5 runtime.N_images_per_rank=$N_IMAGES_PER_RANK algorithm.N_images_max=$N_IMAGES_MAX fsc.fsc_fraction_known_orientations=0
t_end=`date +%s`
echo PSJobCompleted TotalElapsed $((t_end-t_start)) $t_start $t_end
```
Note that this ran with srun -N2 -n16.
@Seemah @eslaught @jpblaschke

Assignee: Monarin Uervirojnangkoorn

Issue 73: Scaling Spinifel-Legion with Multiple Conformations on Frontier
https://gitlab.osti.gov/mtip/spinifel/-/issues/73 (Seema Mirchandaney, updated 2023-08-10)

Track progress/issues related to scaling on Frontier:
- Logs/results/scripts for 1000 images per rank (up to 1000 ranks and 1M diffraction patterns) are in
  - /lustre/orion/proj-shared/chm137/seemah/spinifel_output_frontier_opt
- Logs/results/scripts for 125 images per rank + Legion profiling (up to 8000 ranks and 1M diffraction patterns) are in
  - /lustre/orion/proj-shared/chm137/seemah/spinifel_output_frontier_prof
- Runs for 4000 and 8000 ranks did not complete - i.e. made very little progress
- I tested 4000 ranks for just 2 generations + disabled 'control flow' + used libfabric-related options for large buffers (all run options are in the logs):
  - export FI_CXI_DEFAULT_CQ_SIZE=13107200
  - export FI_CXI_REQ_BUF_MIN_POSTED=10
  - export FI_CXI_REQ_BUF_SIZE=25165824
- It ran and completed both generations and produced all the output files. I did kill the job, since it kept running for a while after all generations were done.
- Logs/results for that are in /lustre/orion/proj-shared/chm137/seemah/spinifel_output_frontier_debug/result_4000tasks/

Assignee: Seema Mirchandaney

Issue 72: eds/spack error on Perlmutter
https://gitlab.osti.gov/mtip/spinifel/-/issues/72 (Monarin Uervirojnangkoorn, updated 2023-07-27)

The installation of eds/spack was done on $SCRATCH. On an interactive node, running these commands produces the following error:
```
monarin@nid001584: source setup/env.sh
monarin@nid001584: srun -n1 -G1 python spinifel/tests/skopi_quaternion.py
srun: warning: can't run 1 processes on 2 nodes, setting nnodes to 1
MPICH ERROR [Rank 0] [job id 12528426.1] [Thu Jul 27 12:33:12 2023] [nid001584] - Abort(-1) (rank 0 in comm 0): MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
(Other MPI error)
aborting job:
MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
srun: error: nid001584: task 0: Exited with exit code 255
srun: Terminating StepId=12528426.1
```

Assignee: Elliott Slaughter

Issue 70: conformation algorithm
https://gitlab.osti.gov/mtip/spinifel/-/issues/70 (Seema Mirchandaney, updated 2023-05-22)
@yoon82 @jjdonatelli - To discuss how to implement conformations using softmax, which is computed during orientation matching.
The code for that is https://gitlab.osti.gov/mtip/spinifel/-/blob/sm/legion-fixes/spinifel/sequential/orientation_matching.py#L126
I've added a method in autocorrelation/merging that can use the result of 'conformation_result'. This currently assumes 'max_likelihood' mode and selects slices/orientations based on whether they belong/don't belong to the current conformation.
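As a sketch of the idea (names here are hypothetical; the actual implementation lives at the orientation_matching.py link above): a row-wise softmax over per-conformation match scores gives each image a weight for every conformation, and 'max_likelihood' mode reduces those weights to a hard assignment:

```python
import numpy as np


def conformation_weights(scores):
    # Row-wise softmax: scores[i, k] is image i's match score for
    # conformation k (higher is better). Subtracting the row max keeps
    # the exponentials numerically stable.
    s = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(s)
    return w / w.sum(axis=1, keepdims=True)


def max_likelihood_assignment(scores):
    # Hard assignment: each image goes to its highest-weight conformation.
    return np.argmax(conformation_weights(scores), axis=1)


# Slices/orientations belonging to conformation k would then be selected
# with a mask like: max_likelihood_assignment(scores) == k
```

Keeping the soft weights around (rather than only the argmax) would also allow a weighted merge later, if max-likelihood selection turns out to be too aggressive.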
https://gitlab.osti.gov/mtip/spinifel/-/blob/sm/legion-fixes/spinifel/sequential/autocorrelation.py#L235

Assignee: Seema Mirchandaney

Issue 68: Legion fails on Frontier (djh/crusher branch)
https://gitlab.osti.gov/mtip/spinifel/-/issues/68 (Monarin Uervirojnangkoorn, updated 2023-05-18)
@Seemah @eslaught I've been seeing this error running the Legion test on Frontier using the djh/crusher branch.
```
monarin@frontier07812:~/frontier/spinifel> ./test_main.sh
+ export CI_PIPELINE_ID=000
+ CI_PIPELINE_ID=000
++ hostname --fqdn
+ target=frontier07812.frontier.olcf.ornl.gov
+ export SPINIFEL_TEST_FLAG=1
+ SPINIFEL_TEST_FLAG=1
+ export PYCUDA_CACHE_DIR=/tmp
+ PYCUDA_CACHE_DIR=/tmp
+ [[ frontier07812.frontier.olcf.ornl.gov = *\s\u\m\m\i\t* ]]
+ [[ frontier07812.frontier.olcf.ornl.gov = *\a\s\c\e\n\t* ]]
+ [[ frontier07812.frontier.olcf.ornl.gov = *\c\r\u\s\h\e\r* ]]
+ [[ frontier07812.frontier.olcf.ornl.gov = *\f\r\o\n\t\i\e\r* ]]
+ export all_proxy=socks://proxy.ccs.ornl.gov:3128/
+ all_proxy=socks://proxy.ccs.ornl.gov:3128/
+ export ftp_proxy=ftp://proxy.ccs.ornl.gov:3128/
+ ftp_proxy=ftp://proxy.ccs.ornl.gov:3128/
+ export http_proxy=http://proxy.ccs.ornl.gov:3128/
+ http_proxy=http://proxy.ccs.ornl.gov:3128/
+ export https_proxy=http://proxy.ccs.ornl.gov:3128/
+ https_proxy=http://proxy.ccs.ornl.gov:3128/
+ export 'no_proxy=localhost,127.0.0.0/8,*.ccs.ornl.gov'
+ no_proxy='localhost,127.0.0.0/8,*.ccs.ornl.gov'
+ [[ frontier07812.frontier.olcf.ornl.gov = *\a\s\c\e\n\t* ]]
+ [[ frontier07812.frontier.olcf.ornl.gov = *\s\u\m\m\i\t* ]]
+ [[ frontier07812.frontier.olcf.ornl.gov = *\f\r\o\n\t\i\e\r* ]]
+ export test_data_dir=/lustre/orion/chm137/proj-shared/testdata
+ test_data_dir=/lustre/orion/chm137/proj-shared/testdata
+ export out_dir=/lustre/orion/chm137/scratch/monarin/000/spinifel_output
+ out_dir=/lustre/orion/chm137/scratch/monarin/000/spinifel_output
+ [[ frontier07812.frontier.olcf.ornl.gov = *\s\u\m\m\i\t* ]]
+ [[ frontier07812.frontier.olcf.ornl.gov = *\a\s\c\e\n\t* ]]
+ [[ frontier07812.frontier.olcf.ornl.gov = *\c\g\p\u* ]]
+ [[ frontier07812.frontier.olcf.ornl.gov = *\p\e\r\l\m\u\t\t\e\r* ]]
+ [[ frontier07812.frontier.olcf.ornl.gov = *\f\r\o\n\t\i\e\r* ]]
+ export 'SPINIFEL_TEST_LAUNCHER=srun -n1 -G1'
+ SPINIFEL_TEST_LAUNCHER='srun -n1 -G1'
+ export 'SPINIFEL_PSANA2_LAUNCHER=srun -n3 -G3'
+ SPINIFEL_PSANA2_LAUNCHER='srun -n3 -G3'
+ '[' '!' -d /lustre/orion/chm137/scratch/monarin/000/spinifel_output ']'
+ [[ frontier07812.frontier.olcf.ornl.gov = *\f\r\o\n\t\i\e\r* ]]
+ FRONTIER_EXTRAS='runtime.use_pygpu=true runtime.chk_convergence=false algorithm.N_generations=3 fsc.pdb_path='
+ PYTHONPATH=/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/install/lib/python3.8/site-packages:/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/lcls2/install/lib/python3.8/site-packages::/ccs/home/monarin/frontier/spinifel/mpi4py_poison_wrapper
+ srun -n1 -G1 legion_python -ll:py 1 -ll:csize 8192 legion_main.py --default-settings=summit_ci.toml --mode=legion runtime.use_pygpu=true runtime.chk_convergence=false algorithm.N_generations=3 fsc.pdb_path=
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at l/setup/gasnet/GASNet-2022.9.2/ofi-conduit/gasnet_ofi.c:946: fi_domain failed: -38(Function not implemented)
*** NOTICE (proc 0): Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
srun: error: frontier07812: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=1323140.2
```

Issue 67: Unit test and psana2 test fail in djh/crusher branch
https://gitlab.osti.gov/mtip/spinifel/-/issues/67 (Monarin Uervirojnangkoorn, updated 2023-05-17)
**Quick solution**: rebuild psana2
Two errors that are probably related and were resolved with the quick solution above:
1. https://code.olcf.ornl.gov/ci/chm137/dev/spinifel/-/jobs/9666
2. https://code.ornl.gov/ecpcitest/chm137/spinifel/-/jobs/1781243
We don't see this in development.
@Seemah @eslaught @yoon82

Assignee: Monarin Uervirojnangkoorn

Issue 62: Error running on summit
https://gitlab.osti.gov/mtip/spinifel/-/issues/62 (Chun Hong Yoon, updated 2023-04-18)
```
7fffea830000-7fffea870000 rw-p 00000000 00:00 0 [stack]
Size: 256 kB
KernelPageSize: 64 kB
MMUPageSize: 64 kB
Rss: 256 kB
Pss: 256 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 256 kB
Referenced: 256 kB
Anonymous: 256 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
VmFlags: rd wr mr mw me gd ac
VmFlags: rd wr mr mw me gd ac
MemTotal: 630978176 kB
MemFree: 552515008 kB
MemAvailable: 582776320 kB
Buffers: 640 kB
Cached: 43211136 kB
SwapCached: 0 kB
Active: 13815872 kB
Inactive: 33122304 kB
Active(anon): 13233280 kB
Inactive(anon): 1677760 kB
Active(file): 582592 kB
Inactive(file): 31444544 kB
Unevictable: 16777216 kB
Mlocked: 16777216 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 2624 kB
Writeback: 0 kB
AnonPages: 20521024 kB
Mapped: 1839424 kB
Shmem: 11180736 kB
KReclaimable: 514368 kB
Slab: 5428544 kB
SReclaimable: 514368 kB
SUnreclaim: 4914176 kB
KernelStack: 37056 kB
PageTables: 22912 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 315489088 kB
Committed_AS: 74855168 kB
VmallocTotal: 549755813888 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
Percpu: 596992 kB
HardwareCorrupted: 128 kB
AnonHugePages: 393216 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 26853376 kB
CmaFree: 26853376 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
Hugetlb: 0 kB
ibv_reg_mr returns 13, addr=0x160d0000, bytes=262144
python: /__SMPI_build_dir__________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/ibvdevice/IBVMRManager.h:230: pami_result_t PAMI::Device::IBV::IBVMRManager::create_mr(void*, size_t, PAMI::Device::IBV::IBVMRElem*): Assertion `mr[n] != __null' failed.
[g05n05:1401055] *** Process received signal ***
[g05n05:1401055] Signal: Aborted (6)
[g05n05:1401055] Signal code: (-6)
[g05n05:1401055] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000504d8]
[g05n05:1401055] [ 1] /lib64/power9/libc.so.6(gsignal+0xd8)[0x200000353618]
[g05n05:1401055] [ 2] /lib64/power9/libc.so.6(abort+0x164)[0x200000333a2c]
[g05n05:1401055] [ 3] /lib64/power9/libc.so.6(+0x36f70)[0x200000346f70]
[g05n05:1401055] [ 4] /lib64/power9/libc.so.6(__assert_fail+0x64)[0x200000347014]
[g05n05:1401055] [ 5] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libpami.so.3(_ZN4PAMI6Device3IBV12IBVMRManager9create_mrEPvmPNS1_9IBVMRElemE+0x2b4)[0x200061b08554]
[g05n05:1401055] [ 6] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libpami.so.3(_ZN4PAMI14MemregionCache15createMemregionEPNS_9MemregionEPmmPvm+0x930)[0x200061b31af0]
[g05n05:1401055] [ 7] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libpami.so.3(_ZN4PAMI7Context21memregion_create_implEPvmPmPA24_m+0x17c)[0x200061b323cc]
[g05n05:1401055] [ 8] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libpami.so.3(PAMI_Memregion_create+0x18)[0x200061adc1f8]
[g05n05:1401055] [ 9] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send9EagerRgetINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEENS4_8DmaModelISL_Lb0EEEE6simpleEP11pami_send_t+0x158)[0x200061b99858]
[g05n05:1401055] [10] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send9CompositeINS1_4SendES3_E6simpleEP11pami_send_t+0x40)[0x200061aec500]
[g05n05:1401055] [11] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libpami.so.3(PAMI_Send+0x58)[0x200061ad4a38]
[g05n05:1401055] [12] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libcollectives.so.3(_ZN7LibColl18NativeInterfaceP2PILb1ELb0EE19multicast_over_sendEPNS1_20p2p_multicast_send_tEm+0x25c)[0x2000823cde7c]
[g05n05:1401055] [13] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libcollectives.so.3(_ZN7LibColl18NativeInterfaceP2PILb1ELb0EE9multicastEPNS_14ni_multicast_tE+0x36c)[0x2000823ce3dc]
[g05n05:1401055] [14] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libcollectives.so.3(_ZN4CCMI8Executor9BroadcastINS_17ConnectionManager14CommSeqConnMgrE20_gather_cheader_dataE5startEv+0x78)[0x200082308798]
[g05n05:1401055] [15] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libcollectives.so.3(_ZN4CCMI9Protocols9Broadcast22AsyncBroadcastFactoryTINS1_15AsyncBroadcastTINS_17ConnectionManager14CommSeqConnMgrENS_8Executor9BroadcastIS5_20_gather_cheader_dataEELN7LibColl15topologyIndex_tE0ELj256EEEE8generateEPvSE_+0x53c)[0x20008235df3c]
[g05n05:1401055] [16] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libcollectives.so.3(_ZN7LibColl7Adapter20StaticCollSelAdviser10autoSelectE19libcoll_xfer_type_tP14libcoll_xfer_tb+0x1dc)[0x2000822de6fc]
[g05n05:1401055] [17] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port/libcollectives.so.3(LIBCOLL_AutoSelect+0x58)[0x2000822d9fe8]
[g05n05:1401055] [18] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/spectrum_mpi/mca_coll_ibm.so(start_libcoll_blocking_collective+0x478)[0x2000720001a8]
[g05n05:1401055] [19] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/spectrum_mpi/mca_coll_ibm.so(mca_coll_ibm_bcast+0x208)[0x2000720056c8]
[g05n05:1401055] [20] /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/spectrum-mpi-10.4.0.3-20210112-i64lccn3bi6ibfavxamdhjoaqvdvbo5b/lib/libmpi_ibm.so.3(MPI_Bcast+0x198)[0x200000651158]
[g05n05:1401055] [21] /autofs/nccs-svm1_home1/chuck/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/mpi4py/MPI.cpython-38-powerpc64le-linux-gnu.so(+0x110eec)[0x200030ef0eec]
[g05n05:1401055] [22] python[0x10043a14]
[g05n05:1401055] [23] python(_PyObject_MakeTpCall+0xec)[0x100444cc]
[g05n05:1401055] [24] python(_PyEval_EvalFrameDefault+0x87b0)[0x1002cc40]
[g05n05:1401055] [25] python(PyEval_EvalFrameEx+0x34)[0x1012a9f4]
[g05n05:1401055] [26] python[0x10022cd4]
[g05n05:1401055] [27] python(PyVectorcall_Call+0xa0)[0x10046be0]
[g05n05:1401055] [28] python(PyObject_Call+0x1ec)[0x10046fbc]
[g05n05:1401055] [29] python(_PyEval_EvalFrameDefault+0x3314)[0x100277a4]
[g05n05:1401055] *** End of error message ***
ERROR: One or more process (first noticed rank 0) terminated with signal 6
PSJobCompleted TotalElapsed 22 1674153562 1674153584
```

Assignee: Elliott Slaughter

Issue 60: FY22 cleaning list
https://gitlab.osti.gov/mtip/spinifel/-/issues/60 (Chun Hong Yoon, updated 2022-12-16)
- [x] Merge psana2 demo branch to development: mona
- [x] Master update from development + tag: chuck
- [x] black migration: chuck
- [x] Describe trailing underscore convention in read...- [x] Merge legion demo branch to development: seema
- [x] Merge psana2 demo branch to development: mona
- [x] Master update from development + tag: chuck
- [x] black migration: chuck
- [x] Describe trailing underscore convention in readme
- [ ] improve CI test (double-check single vs. double finufft precision: Johannes): seema; mode cleanup: mona [DONE]
- [ ] streaming mode soft reset of variables (define streaming, reuse snm, mg and nufft): mona+seema, make table of variables: chuck [DONE]
- [ ] check FFT normalization convention. Mtot scaling consistency in sequential/phasing.py and AC_solver: chuck+jeff
- [ ] Try different grid size: chuck
- [ ] Create bigger datasets: chuck
- [ ] Create more types of protein datasets: chuck
- [ ] Try more nodes: chuck
- [ ] scaling beyond 8 ranks for legion (on Perlmutter only): CXI (part of libfabric): johannes
- [ ] swap out pycuda with pybindgpu library (portable to AMD): johannes
- [ ] dynamically updating toml file (updating run numbers to process in streaming mode): johannes
- [ ] improve performance of legion: seema
- [ ] add legion in lcls2 repo (adding legion features): seema+mona
- [ ] documentation (wiki, describe option arguments): all

Milestone: 2022-12-20

Issue 59: Finufft3d fails with Incompatible function arguments for single precision setting (runtime.use_single_prec=true)
https://gitlab.osti.gov/mtip/spinifel/-/issues/59 (Chun Hong Yoon, updated 2022-12-01)
Error in streaming mode on cori gpu when various GPU-related flags are switched off:
```
[runtime]
use_cuda = true
use_cufinufft = false
use_cupy = false
```
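The traceback below shows the mismatch: the `fk` array arrives as complex64, while the only compiled `finufft3d1_cpp` signature accepts float64/complex128 arrays. A hedged workaround sketch (NumPy only; upcasting sidesteps the type error at the cost of defeating single precision, and the proper fix is presumably a finufft build with single-precision bindings):

```python
import numpy as np


def to_finufft_double(*arrays):
    # Upcast real arrays to float64 and complex arrays to complex128,
    # contiguously, so they match the double-precision finufft binding.
    out = []
    for a in arrays:
        a = np.asarray(a)
        dtype = np.complex128 if np.iscomplexobj(a) else np.float64
        out.append(np.ascontiguousarray(a, dtype=dtype))
    return out


# Hypothetical call site, mirroring the adjoint() call in nufft_ext.py.
# Note that upcasting copies: anything finufft writes into the upcast fk
# would have to be copied back into the caller's single-precision array.
#   x, y, z, c, fk = to_finufft_double(x, y, z, c, fk)
```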
```
Traceback (most recent call last):
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/setup/conda/envs/myenv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/setup/conda/envs/myenv/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/spinifel/__main__.py", line 25, in <module>
main_psana2()
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/PyNVTX/__init__.py", line 33, in wrapper
ret = func(*args, **kwargs)
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/spinifel/mpi/main_psana2.py", line 361, in main
ac = mg.solve_ac(generation)
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/PyNVTX/__init__.py", line 33, in wrapper
ret = func(*args, **kwargs)
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/spinifel/mpi/autocorrelation.py", line 165, in solve_ac
W, d = self.setup_linops(H, K, L, ac_support, x0)
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/PyNVTX/__init__.py", line 33, in wrapper
ret = func(*args, **kwargs)
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/spinifel/mpi/autocorrelation.py", line 93, in setup_linops
ugrid_conv = self.nufft.adjoint(
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/PyNVTX/__init__.py", line 33, in wrapper
ret = func(*args, **kwargs)
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/spinifel/extern/nufft_ext.py", line 490, in adjoint
assert not finufft.nufft3d1(
File "/global/project/projectdirs/lcls/chuck/temp/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/finufftpy/_interfaces.py", line 457, in nufft3d1
return finufftpy_cpp.finufft3d1_cpp(x,y,z,c,isign,eps,ms,mt,mu,f,debug,spread_debug,spread_sort,fftw,modeord,chkbnds,upsampfac)
TypeError: finufft3d1_cpp(): incompatible function arguments. The following argument types are supported:
1. (xj: numpy.ndarray[numpy.float64], yj: numpy.ndarray[numpy.float64], zj: numpy.ndarray[numpy.float64], cj: numpy.ndarray[numpy.complex128], iflag: int, eps: float, ms: int, mt: int, mu: int, fk: numpy.ndarray[numpy.complex128], debug: int, spread_debug: int, spread_sort: int, fftw: int, modeord: int, chkbnds: int, upsampfac: float) -> int
Invoked with: array([-2.48019188, -2.47531443, -2.4704406 , ..., 2.66082175,
2.69253805, 2.72423416]), array([-0.74826405, -0.72593415, -0.70365277, ..., -1.26888202,
-1.25659757, -1.24433789]), array([-1.77717593, -1.75186601, -1.72646421, ..., -0.9427877 ,
-0.94572608, -0.94856553]), array([1.+0.j, 1.+0.j, 1.+0.j, ..., 1.+0.j, 1.+0.j, 1.+0.j]), -1, 1e-12, 162, 162, 162, array([[[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
...,
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]],
[[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
...,
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]],
[[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
...,
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]],
...,
[[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
...,
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]],
[[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
...,
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]],
[[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
...,
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]]],
dtype=complex64), 0, 0, 2, 0, 0, 1, 2.0
```

Assignee: Monarin Uervirojnangkoorn

Issue 57: legion test on development branch fails on corigpu
https://gitlab.osti.gov/mtip/spinifel/-/issues/57 (Monarin Uervirojnangkoorn, updated 2022-09-08)
To reproduce this error, follow the Quick Start steps on Cori GPU, then run the Legion test. An example script for running this test is shown here:
```bash
#!/bin/bash
set -xe
export CI_PIPELINE_ID=000
target=${SPINIFEL_TARGET:-${NERSC_HOST:-$(hostname --fqdn)}}
export PYCUDA_CACHE_DIR="/tmp"
if [[ ${target} = *"ascent"* ]]; then
export all_proxy=socks://proxy.ccs.ornl.gov:3128/
export ftp_proxy=ftp://proxy.ccs.ornl.gov:3128/
export http_proxy=http://proxy.ccs.ornl.gov:3128/
export https_proxy=http://proxy.ccs.ornl.gov:3128/
export no_proxy='localhost,127.0.0.0/8,*.ccs.ornl.gov'
export test_data_dir="/gpfs/wolf/chm137/proj-shared/spinifel_data/testdata"
export OUT_DIR="/gpfs/wolf/chm137/proj-shared/ci/${CI_PIPELINE_ID}/spinifel_output"
elif [[ ${target} = *"summit"* ]]; then
export test_data_dir="/gpfs/alpine/proj-shared/chm137/data/testdata"
export OUT_DIR="/gpfs/alpine/proj-shared/chm137/test_main/${CI_PIPELINE_ID}/spinifel_output"
else
export test_data_dir="${CFS}/m2859/data/testdata"
export OUT_DIR="${SCRATCH}/spinifel_output"
fi
if [[ ${target} = *"summit"* || ${target} = *"ascent"* ]]; then
export SPINIFEL_TEST_LAUNCHER="jsrun -n1 -a1 -g1"
export SPINIFEL_PSANA2_LAUNCHER="jsrun -n4 -g1"
else
export SPINIFEL_TEST_LAUNCHER="srun -n1 -G1"
export SPINIFEL_PSANA2_LAUNCHER="srun -n4 -G1"
fi
PYTHONPATH="$PYTHONPATH:$EXTERNAL_WORKDIR:$PWD/mpi4py_poison_wrapper" $SPINIFEL_TEST_LAUNCHER legion_python -ll:py 1 -ll:csize 8192 legion_main.py --default-settings=summit_ci.toml --mode=legion
```
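The failure below is a libstdc++ version mismatch: the libstdc++ picked up at runtime (`/usr/common/software/sles15_cgpu/gcc/8.3.0/lib64/libstdc++.so.6`) does not provide the `GLIBCXX_3.4.29` symbol version that the conda-built scipy requires. A small diagnostic helper (hypothetical, not part of spinifel) for finding the highest GLIBCXX version in the output of `strings libstdc++.so.6`:

```python
import re


def max_glibcxx(strings_output):
    # Return the highest GLIBCXX symbol version in a `strings` dump,
    # or None if no GLIBCXX versions are present.
    versions = re.findall(r"GLIBCXX_(\d+(?:\.\d+)*)", strings_output)
    if not versions:
        return None
    return max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


# Usage, e.g. on the library named in the error:
#   strings /usr/common/software/sles15_cgpu/gcc/8.3.0/lib64/libstdc++.so.6 | grep GLIBCXX
# If the highest version reported is below 3.4.29, that libstdc++ is too old
# for the conda scipy, pointing at library search order as the likely culprit.
```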
Error:
```bash
WARNING: Found 9 IB HCAs, but GASNet was configured with '--with-ibv-max-hcas=4'. To utilize all your HCAs, you should reconfigure GASNet using '--with-ibv-max-hcas=9'. You can silence this warning by setting the environment variable GASNET_IBV_PORTS as described in the file 'gasnet/ibv-conduit/README' to specify the desired HCA(s), or by setting the environment variable GASNET_IBV_PORTS_VERBOSE=0 to use the default.
[0 - 15551599e840] 2.473322 {6}{python}: python exception occurred within task:
Traceback (most recent call last):
File "/global/u2/m/monarin/spinifel_development/setup/install/lib/python3.8/site-packages/legion_top.py", line 434, in legion_python_main
run_path(args[start], run_name='__main__')
File "/global/u2/m/monarin/spinifel_development/setup/install/lib/python3.8/site-packages/legion_top.py", line 255, in run_path
exec(code, module.__dict__, module.__dict__)
File "legion_main.py", line 2, in <module>
from spinifel.legion import main
File "/global/u2/m/monarin/spinifel_development/spinifel/legion/__init__.py", line 1, in <module>
from .main import main
File "/global/u2/m/monarin/spinifel_development/spinifel/legion/main.py", line 12, in <module>
from .autocorrelation import solve_ac
File "/global/u2/m/monarin/spinifel_development/spinifel/legion/autocorrelation.py", line 3, in <module>
import skopi as skp
File "/global/u2/m/monarin/spinifel_development/setup/skopi/skopi/__init__.py", line 1, in <module>
from skopi.diffraction import *
File "/global/u2/m/monarin/spinifel_development/setup/skopi/skopi/diffraction.py", line 3, in <module>
from scipy.interpolate import CubicSpline
File "/global/u2/m/monarin/spinifel_development/setup/conda/envs/myenv/lib/python3.8/site-packages/scipy/interpolate/__init__.py", line 166, in <module>
from .interpolate import *
File "/global/u2/m/monarin/spinifel_development/setup/conda/envs/myenv/lib/python3.8/site-packages/scipy/interpolate/interpolate.py", line 21, in <module>
from .interpnd import _ndim_coords_from_arrays
File "interpnd.pyx", line 1, in init scipy.interpolate.interpnd
File "/global/u2/m/monarin/spinifel_development/setup/conda/envs/myenv/lib/python3.8/site-packages/scipy/spatial/__init__.py", line 96, in <module>
from .kdtree import *
File "/global/u2/m/monarin/spinifel_development/setup/conda/envs/myenv/lib/python3.8/site-packages/scipy/spatial/kdtree.py", line 5, in <module>
from .ckdtree import cKDTree, cKDTreeNode
ImportError: /usr/common/software/sles15_cgpu/gcc/8.3.0/lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /global/u2/m/monarin/spinifel_development/setup/conda/envs/myenv/lib/python3.8/site-packages/scipy/spatial/ckdtree.cpython-38-x86_64-linux-gnu.so)
WARNING! The environment variable OUT_DIR supersedes all other inputs for this setting. If this is unintensional unset OUT_DIR.
legion_python: /global/u2/m/monarin/spinifel_development/setup/legion/runtime/realm/python/python_module.cc:992: virtual void Realm::LocalPythonProcessor::execute_task(Realm::Processor::TaskFuncID, const Realm::ByteArrayRef&): Assertion `0' failed.
*** Caught a fatal signal (proc 0): SIGABRT(6)
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
WARNING: ODP shutdown in signal context
srun: error: cgpu18: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=3459493.3
```
Elliott Slaughter

https://gitlab.osti.gov/mtip/spinifel/-/issues/55
[Spring Cleaning] MPI Call after Finalize
2022-10-04T22:39:00Z
Monarin Uervirojnangkoorn

Changes in the PSANA2 branch seem to invoke this error.
The error message below is reported when running with psana2 (xtc2).
*** The MPI_Comm_rank() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[h49n18:1058242] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Monarin Uervirojnangkoorn

https://gitlab.osti.gov/mtip/spinifel/-/issues/48
[Spring Cleaning] psana2 cannot be found with the build script
2022-10-04T22:36:58Z
Iris Chang

Occasionally, psana cannot be found with the build script on the psana2 branch.
Elliott Slaughter

https://gitlab.osti.gov/mtip/spinifel/-/issues/47
FSC code not working on Cori
2022-03-28T16:24:24Z
Iris Chang

I wasn't able to run the FSC code on Cori by following the instructions on the wiki.
```
(myenv) ihchang@cgpu06:/global/cfs/cdirs/lcls/ihchang/spinifel/spinifel/eval> python fsc.py -m /global/cscratch1/sd/ihchang/spinifel_output/rho-8.mrc -d /global/cfs/cdirs/m2859/data/3iyf/clean/3iyf_dp005_128x128pixels_500k.h5 -p ../../setup/skopi/examples/input/pdb/3iyf.pdb -o /global/cscratch1/sd/ihchang/spinifel_output/eval --zoom 0.6
[1648255885.433943] [cgpu06:74299:0] ucp_context.c:690 UCX WARN transports 'cuda_copy','gdr_copy','cuda_ipc' are not available, please use one or more of: cma, dc, dc_mlx5, dc_x, ib, mm, posix, rc, rc_mlx5, rc_v, rc_verbs, rc_x, self, shm, sm, sysv, tcp, ud, ud_mlx5, ud_v, ud_verbs, ud_x
```
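For readers unfamiliar with what fsc.py computes: the Fourier shell correlation (FSC) measures the agreement between two density maps as a function of spatial frequency, by correlating their Fourier transforms within concentric radial shells. The sketch below is an illustrative NumPy implementation under simplifying assumptions (cubic volumes, uniform shell binning); it is not the project's fsc.py, and the function name and default shell count are hypothetical.

```python
import numpy as np

def fourier_shell_correlation(vol1, vol2, n_shells=20):
    """FSC between two cubic 3D volumes, binned into radial frequency shells."""
    assert vol1.shape == vol2.shape
    f1 = np.fft.fftshift(np.fft.fftn(vol1))
    f2 = np.fft.fftshift(np.fft.fftn(vol2))
    # Radial frequency index of every voxel, measured from the volume center.
    grid = np.indices(vol1.shape) - np.array(vol1.shape)[:, None, None, None] // 2
    r = np.sqrt((grid ** 2).sum(axis=0))
    shells = np.minimum((r / r.max() * n_shells).astype(int), n_shells - 1)
    fsc = np.empty(n_shells)
    for s in range(n_shells):
        m = shells == s
        # Normalized cross-correlation of the two transforms within shell s.
        num = np.real((f1[m] * np.conj(f2[m])).sum())
        den = np.sqrt((np.abs(f1[m]) ** 2).sum() * (np.abs(f2[m]) ** 2).sum())
        fsc[s] = num / den if den > 0 else 0.0
    return fsc
```

Identical inputs give an FSC of 1 in every shell; independent noise drives the curve toward 0 at high frequency, which is why a cutoff on this curve is used as a resolution estimate.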
According to @jpblaschke, cori gpu is a low priority: it might get fixed, but there is currently no staff time to maintain this properly.
Ariana Peck

https://gitlab.osti.gov/mtip/spinifel/-/issues/45
Error running at high numbers on Perlmutter
2023-08-11T23:41:56Z
Johannes Paul Blaschke

@cahrens has seen the following when running at 64 nodes on Perlmutter:
```
cahrens@perlmutter:login38:/pscratch/sd/c/cahrens/spinifel_output/spinifel_pm_2022-02-19T2311-0800_TRIAL_clean_3iyf_ensemble_dev> more output_1450602_nodes_64_nimages_1500_norient_10000_nbinning_0_nbatchsize_100.log
BATCHSIZE: 100
root_dir: /global/homes/c/cahrens/Projects/exafel/pm/spinifel-dev
NTASKS_PER_NODE: 4
NCPUS_PER_NODE: 4
RUN_MODE: mpi
LAUNCH_SCRIPT: spinifel
SRUN_COMMAND: srun -n 256 --ntasks-per-node=4 -c 10 --gpus-per-task=1 python -m spinifel --default-settings=pm_gpu_mpi.toml --mode=mpi runtime.N_images_per_rank=1500 algorithm.N_bin
ning=0 algorithm.N_orientations=10000 algorithm.N_batch_size=100 data.out_dir=/pscratch/sd/c/cahrens/spinifel_output/spinifel_pm_2022-02-19T2311-0800_TRIAL_clean_3iyf_ensemble_dev/n
odes_64_nimages_1500_norient_10000_nbinning_0_nbatchsize_100 data.name=3iyf_sim_400k.h5 data.in_dir=/global/cfs/cdirs/m2859/data/3iyf/clean
WARNING! The environment variable VERBOSE supersedes all other inputs for this setting. If this is unintensional unset VERBOSE.
WARNING! The environment variable DATA_DIR supersedes all other inputs for this setting. If this is unintensional unset DATA_DIR.
WARNING! The environment variable DATA_FILENAME supersedes all other inputs for this setting. If this is unintensional unset DATA_FILENAME.
WARNING! The environment variable OUT_DIR supersedes all other inputs for this setting. If this is unintensional unset OUT_DIR.
SpinifelSettings:
+ M = 81
(derived data)
+ M_ups = 162
(derived data)
+ Mquat = 20
(derived data)
+ N_batch_size = 100
source: algorithm.N_batch_size
description: N_batch_size parameter for slicing in batches
+ N_binning = 0
source: algorithm.N_binning
description: N_binning parameter for dataset preprocessing
+ N_binning_tot = 0
(derived data)
+ N_clipping = 0
source: algorithm.N_clipping
description: N_clipping parameter for dataset preprocessing
+ N_generations = 10
source: algorithm.N_generations
description: max generations
+ N_images_max = 10000
source: algorithm.N_images_max
description: max images
+ N_images_per_rank = 1500
source: runtime.N_images_per_rank
description: no. of images per rank
+ N_orientations = 10000
source: algorithm.N_orientations
description: N_orientations parameter for orientation matching
+ N_phase_loops = 10
source: algorithm.N_phase_loops
description: number of loops for phasing
+ beta = 0.3
source: algorithm.beta
description: negative feedback in HIO
+ chk_convergence = False
source: runtime.chk_convergence
description: if false, no check if output density converges
+ cutoff = 0.05
source: algorithm.cutoff
description: cutoff in shrinkwrap
+ data_dir = /global/cfs/cdirs/m2859/data/3iyf/clean
source: data.in_dir
description: data dir
+ data_field_name = intensities
source: detector.data_field_name
description: name of data field in the detector output files
+ data_filename = 3iyf_sim_400k.h5
source: data.name
description: data file name
+ data_path = /global/cfs/cdirs/m2859/data/3iyf/clean/3iyf_sim_400k.h5
(derived data)
+ data_type_str = float32
source: detector.data_type_str
description: type string (numpy) for the detector output
+ det_shape = (1, 128, 128)
source: detector.shape
description: detector shape
+ load_generation = 0
source: algorithm.load_generation
description: start from output of this generation
+ nER = 50
source: algorithm.nER
description: number of iterations in ER
+ nHIO = 25
source: algorithm.nHIO
description: number of iterations in HIO
+ orientation_type_str = float32
source: algorithm.orientation_type_str
description: type string (numpy) for the orientation array
+ out_dir = /pscratch/sd/c/cahrens/spinifel_output/spinifel_pm_2022-02-19T2311-0800_TRIAL_clean_3iyf_ensemble_dev/nodes_64_nimages_1500_norient_10000_nbinning_0_nbatchsize_100
source: data.out_dir
description: output dir
+ oversampling = 1
source: algorithm.oversampling
description: oversampling rate
+ pixel_index_shape = (2, 1, 128, 128)
(derived data)
+ pixel_index_shape_0 = (2,)
source: algorithm.pixel_index_shape_0
description: pixel_index_shape = pixel_index_shape_0 + det_shape
+ pixel_index_type_str = int32
source: algorithm.pixel_index_type_str
description: type string (numpy) for the pixel_index array
+ pixel_position_shape = (3, 1, 128, 128)
(derived data)
+ pixel_position_shape_0 = (3,)
source: algorithm.pixel_position_shape_0
description: pixel_position_shape = pixel_position_shape_0 + det_shape
+ pixel_position_type_str = float32
source: algorithm.pixel_position_type_str
description: type string (numpy) for the pixel_position array
+ ps_eb_nodes = 1
source: psana.ps_eb_nodes
description: no. of eventbuilder cores
+ ps_exp = xpptut1
source: psana.exp
description: PSANA experiment name
+ ps_runnum = 1
source: psana.runnum
description: PSANA experiment number
+ ps_smd_n_events = 10000
source: psana.ps_smd_n_events
description: no. of events to be sent to an EventBuilder core
+ ps_srv_nodes = 0
source: psana.ps_srv_nodes
description: no. of server cores
+ reduced_det_shape = (1, 128, 128)
(derived data)
+ reduced_pixel_index_shape = (2, 1, 128, 128)
(derived data)
+ reduced_pixel_position_shape = (3, 1, 128, 128)
(derived data)
+ solve_ac_maxiter = 100
source: algorithm.solve_ac_maxiter
description: max number of iterations in the CG solver
+ test = Quickstart settings for Perlmutter
source: debug.test
description: test field used for debugging
+ use_callmonitor = False
source: debug.use_callmonitor
description: enable call-monitor
+ use_cuda = True
source: runtime.use_cuda
description: use cuda wherever possible
+ use_cufinufft = True
source: runtime.use_cufinufft
description: use cufinufft for nufft support
+ use_cupy = True
source: runtime.use_cupy
description: use cupy wherever possible
+ use_psana = False
source: psana.enable
description: enable PSANA
+ use_single_prec = False
source: runtime.use_single_prec
description: if true, spinifel will use single-precision floating point
+ verbose = True
source: debug.verbose
description: is verbosity > 0
+ verbosity = 1
source: debug.verbosity
description: reporting verbosity
+ volume_shape = (151, 151, 151)
source: algorithm.volume_shape
description: shape of volume array
+ volume_type_str = complex64
source: algorithm.volume_type_str
description: type string (numpy) for the volume array
…
Traceback (most recent call last):
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/setup/conda/envs/myenv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/setup/conda/envs/myenv/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/spinifel/__main__.py", line 24, in <module>
main()
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/setup/conda/envs/myenv/lib/python3.8/site-packages/PyNVTX/__init__.py", line 33, in wrapper
ret = func(*args, **kwargs)
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/spinifel/mpi/main.py", line 137, in main
ac = solve_ac(
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/setup/conda/envs/myenv/lib/python3.8/site-packages/PyNVTX/__init__.py", line 33, in wrapper
ret = func(*args, **kwargs)
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/spinifel/mpi/autocorrelation.py", line 209, in solve_ac
ret, info = cg(W, d, x0=x0, maxiter=maxiter, callback=callback)
File "<decorator-gen-7>", line 2, in cg
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/setup/conda/envs/myenv/lib/python3.8/site-packages/scipy/_lib/_threadsafety.py", line 44, in caller
return func(*a, **kw)
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/setup/conda/envs/myenv/lib/python3.8/site-packages/scipy/sparse/linalg/isolve/iterative.py", line 329, in cg
work[slice2] += sclr1*matvec(work[slice1])
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/setup/conda/envs/myenv/lib/python3.8/site-packages/scipy/sparse/linalg/interface.py", line 232, in matvec
y = self._matvec(x)
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/setup/conda/envs/myenv/lib/python3.8/site-packages/scipy/sparse/linalg/interface.py", line 530, in _matvec
return self.__matvec_impl(x)
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/spinifel/mpi/autocorrelation.py", line 119, in W_matvec
uvect_ADA = autocorrelation.core_problem_convolution(
File "/global/u2/c/cahrens/Projects/exafel/pm/spinifel-dev/spinifel/autocorrelation.py", line 167, in core_problem_convolution
assert np.all(np.isreal(uvect))
```

https://gitlab.osti.gov/mtip/spinifel/-/issues/43
Need latest psana2 to get list of mpi ranks to exclude in spinifel
2022-02-07T19:43:32Z
Chun Hong Yoon

from psana.psexp.tools import get_excl_ranks
Monarin Uervirojnangkoorn

https://gitlab.osti.gov/mtip/spinifel/-/issues/42
propose to deprecate sequential code
2022-02-02T06:12:49Z
Chun Hong Yoon

I'm proposing to deprecate sequential code. Reasons are:
1) I am not aware of anyone using the code
2) there's a lot of duplication of code, i.e. sequential/autocorrelation.py vs mpi/autocorrelation.py

https://gitlab.osti.gov/mtip/spinifel/-/issues/41
latest finufft does not build on cori gpu
2022-02-02T19:27:21Z
Chun Hong Yoon

We need the latest finufft to compare cmtip and spinifel on Cori gpu.
Adding @jpblaschke's email message to summarize current status:
finufft needs the single-precision fftw3 symbols -- which we don't build on cori gpu. We have a few options here: i) I have a PR to disable single-precision finufft symbols: https://github.com/flatironinstitute/finufft/pull/179/files (there is some discussion, so this lost momentum, but we should probably pick it up again); ii) build your own fftw3; iii) build the single-precision symbols in our cori-gpu fftw3 module.
Johannes Paul Blaschke

https://gitlab.osti.gov/mtip/spinifel/-/issues/40
problem building/installing spinifel development branch on Perlmutter
2022-01-25T23:40:19Z
Christine Marie Sweeney
While doing build_from_scratch, it errors out building cupy. In the attached file, I first load a script that makes some aliases for me, without setting any environment variables. Then I module load cudatoolkit, then build_from_scratch.
It used to work a couple of weeks ago. Vinay has been able to do the install manually after this error by activating the just-created myenv conda environment, then doing a conda install cupy and the rest of the build_from_scratch steps manually.
I haven't been able to do that, but once Perlmutter comes back up, I can try. However, if anyone has any suggestions on why this is erroring out, let me know.
[pm-build-error.txt](/uploads/70648b42b4fee21c71291d207d4f4a99/pm-build-error.txt)
Thanks!

https://gitlab.osti.gov/mtip/spinifel/-/issues/39
Clean up unit tests and integrate with CI
2021-12-16T22:42:40Z
Iris Chang

This is a reminder to myself to clean up unit tests, organize them in two separate folders (spinifel/tests and tests), and integrate with CI.
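As a sketch of what such a unit test might look like (hypothetical file and test names, not actual spinifel tests): a fast test under spinifel/tests could check an invariant similar to the realness assertion that trips in mpi/autocorrelation.py, while heavier end-to-end runs would live under tests/.

```python
# Hypothetical spinifel/tests/test_autocorrelation_invariants.py
import numpy as np

def test_autocorrelation_of_real_data_is_real():
    # The autocorrelation of real data is itself real (Wiener-Khinchin:
    # ACF = inverse FFT of the power spectrum), which is the kind of
    # invariant asserted in autocorrelation.py before the CG solve.
    img = np.random.default_rng(1).random((8, 8, 8))
    acf = np.fft.ifftn(np.abs(np.fft.fftn(img)) ** 2)
    # Only floating-point round-off should remain in the imaginary part.
    assert np.max(np.abs(acf.imag)) < 1e-8 * np.max(np.abs(acf.real))
```

Tests in this style need no MPI launcher or GPU, so they can run in a plain CI job via pytest.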