Scaling Spinifel-Legion with Multiple Conformations on Frontier
Track Progress/Issues related to Scaling on Frontier
-
Logs/results/scripts for the runs with 1000 images per rank (up to 1000 ranks and 1M diffraction patterns) are in:
- /lustre/orion/proj-shared/chm137/seemah/spinifel_output_frontier_opt
-
Logs/results/scripts for the runs with 125 images per rank + legion profiling (up to 8000 ranks and 1M diffraction patterns) are in:
- /lustre/orion/proj-shared/chm137/seemah/spinifel_output_frontier_prof
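
As a quick sanity check on the two configurations above, the total number of diffraction patterns is just ranks times images per rank. The sketch below uses a hypothetical helper, `total_patterns`, which is not part of Spinifel or the run scripts; it only confirms that both setups reach 1M patterns at their largest scale.

```shell
# Hypothetical helper (not from the run scripts): total patterns = ranks * images per rank.
total_patterns() {
    ranks=$1
    images_per_rank=$2
    echo $(( ranks * images_per_rank ))
}

total_patterns 1000 1000   # 1000 ranks x 1000 images per rank, prints 1000000
total_patterns 8000 125    # 8000 ranks x 125 images per rank, prints 1000000
```

Both configurations process the same total amount of data at their largest scale; the second spreads it over 8x as many ranks.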
-
The runs at 4000 and 8000 ranks did not complete; they made very little progress.
-
I then tested 4000 ranks for just 2 generations, with 'control flow' disabled and with libfabric options for large buffers (all run options are in the logs):
- export FI_CXI_DEFAULT_CQ_SIZE=13107200
- export FI_CXI_REQ_BUF_MIN_POSTED=10
- export FI_CXI_REQ_BUF_SIZE=25165824
- The run completed both generations and produced all the output files. I killed the job because it kept running for a while after all generations were done.
- Logs/results for that run are in /lustre/orion/proj-shared/chm137/seemah/spinifel_output_frontier_debug/result_4000tasks/
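
For reference, a minimal sketch of how the three libfabric settings from the 4000-rank test could be set in a job script before the launch command. The variable names and values are the ones listed above; the echo loop and the MiB conversion are additions for illustration only, and the launch command itself is omitted because the actual run options live in the logs.

```shell
# libfabric CXI settings used for the 4000-rank test (values from the notes above)
export FI_CXI_DEFAULT_CQ_SIZE=13107200
export FI_CXI_REQ_BUF_MIN_POSTED=10
export FI_CXI_REQ_BUF_SIZE=25165824

# Illustrative addition: record the settings in the job log so each run documents them.
for var in FI_CXI_DEFAULT_CQ_SIZE FI_CXI_REQ_BUF_MIN_POSTED FI_CXI_REQ_BUF_SIZE; do
    printf '%s=%s\n' "$var" "$(printenv "$var")"
done

# Illustrative addition: assuming the request buffer size is given in bytes, convert it to MiB.
req_buf_mib=$(( FI_CXI_REQ_BUF_SIZE / 1024 / 1024 ))
echo "request buffer size: ${req_buf_mib} MiB"
```

Printing the values at the top of the job output makes it easier to compare runs later, since the buffer settings are otherwise only visible in the submission script.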