Monarin Uervirojnangkoorn (0daa7137) at 12 Oct 00:50
update frontier job script
... and 1 more commit
Monarin Uervirojnangkoorn (55a20f59) at 09 Oct 22:47
Legion multi-conformation script for demo 2023
... and 341 more commits
Monarin Uervirojnangkoorn (55a20f59) at 09 Oct 22:43
Legion multi-conformation script for demo 2023
... and 2 more commits
Monarin Uervirojnangkoorn (2c4bd0b8) at 09 Oct 17:11
allow checking convergence at a specific generation
Monarin Uervirojnangkoorn (a5f2cca9) at 13 Sep 21:39
apply tmp fix for running with high no. of ranks (issue 45)
uvect_ADb and ugrid_conv are the same across all ranks. @Seemah @jpblaschke @jjdonatelli
Proposed solution (from the Johannes/Jeff/Mona/Seema meeting):
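The solution itself isn't captured in this entry. A plausible sketch, assuming the fix is to compute the two rank-invariant arrays once on rank 0 and broadcast them (mpi4py; the function name and array shapes below are hypothetical, not Spinifel's actual code):

# Hypothetical sketch: uvect_ADb and ugrid_conv are identical on every
# rank, so compute them once on rank 0 and broadcast, instead of having
# every rank recompute (or reduce) its own copy.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def compute_rank_invariant_arrays():
    # Placeholder for the real computation in solve_ac; shapes assumed.
    uvect_ADb = np.zeros(1000, dtype=np.complex128)
    ugrid_conv = np.zeros((81, 81, 81), dtype=np.complex128)
    return uvect_ADb, ugrid_conv

if comm.rank == 0:
    uvect_ADb, ugrid_conv = compute_rank_invariant_arrays()
else:
    uvect_ADb, ugrid_conv = None, None

# bcast pickles and distributes the rank-0 objects, so every rank ends up
# with bit-identical arrays regardless of the rank count.
uvect_ADb = comm.bcast(uvect_ADb, root=0)
ugrid_conv = comm.bcast(ugrid_conv, root=0)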
Monarin Uervirojnangkoorn (5aa21741) at 07 Sep 22:38
all worker ranks print only when verbosity > 1
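A minimal sketch of the behavior this commit describes, assuming an mpi4py communicator and a verbosity value read from Spinifel's runtime settings (the names here are illustrative, not the actual code):

from mpi4py import MPI

comm = MPI.COMM_WORLD
verbosity = 1  # assumed to come from the runtime settings

def log(msg):
    # Rank 0 always prints; worker ranks stay quiet unless verbosity > 1.
    if comm.rank == 0 or verbosity > 1:
        print(f"[rank {comm.rank}] {msg}")

log("solve_ac: starting conjugate gradient")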
@jjdonatelli Do you have any thoughts on the error above?
This also fails on Perlmutter. I looked at last year's milestone runs and we only tried up to 16 ranks, so it is likely we are just hitting this bug for the first time.
Running on 4 nodes fails for any number of images; I think this is new.
Mode: MPI/hdf5
Error:
Traceback (most recent call last):
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/spinifel/__main__.py", line 29, in <module>
main()
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/PyNVTX/__init__.py", line 33, in wrapper
ret = func(*args, **kwargs)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/spinifel/mpi/main.py", line 452, in main
ac = mg.solve_ac(generation, orientations, ac_phased)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/PyNVTX/__init__.py", line 33, in wrapper
ret = func(*args, **kwargs)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/spinifel/mpi/autocorrelation.py", line 161, in solve_ac
ret, info = cg(W, d, x0=x0, maxiter=self.maxiter, callback=self.callback)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/cupyx/scipy/sparse/linalg/_iterative.py", line 76, in cg
q = matvec(p)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/cupyx/scipy/sparse/linalg/_interface.py", line 89, in matvec
y = self._matvec(x)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/setup/conda/envs/myenv/lib/python3.8/site-packages/cupyx/scipy/sparse/linalg/_interface.py", line 282, in _matvec
return self.__matvec_impl(x)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/spinifel/mpi/autocorrelation.py", line 105, in W_matvec
uvect_ADA = self.core_problem_convolution(uvect, F_ugrid_conv_, ac_support)
File "/autofs/nccs-svm1_home1/monarin/frontier/spinifel/spinifel/sequential/autocorrelation.py", line 179, in core_problem_convolution
assert xp.all(xp.isreal(uvect))
AssertionError
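The assertion fires because xp.isreal is an exact test (imag == 0 elementwise), so any floating-point residue in the imaginary part fails it. As a diagnostic, not a fix, one could temporarily replace the assert with a check that reports how far uvect drifts from real (xp stands for NumPy or CuPy, matching USE_CUPY; the tolerance is an assumption):

import numpy as xp  # or: import cupy as xp, when USE_CUPY is set

def check_real(uvect, atol=1e-12):
    # Report the largest imaginary component instead of failing opaquely.
    max_imag = float(xp.abs(uvect.imag).max())
    if max_imag > atol:
        raise AssertionError(
            f"uvect has max |imag| = {max_imag:.3e}; expected a real vector"
        )
    return uvect.real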
Steps to reproduce the problem:
sbatch submit_frontier.sh
where submit_frontier.sh is:
#!/bin/bash
#SBATCH -A chm137
#SBATCH -t 0:29:59
#SBATCH -N 4
#SBATCH -c 32
#SBATCH -J RunSpinifel
#SBATCH -o RunSpinifel_o.%J
#SBATCH -e RunSpinifel_e.%J
set +x
t_start=$(date +%s)
# spinifel
source setup/env.sh
# Spinifel's env vars
export test_data_dir="/lustre/orion/proj-shared/chm137/demo23/data"
export out_dir="/lustre/orion/chm137/scratch/${USER}/${CI_PIPELINE_ID}/spinifel_output"
export USE_CUPY=1
# Create the output folder if it does not already exist.
if [ ! -d "${out_dir}" ]; then
  mkdir -p "${out_dir}"
fi
# Running Spinifel
FRONTIER_EXTRAS="runtime.use_pygpu=true"
N_IMAGES_PER_RANK=250
N_IMAGES_MAX=250
srun -N4 -n32 --gpus-per-task 1 python $DEBUG_FLAG -m spinifel --default-settings=test_mpi.toml --mode=mpi $FRONTIER_EXTRAS data.in_dir=${test_data_dir} data.name=3iyf_128x128pixels_2m.h5 runtime.N_images_per_rank=$N_IMAGES_PER_RANK algorithm.N_images_max=$N_IMAGES_MAX fsc.fsc_fraction_known_orientations=0
t_end=$(date +%s)
echo PSJobCompleted TotalElapsed $((t_end-t_start)) $t_start $t_end
Note that the same script runs successfully with srun -N2 -n16. @Seemah @eslaught @jpblaschke
Monarin Uervirojnangkoorn (fb5f9874) at 25 Aug 21:25
job script for generating demo23 data
Monarin Uervirojnangkoorn (fbd5f84b) at 23 Aug 23:17
fix memory problem
Monarin Uervirojnangkoorn (cdfc336d) at 23 Aug 18:31
faster random indices; add timing info.
Monarin Uervirojnangkoorn (9afedc9a) at 22 Aug 19:24
add script to combine different conformation h5 files
Monarin Uervirojnangkoorn (c43d4482) at 22 Aug 17:30
update skopi repo
Monarin Uervirojnangkoorn (feb8b607) at 18 Aug 22:57
test script for running legion demo23
The eds/spack installation was done on $SCRATCH. Running these commands on an interactive node produces the following error:
monarin@nid001584: source setup/env.sh
monarin@nid001584: srun -n1 -G1 python spinifel/tests/skopi_quaternion.py
srun: warning: can't run 1 processes on 2 nodes, setting nnodes to 1
MPICH ERROR [Rank 0] [job id 12528426.1] [Thu Jul 27 12:33:12 2023] [nid001584] - Abort(-1) (rank 0 in comm 0): MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
(Other MPI error)
aborting job:
MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
srun: error: nid001584: task 0: Exited with exit code 255
srun: Terminating StepId=12528426.1
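This is Cray MPICH aborting because MPICH_GPU_SUPPORT_ENABLED=1 is in effect while the GTL (GPU Transport Layer) library was never linked into the binary. Two common workarounds, sketched here (the module name is Perlmutter's A100 target and may not match this install):

# Option 1: turn off GPU-aware MPI for the run; device buffers are then
# staged through the host instead of handed to MPI directly.
export MPICH_GPU_SUPPORT_ENABLED=0
srun -n1 -G1 python spinifel/tests/skopi_quaternion.py

# Option 2: load the accelerator target module so the Cray compiler
# wrappers link GTL, then rebuild the MPI-linked components.
module load craype-accel-nvidia80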
Monarin Uervirojnangkoorn (b534a8c8) at 20 Jul 03:05
remove duplicate build for psana2