Trilinos · Issues · #1538 (Closed)

Created Jul 26, 2017 by James Willenbring (@jmwille)

Zoltan2: Possible scaling issue with MueLu/Z2multijagged for MueLu coarse level repartitioning?

Created by: pwxy

I observed the following scaling of "MueLu: Zoltan2Interface: Zoltan2 multijagged (sub, total, level=2)" (the time for Zoltan2::PartitioningProblem->solve()) on the LLNL IBM BG/Q platform for strong scaling of the Drekar Poisson test case. I started with a 2.4B-row matrix, but Zoltan2 repartitioning is not called until after two levels of MueLu aggregation (~700x reduction). So this is a case with few rows of the matrix per MPI process (probably not the standard usage of Z2 in most apps):

MPI "MueLu: Zoltan2Interface: Zoltan2 multijagged (sub, total, level=2)" time in sec 131072 2.10 262144 524288 12.25 1048576 26.7 1572864 66.9

I built the muelu driver on solo and ran with 256, 512, 1024, 2048, 4096, and 8192 MPI processes, and could see that Zoltan2 multijagged isn't scaling as well as hoped (though the problem is definitely easier to see at much larger scales).

This is strong scaling with "Matrix type: Brick3D" (27 nnz per row) and a problem size of 81M rows. Zoltan2 is not called until after two levels of coarsening (each coarsening reduces the rows by a factor of roughly 27), so, for example, in the 1024 MPI case the matrix Z2 gets has about 118,000 rows.
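As a rough sanity check on the sizes involved, here is a back-of-the-envelope calculation of my own, assuming the ~27x reduction per level quoted above (MueLu's actual aggregation gives ~118k rows at level 2):

```cpp
// Back-of-the-envelope only: estimate the level-2 matrix size handed to Zoltan2
// and the resulting rows per MPI rank, assuming each coarsening cuts the row
// count by roughly 27x. Actual MueLu aggregation sizes will differ somewhat.
#include <cstdio>

int main()
{
  const long long fineRows   = 433LL * 433LL * 433LL;     // ~81.2M rows (Brick3D, nx=ny=nz=433)
  const long long coarseRows = fineRows / (27LL * 27LL);  // after two ~27x coarsenings: ~111k rows

  const int ranks[] = {256, 512, 1024, 2048, 4096, 8192};
  for (int p : ranks)
    std::printf("%5d ranks: ~%lld coarse rows per rank\n", p, coarseRows / p);
  return 0;
}
```

At 8192 ranks this is only about 13 rows per rank, so the MJ time is presumably dominated by communication and setup rather than local partitioning work.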

Times are the max over MPI processes for "MueLu: Zoltan2Interface: Zoltan2 multijagged (sub, total, level=2)" (this is the time for Z2 MJ to construct the new partitioning; MueLu tells Z2 how many partitions are needed and MueLu migrates the data afterwards), for both "mj_migration_type"=0 and "mj_migration_type"=1. I performed 3 runs of each and report the lowest time below.

MPI     MJ=0     MJ=1
256     0.0060   0.0060
512     0.0091   0.0090
1024    0.0144   0.0142
2048    0.0247   0.0244
4096    0.0607   0.0605
8192    0.1091   0.1089
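For reference, per-timer maxima across MPI ranks like the ones above are what the Teuchos timer summary reports. A minimal sketch of registering and summarizing such a timer is below; whether the driver uses exactly this pattern is an assumption on my part.

```cpp
// Sketch only: print min/mean/max over MPI processes for registered Teuchos timers.
// The timer name is the one quoted in this report; everything else is illustrative.
#include <Teuchos_GlobalMPISession.hpp>
#include <Teuchos_TimeMonitor.hpp>
#include <iostream>

int main(int argc, char *argv[])
{
  Teuchos::GlobalMPISession mpiSession(&argc, &argv);

  {
    auto timer = Teuchos::TimeMonitor::getNewTimer(
        "MueLu: Zoltan2Interface: Zoltan2 multijagged (sub, total, level=2)");
    Teuchos::TimeMonitor tm(*timer);
    // ... work being timed (here: Zoltan2::PartitioningProblem::solve) ...
  }

  // Prints statistics (including the max over processes) for every registered timer.
  Teuchos::TimeMonitor::summarize(std::cout);
  return 0;
}
```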

So unless I screwed up, there doesn't seem to be much difference between "mj_migration_type"=0 and "mj_migration_type"=1.

On solo the only module change I made was "module swap intel intel/17.0.4.196".

cmake file attached; muelu xml file attached

Here are my input arguments to the muelu driver:

MueLu_Driver.exe --matrixType=Brick3D --nx=433 --ny=433 --nz=433 --mx=${xproc} --my=${yproc} --mz=${zproc} --xml="muelu_scaling.xml"

MPI     xproc   yproc   zproc
256     8       8       4
512     8       8       8
1024    16      8       8
2048    16      16      8
4096    16      16      16
8192    32      16      16

cmake_muelu_kokkos_serial_serrano_icc17.txt muelu_scaling.xml-z2mj_mj0_lev2minpp1024-c1000-t_exp-remap_rebpr-1vcyc11.txt
