KokkosKernels: Gauss-Seidel threaded setup performance issues with Ifpack2 and MueLu
Created by: pwxy
I am trying to use the KokkosKernels threaded Gauss-Seidel. I'm calling it through Ifpack2 ("MT Gauss-Seidel") as a smoother for MueLu, so the problem could be a bad interaction between KokkosKernels and Ifpack2 or MueLu.
I'm running Drekar on a single KNL node of mutrino with 1 MPI process, increasing the number of OMP threads from 1 to 64 (1 OMP thread per core). The Gauss-Seidel setup time is far larger than the solve time:
setup smoother (ifpack2 "MT Gauss-Seidel")
OMP threads | solve time (s) | GS setup time (s) |
---|---|---|
1 | 33.27 | 493.10 |
2 | 24.67 | 286.50 |
4 | 12.26 | 157.80 |
8 | 6.97 | 79.82 |
16 | 3.97 | 36.61 |
32 | 3.50 | 24.06 |
64 | 3.16 | 16.01 |
For reference, here are the times with the standard, non-threaded Gauss-Seidel. (If it really is non-threaded, why does the setup time go down as the number of OMP threads increases?)
setup smoother (ifpack2 "Gauss-Seidel")
OMP threads | solve time (s) | GS setup time (s) |
---|---|---|
1 | 27.04 | 0.36 |
2 | 25.13 | 0.21 |
4 | 24.09 | 0.13 |
8 | 23.58 | 0.09 |
16 | 23.38 | 0.06 |
32 | 23.32 | 0.05 |
64 | 23.33 | 0.05 |
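To put the two tables side by side, here is a minimal C++ sketch that computes the speedup and parallel efficiency of the "MT Gauss-Seidel" setup, plus its ratio to the standard "Gauss-Seidel" setup time. The timings are copied from the tables above; the `Sample` struct and the function names are hypothetical helpers for this illustration and are not part of Trilinos.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// One row of the timing tables (hypothetical helper type for this sketch).
struct Sample {
  int threads;
  double mt_setup;   // "MT Gauss-Seidel" GS setup time (s)
  double std_setup;  // "Gauss-Seidel" GS setup time (s)
};

// GS setup times copied from the two tables in this report.
std::vector<Sample> reported_samples() {
  return {{1, 493.10, 0.36}, {2, 286.50, 0.21}, {4, 157.80, 0.13},
          {8, 79.82, 0.09},  {16, 36.61, 0.06}, {32, 24.06, 0.05},
          {64, 16.01, 0.05}};
}

// Speedup relative to the 1-thread run: T(1) / T(n).
double speedup(double t1, double tn) {
  assert(tn > 0.0);
  return t1 / tn;
}

// Parallel efficiency: speedup divided by the thread count.
double efficiency(double t1, double tn, int threads) {
  assert(threads > 0);
  return speedup(t1, tn) / threads;
}

// Print one line per thread count; the first sample must be the 1-thread run.
void print_scaling(const std::vector<Sample>& s) {
  assert(!s.empty() && s.front().threads == 1);
  const double t1 = s.front().mt_setup;
  std::printf("threads  speedup  efficiency  MT/standard setup\n");
  for (const Sample& r : s) {
    std::printf("%7d  %7.2f  %10.2f  %17.1f\n", r.threads,
                speedup(t1, r.mt_setup),
                efficiency(t1, r.mt_setup, r.threads),
                r.mt_setup / r.std_setup);
  }
}
```

Calling `print_scaling(reported_samples())` shows that the MT setup itself scales reasonably (about 30.8 times faster on 64 threads than on 1), yet at 64 threads it is still roughly 320 times slower than the standard Gauss-Seidel setup.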
Drekar/Trilinos was built with Intel 17.0.2 and GNU 6.1.0 (Trilinos repo as of August 16, 2017).
I ran VTune on ellis for the 1 OMP thread case (the 493.10 s case above). According to VTune, all of the time is spent in the two Kokkos::parallel_for calls in KokkosKernels::Experimental::Util::symmetrize_graph_symbolic_hashmap (lines 1097 and 1139 of KokkosKernels_Utils.hpp), split roughly equally between the two calls.
The following is the stack trace from Ifpack2:
Ifpack2::Relaxation::initialize()
KokkosKernels::Experimental::Graph::gauss_seidel_symbolic
KokkosKernels::Experimental::Graph::Impl::GaussSeidel
KokkosKernels::Experimental::Util::symmetrize_graph_symbolic_hashmap (Kokkos::parallel_for on line 1097 and 1139)
Kokkos::parallel_for
Kokkos::parallel_for
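For readers unfamiliar with the hot spot: judging only by its name, symmetrize_graph_symbolic_hashmap appears to build the sparsity pattern of a symmetrized graph using hash-based insertion. The serial C++ sketch below illustrates that general idea for a square graph in CRS form. It is an assumption about what the operation does, not the KokkosKernels implementation; `CrsGraph` and `symmetrize_pattern` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical CRS graph container for this sketch (not a KokkosKernels type).
struct CrsGraph {
  std::vector<std::size_t> row_ptr;  // size num_rows + 1
  std::vector<int> col_idx;          // column indices, stored row by row
};

// Serial sketch: return the pattern of A + A^T for a square graph A.
CrsGraph symmetrize_pattern(const CrsGraph& g) {
  assert(!g.row_ptr.empty());
  const std::size_t n = g.row_ptr.size() - 1;

  // One hash set per row collects unique column indices.
  std::vector<std::unordered_set<int>> rows(n);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t k = g.row_ptr[i]; k < g.row_ptr[i + 1]; ++k) {
      const int j = g.col_idx[k];
      assert(j >= 0 && static_cast<std::size_t>(j) < n);
      rows[i].insert(j);                    // original edge (i, j)
      rows[j].insert(static_cast<int>(i));  // mirrored edge (j, i)
    }
  }

  // Convert the per-row sets back to CRS with sorted column indices.
  CrsGraph out;
  out.row_ptr.assign(n + 1, 0);
  for (std::size_t i = 0; i < n; ++i) {
    out.row_ptr[i + 1] = out.row_ptr[i] + rows[i].size();
  }
  out.col_idx.reserve(out.row_ptr[n]);
  for (std::size_t i = 0; i < n; ++i) {
    std::vector<int> sorted(rows[i].begin(), rows[i].end());
    std::sort(sorted.begin(), sorted.end());
    out.col_idx.insert(out.col_idx.end(), sorted.begin(), sorted.end());
  }
  return out;
}
```

Work of this kind is proportional to the number of graph entries, so a cost on the order of hundreds of seconds would be surprising for a setup step, which is why the two parallel_for calls look suspicious.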
Edit (@aprokop): formatting