Tpetra MV putScalar does not thread scale and is less performant than scaling a MV by a constant
Created by: jjellio
This is an issue to track a performance problem identified on a Cray/KNL machine.
TLDR: putScalar on a block of vectors (a multivector) takes over twice as long as computing an inner product or vector update on the same block. putScalar also takes substantially longer than scale, although the two should perform and scale similarly.
I've constructed a small slide deck with details (email me if you would like to see it).
To perform the test, I used Belos' MultiVecTraits and profiled MvInit, MvScale, Update, and Inner Product, gathering data for block sizes from 1 to 100. These operations are effectively the building blocks of a GMRES solver.
Timers have the format: `MVT::Operation::<block_size>`
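Sorting these labels as plain strings puts block size 100 before 15, so comparing across block sizes is easier once the size is parsed out as a number. A minimal sketch, assuming the block size is whatever follows the last `::` in the label (as in the tables); `TimerLabel`, `parseTimerLabel`, and `sortTimerLabels` are hypothetical helpers, not part of Trilinos:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper type (not part of Trilinos): a timer label split into
// its operation name and numeric block size.
struct TimerLabel {
  std::string operation;  // e.g. "MVT::MvInit"
  int blockSize;          // e.g. 100
};

// Split a label such as "MVT::MvInit::100" at its last "::".
TimerLabel parseTimerLabel(const std::string& label) {
  const std::string::size_type pos = label.rfind("::");
  assert(pos != std::string::npos && "label must contain '::'");
  return TimerLabel{label.substr(0, pos), std::stoi(label.substr(pos + 2))};
}

// Order labels by operation name, then by block size compared as a number.
std::vector<std::string> sortTimerLabels(std::vector<std::string> labels) {
  std::sort(labels.begin(), labels.end(),
            [](const std::string& a, const std::string& b) -> bool {
              const TimerLabel la = parseTimerLabel(a);
              const TimerLabel lb = parseTimerLabel(b);
              if (la.operation != lb.operation) {
                return la.operation < lb.operation;
              }
              return la.blockSize < lb.blockSize;
            });
  return labels;
}
```

Passing the labels `MVT::MvInit::100`, `MVT::MvInit::15`, and `MVT::MvInit::2` to `sortTimerLabels` returns them in the order 2, 15, 100.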
There seem to be two issues:
- putScalar is much slower than scale.
- putScalar scales more poorly when more hardware threads per core are enabled.
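To illustrate why the two operations would be expected to cost about the same, here is a standalone sketch in plain C++. It is not Tpetra's or Kokkos' implementation: `Block`, `fillBlock`, `scaleBlock`, and `secondsFor` are hypothetical stand-ins. The fill writes every entry once and the scale reads and writes every entry once, so on the same block the fill should not be the slower of the two.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a block of vectors: numRows x numCols values
// stored contiguously. This is NOT a Tpetra::MultiVector.
struct Block {
  std::size_t numRows;
  std::size_t numCols;
  std::vector<double> data;

  Block(std::size_t rows, std::size_t cols)
      : numRows(rows), numCols(cols), data(rows * cols, 1.0) {}
};

// Analogue of putScalar: one write per entry.
void fillBlock(Block& b, double alpha) {
  for (double& x : b.data) x = alpha;
}

// Analogue of scale: one read and one write per entry.
void scaleBlock(Block& b, double alpha) {
  for (double& x : b.data) x *= alpha;
}

// Wall-clock seconds spent calling op() `repeats` times.
template <class Op>
double secondsFor(Op op, int repeats) {
  assert(repeats > 0);
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeats; ++i) op();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

Timing `fillBlock` and `scaleBlock` with `secondsFor` on the same `Block`, for the same range of column counts, gives a baseline against which the putScalar and scale timers can be compared.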
Under a profiler, the time in putScalar is attributed to Kokkos::deep_copy(). The timings below are from production runs.
For example, running 64 processes per KNL node with 1 HT (hardware thread) per core enabled (64x1x1):
Operation | MinOverProcs |
---|---|
MVT::InnerProduct::100 | 4.042 |
MVT::MVScale::100 | 0.000998 |
MVT::MvInit::100 | 9.146 |
MVT::Update::100 | 4.101 |
Running 64 processes per KNL node with 4 HT per core enabled (64x1x4):
Operation | MinOverProcs |
---|---|
MVT::InnerProduct::100 | 4.122 |
MVT::MVScale::100 | 0.001268 |
MVT::MvInit::100 | 55.53 |
MVT::Update::100 | 4.138 |
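The two summary tables above can be reduced to a few ratios, which is where the "over twice as long" figure comes from. A small sketch of that arithmetic, using only the block-size-100 values from those tables; the struct, constant, and function names are hypothetical:

```cpp
#include <cassert>

// Timings at block size 100, copied from the two summary tables above.
// The struct and constant names are hypothetical.
struct Block100Timings {
  double innerProduct;
  double mvScale;
  double mvInit;
  double update;
};

constexpr Block100Timings kOneHT = {4.042, 0.000998, 9.146, 4.101};   // 64x1x1
constexpr Block100Timings kFourHT = {4.122, 0.001268, 55.53, 4.138};  // 64x1x4

// Ratio of two timings; the result is unitless.
double timeRatio(double numerator, double denominator) {
  assert(denominator > 0.0);
  return numerator / denominator;
}
```

With these numbers, `timeRatio(kOneHT.mvInit, kOneHT.innerProduct)` is about 2.26 and `timeRatio(kFourHT.mvInit, kOneHT.mvInit)` is about 6.07, while the same 4 HT to 1 HT ratio for InnerProduct and Update stays within about 2%.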
(full 64x1x1 data)
Operation | MinOverProcs |
---|---|
MVT::InnerProduct::1 | 0.09228 |
MVT::InnerProduct::2 | 0.1694 |
MVT::InnerProduct::3 | 0.1989 |
MVT::InnerProduct::4 | 0.1983 |
MVT::InnerProduct::5 | 0.2641 |
MVT::InnerProduct::6 | 0.2976 |
MVT::InnerProduct::7 | 0.3759 |
MVT::InnerProduct::8 | 0.3371 |
MVT::InnerProduct::9 | 0.4084 |
MVT::InnerProduct::10 | 0.4413 |
MVT::InnerProduct::15 | 0.6721 |
MVT::InnerProduct::20 | 0.8192 |
MVT::InnerProduct::25 | 1.028 |
MVT::InnerProduct::30 | 1.244 |
MVT::InnerProduct::35 | 1.445 |
MVT::InnerProduct::40 | 1.593 |
MVT::InnerProduct::45 | 1.823 |
MVT::InnerProduct::50 | 2 |
MVT::InnerProduct::60 | 2.394 |
MVT::InnerProduct::70 | 2.813 |
MVT::InnerProduct::80 | 3.189 |
MVT::InnerProduct::90 | 3.625 |
MVT::InnerProduct::100 | 4.042 |
------------------------ | ----------- |
MVT::MVScale::1 | 0.0008667 |
MVT::MVScale::2 | 0.0009308 |
MVT::MVScale::3 | 0.0009527 |
MVT::MVScale::4 | 0.0009472 |
MVT::MVScale::5 | 0.000968 |
MVT::MVScale::6 | 0.0009365 |
MVT::MVScale::7 | 0.0009186 |
MVT::MVScale::8 | 0.0009317 |
MVT::MVScale::9 | 0.0009496 |
MVT::MVScale::10 | 0.0009456 |
MVT::MVScale::15 | 0.000947 |
MVT::MVScale::20 | 0.0009654 |
MVT::MVScale::25 | 0.0009418 |
MVT::MVScale::30 | 0.001003 |
MVT::MVScale::35 | 0.000972 |
MVT::MVScale::40 | 0.000983 |
MVT::MVScale::45 | 0.0009711 |
MVT::MVScale::50 | 0.0009766 |
MVT::MVScale::60 | 0.0009604 |
MVT::MVScale::70 | 0.0009918 |
MVT::MVScale::80 | 0.001002 |
MVT::MVScale::90 | 0.001032 |
MVT::MVScale::100 | 0.000998 |
------------------------ | ----------- |
MVT::MvInit::1 | 0.06442 |
MVT::MvInit::2 | 0.2652 |
MVT::MvInit::3 | 0.3155 |
MVT::MvInit::4 | 0.4194 |
MVT::MvInit::5 | 0.4686 |
MVT::MvInit::6 | 0.5292 |
MVT::MvInit::7 | 0.5619 |
MVT::MvInit::8 | 0.612 |
MVT::MvInit::9 | 0.645 |
MVT::MvInit::10 | 0.6943 |
MVT::MvInit::15 | 0.9989 |
MVT::MvInit::20 | 1.32 |
MVT::MvInit::25 | 1.631 |
MVT::MvInit::30 | 1.923 |
MVT::MvInit::35 | 2.285 |
MVT::MvInit::40 | 2.645 |
MVT::MvInit::45 | 3.044 |
MVT::MvInit::50 | 4.089 |
MVT::MvInit::60 | 5.047 |
MVT::MvInit::70 | 6.44 |
MVT::MvInit::80 | 7.344 |
MVT::MvInit::90 | 8.386 |
MVT::MvInit::100 | 9.146 |
------------------------ | ----------- |
MVT::Update::1 | 0.1297 |
MVT::Update::2 | 0.1138 |
MVT::Update::3 | 0.1424 |
MVT::Update::4 | 0.1721 |
MVT::Update::5 | 0.2438 |
MVT::Update::6 | 0.271 |
MVT::Update::7 | 0.2982 |
MVT::Update::8 | 0.331 |
MVT::Update::9 | 0.4017 |
MVT::Update::10 | 0.4311 |
MVT::Update::15 | 0.618 |
MVT::Update::20 | 0.8033 |
MVT::Update::25 | 1.031 |
MVT::Update::30 | 1.226 |
MVT::Update::35 | 1.406 |
MVT::Update::40 | 1.608 |
MVT::Update::45 | 1.838 |
MVT::Update::50 | 2.025 |
MVT::Update::60 | 2.401 |
MVT::Update::70 | 2.853 |
MVT::Update::80 | 3.232 |
MVT::Update::90 | 3.692 |
MVT::Update::100 | 4.101 |
(full 64x1x4 data)
Operation | MinOverProcs |
---|---|
MVT::InnerProduct::1 | 0.101 |
MVT::InnerProduct::2 | 0.2098 |
MVT::InnerProduct::3 | 0.2041 |
MVT::InnerProduct::4 | 0.2042 |
MVT::InnerProduct::5 | 0.265 |
MVT::InnerProduct::6 | 0.2972 |
MVT::InnerProduct::7 | 0.3803 |
MVT::InnerProduct::8 | 0.3409 |
MVT::InnerProduct::9 | 0.4084 |
MVT::InnerProduct::10 | 0.4407 |
MVT::InnerProduct::15 | 0.6858 |
MVT::InnerProduct::20 | 0.8349 |
MVT::InnerProduct::25 | 1.042 |
MVT::InnerProduct::30 | 1.246 |
MVT::InnerProduct::35 | 1.467 |
MVT::InnerProduct::40 | 1.618 |
MVT::InnerProduct::45 | 1.865 |
MVT::InnerProduct::50 | 2.048 |
MVT::InnerProduct::60 | 2.446 |
MVT::InnerProduct::70 | 2.871 |
MVT::InnerProduct::80 | 3.243 |
MVT::InnerProduct::90 | 3.687 |
MVT::InnerProduct::100 | 4.122 |
------------------------ | ----------- |
MVT::MVScale::1 | 0.001116 |
MVT::MVScale::2 | 0.001169 |
MVT::MVScale::3 | 0.00119 |
MVT::MVScale::4 | 0.001174 |
MVT::MVScale::5 | 0.00122 |
MVT::MVScale::6 | 0.00115 |
MVT::MVScale::7 | 0.001136 |
MVT::MVScale::8 | 0.00117 |
MVT::MVScale::9 | 0.001175 |
MVT::MVScale::10 | 0.001141 |
MVT::MVScale::15 | 0.001144 |
MVT::MVScale::20 | 0.001157 |
MVT::MVScale::25 | 0.001182 |
MVT::MVScale::30 | 0.001184 |
MVT::MVScale::35 | 0.001155 |
MVT::MVScale::40 | 0.001208 |
MVT::MVScale::45 | 0.001172 |
MVT::MVScale::50 | 0.00117 |
MVT::MVScale::60 | 0.001222 |
MVT::MVScale::70 | 0.001207 |
MVT::MVScale::80 | 0.001201 |
MVT::MVScale::90 | 0.001189 |
MVT::MVScale::100 | 0.001268 |
------------------------ | ----------- |
MVT::MvInit::1 | 0.06317 |
MVT::MvInit::2 | 0.2715 |
MVT::MvInit::3 | 0.3176 |
MVT::MvInit::4 | 0.3524 |
MVT::MvInit::5 | 0.4013 |
MVT::MvInit::6 | 0.4454 |
MVT::MvInit::7 | 0.4939 |
MVT::MvInit::8 | 0.5283 |
MVT::MvInit::9 | 0.5911 |
MVT::MvInit::10 | 0.6489 |
MVT::MvInit::15 | 0.9798 |
MVT::MvInit::20 | 1.291 |
MVT::MvInit::25 | 1.629 |
MVT::MvInit::30 | 1.975 |
MVT::MvInit::35 | 2.268 |
MVT::MvInit::40 | 4.683 |
MVT::MvInit::45 | 13.34 |
MVT::MvInit::50 | 22.06 |
MVT::MvInit::60 | 32.64 |
MVT::MvInit::70 | 38.67 |
MVT::MvInit::80 | 44.46 |
MVT::MvInit::90 | 50.02 |
MVT::MvInit::100 | 55.53 |
------------------------ | ----------- |
MVT::Update::1 | 0.0894 |
MVT::Update::2 | 0.1286 |
MVT::Update::3 | 0.1589 |
MVT::Update::4 | 0.1867 |
MVT::Update::5 | 0.2577 |
MVT::Update::6 | 0.2854 |
MVT::Update::7 | 0.3129 |
MVT::Update::8 | 0.3438 |
MVT::Update::9 | 0.4171 |
MVT::Update::10 | 0.4455 |
MVT::Update::15 | 0.6346 |
MVT::Update::20 | 0.8219 |
MVT::Update::25 | 1.045 |
MVT::Update::30 | 1.236 |
MVT::Update::35 | 1.425 |
MVT::Update::40 | 1.615 |
MVT::Update::45 | 1.846 |
MVT::Update::50 | 2.049 |
MVT::Update::60 | 2.449 |
MVT::Update::70 | 2.868 |
MVT::Update::80 | 3.27 |
MVT::Update::90 | 3.713 |
MVT::Update::100 | 4.138 |
@trilinos/tpetra @trilinos/kokkos-kernels @mhoemmen @crtrott @tjfulle