Tpetra: Avoid atomic ops for unpack if no duplicate LIDs

Created by: mhoemmen

@trilinos/tpetra Epic: #796. This story also matters to epics other than #796.

Thread-parallel implementations of Tpetra::DistObject::unpackAndCombine(New) do not need to do atomic updates if the result LIDs have no duplicates, because then at most one thread will ever modify any given entry of the destination DistObject. "Result LIDs have no duplicates" is a property of the Import / Export object, so the Import / Export object should compute this once at construction time and cache it for reuse.
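To make the idea concrete, here is a minimal sketch in standard C++, not Tpetra code. `UnpackPlan`, `unpackAndAdd`, `atomicAdd`, and `plainAdd` are hypothetical names invented for illustration. The destination uses `std::atomic<double>` storage only so that both code paths are well-defined standard C++; a real kernel would work on the DistObject's own data. The loops are serial, with each iteration standing in for the work of one thread.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the part of an Import / Export object that
// records where received values go. This is not Tpetra's actual class.
struct UnpackPlan {
  std::vector<int> remoteLIDs;  // destination local index of each received value
  bool lidsAreUnique;           // computed once, at construction, then reused

  explicit UnpackPlan(std::vector<int> lids)
    : remoteLIDs(std::move(lids)), lidsAreUnique(computeUnique(remoteLIDs)) {}

  // Sort a copy and look for equal neighbors: O(n log n), paid only once.
  static bool computeUnique(std::vector<int> lids) {
    std::sort(lids.begin(), lids.end());
    return std::adjacent_find(lids.begin(), lids.end()) == lids.end();
  }
};

// Atomic read-modify-write: correct even if several threads hit one entry.
inline void atomicAdd(std::atomic<double>& dst, double v) {
  double old = dst.load(std::memory_order_relaxed);
  while (!dst.compare_exchange_weak(old, old + v, std::memory_order_relaxed)) {
    // The failed exchange refreshed 'old'; retry with the new value.
  }
}

// Separate load and store: no atomic read-modify-write, so it is only
// correct if no other thread updates this entry during the unpack.
inline void plainAdd(std::atomic<double>& dst, double v) {
  dst.store(dst.load(std::memory_order_relaxed) + v, std::memory_order_relaxed);
}

// ADD-combine received values into dst, choosing the update by the cached flag.
void unpackAndAdd(const UnpackPlan& plan, const std::vector<double>& imports,
                  std::vector<std::atomic<double>>& dst) {
  assert(imports.size() == plan.remoteLIDs.size());
  if (plan.lidsAreUnique) {
    for (std::size_t k = 0; k < imports.size(); ++k) {
      plainAdd(dst[plan.remoteLIDs[k]], imports[k]);
    }
  } else {
    for (std::size_t k = 0; k < imports.size(); ++k) {
      atomicAdd(dst[plan.remoteLIDs[k]], imports[k]);
    }
  }
}
```

The duplicate check runs once per plan, so its cost is amortized over every unpack that reuses the same plan. Both branches produce the same result; the flag only decides whether the kernel pays for atomic read-modify-write operations.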

This matters because atomic updates have a run-time cost, even if there is no contention between threads. Sparse matrix-vector multiply does an Import on the input MultiVector. In the common case, the result LIDs in this Import should have no duplicates. (This is because the column Map is constructed that way, if users let CrsGraph or CrsMatrix construct the column Map.) Thus, atomic updates add unnecessary run time to this important kernel.
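As a rough illustration of why a column Map built from the graph yields duplicate-free result LIDs, here is a simplified sketch in standard C++. `ColumnMap`, `makeColumnMap`, and `importRemoteLIDs` are hypothetical names, and the ordering rule used here (owned indices first, then remote indices in sorted order) is an assumption made for the example; the actual CrsGraph logic is more involved. The point is only that each global index gets exactly one local index, so each remote entry is written exactly once.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical, simplified column Map for one process.
struct ColumnMap {
  std::vector<long long> gids;  // LID -> GID; each GID appears exactly once
  std::size_t numLocal;         // LIDs [0, numLocal) are locally owned
};

// ownedGIDs: global indices this process owns.
// graphColGIDs: every column index in its local rows, with repeats.
ColumnMap makeColumnMap(const std::set<long long>& ownedGIDs,
                        const std::vector<long long>& graphColGIDs) {
  std::set<long long> local, remote;  // sets drop repeated column indices
  for (long long g : graphColGIDs) {
    (ownedGIDs.count(g) ? local : remote).insert(g);
  }
  ColumnMap cm;
  cm.gids.assign(local.begin(), local.end());
  cm.numLocal = cm.gids.size();
  cm.gids.insert(cm.gids.end(), remote.begin(), remote.end());
  return cm;
}

// Destination LIDs that the Import fills with received data: one per
// remote GID, hence distinct by construction.
std::vector<int> importRemoteLIDs(const ColumnMap& cm) {
  std::vector<int> lids;
  for (std::size_t lid = cm.numLocal; lid < cm.gids.size(); ++lid) {
    lids.push_back(static_cast<int>(lid));
  }
  return lids;
}
```

For example, with owned indices {0, 1, 2} and graph column indices {0, 1, 5, 1, 2, 7, 5}, this sketch assigns LIDs 0 to 2 to the owned entries and LIDs 3 and 4 to the remote indices 5 and 7. A column Map supplied by the user carries no such guarantee in this sketch, which is why the property would need to be checked rather than assumed.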

@jjellio and @tjfulle might be interested in this. I would like to fix this for MultiVector in a way that neatly encapsulates its unpack kernels (and ideally also its pack kernels). Fixing this issue should have the side effect of improving MPI-only performance, which is an important use case for many customers (who do not have hardware that requires MPI + threads for good performance).