Teuchos::Time screwing up OpenMP/Threads performance
Created by: etphipp
I recently wrote a performance test in Sacado that measures the performance of a Kokkos kernel using Sacado. I timed the kernel with Teuchos::Time and saw very poor performance on CPUs with both the OpenMP and Threads execution spaces in Kokkos. On a single Sandy Bridge socket I expected the kernel to reach about 9 GFLOP/s, but with 8 threads I saw only about 1 GFLOP/s. Furthermore, with the Serial execution space I saw about 3 GFLOP/s, so the threaded runs were slower than the serial one. However, I did see the expected level of throughput on KNC (using OpenMP) and on Cuda.

Purely by accident, I changed the timer from Teuchos::Time to Kokkos::Impl::Timer, and my CPU performance suddenly jumped to about what I expected. This behavior was reproducible on two different machines (my local development machine and Shannon) and with multiple compilers (GCC 4.9.3 and Intel 15.0.3).

Has anyone else seen this before, or does anyone have an idea why Teuchos::Time might be hurting performance on CPUs with OpenMP or Threads? Teuchos::Time doesn’t have any threaded code in it, does it? I’m a little concerned about the impact this might have, given the pervasive use of Teuchos::TimeMonitor throughout Trilinos.
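
One way to narrow down a discrepancy like this is to cross-check the kernel time against a clock that belongs to neither library. The following is only a minimal sketch under stated assumptions: `axpy` is a hypothetical stand-in kernel (not the Sacado/Kokkos kernel from this report), `gflops` and `measure_gflops` are illustrative helper names, and the timing uses the standard `std::chrono::steady_clock` rather than Teuchos::Time or Kokkos::Impl::Timer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Convert a floating-point operation count and an elapsed time to GFLOP/s.
double gflops(double flops, double seconds) {
  return flops / seconds / 1.0e9;
}

// Hypothetical stand-in kernel: y[i] += a * x[i] (2 flops per element).
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] += a * x[i];
  }
}

// Time `reps` kernel calls with a monotonic standard-library clock and
// report the achieved throughput in GFLOP/s.
double measure_gflops(std::size_t n, int reps) {
  assert(reps > 0);
  std::vector<double> x(n, 1.0), y(n, 0.0);

  axpy(2.0, x, y);  // warm-up call, excluded from the timed region

  const auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) {
    axpy(2.0, x, y);
  }
  const auto t1 = std::chrono::steady_clock::now();

  // Read one result through a volatile so the compiler cannot discard the work.
  volatile double sink = y[0];
  (void)sink;

  const double seconds = std::chrono::duration<double>(t1 - t0).count();
  return gflops(2.0 * static_cast<double>(n) * reps, seconds);
}
```

Calling, for example, `measure_gflops(1 << 20, 50)` yields a throughput figure that does not depend on either timer class. If the same timed region reports very different throughput under Teuchos::Time, that would point at the timer (or something it triggers) rather than at the kernel itself.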