Performance of Intrepid2 for PIC
Created by: bathmatt
@trilinos/intrepid2 @rppawlo
In PIC methods we have Lagrangian particles that migrate throughout the system. Each particle has a position x_i and velocity v_i, from which we derive a current J(x) = sum_i q v_i delta(x - x_i). We integrate J as the source term for a Maxwell solve. Since J is just a sum of delta functions, this integration is nothing more than an evaluation of the basis functions at the points x_i.
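To make that reduction concrete, here is a minimal 1D sketch (not Intrepid2 API; the function name and P1 hat-function basis are illustrative assumptions) showing that depositing the weak-form integral of a delta-function current just means evaluating the basis at each particle position and accumulating:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Deposit particle currents onto a 1D P1 (hat-function) mesh with uniform
// spacing h. Because J is a sum of delta functions, the weak-form integral
//   \int phi_a(x) J(x) dx   reduces to   sum_i q v_i phi_a(x_i),
// i.e. a basis evaluation at each particle position. Hypothetical helper.
std::vector<double> depositCurrent1D(const std::vector<double>& x,  // particle positions
                                     const std::vector<double>& v,  // particle velocities
                                     double q, double h, int nNodes) {
  std::vector<double> J(nNodes, 0.0);
  for (std::size_t i = 0; i < x.size(); ++i) {
    int cell = static_cast<int>(x[i] / h);   // owning element
    double xi = (x[i] - cell * h) / h;       // local coordinate in [0,1]
    J[cell]     += q * v[i] * (1.0 - xi);    // left hat function at x_i
    J[cell + 1] += q * v[i] * xi;            // right hat function at x_i
  }
  return J;
}
```

The inner loop is nothing but "evaluate basis at x_i, scale by q*v_i, scatter" — which is why a fast point-wise basis evaluation is the whole game.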
The current typically uses the HCurl basis, but for electrostatics we use the HGrad basis.
What we need is a fast way to perform a basis evaluation.
For example, if one wants the tet HCurl basis values at order 2 or higher, one calls
Basis_HCURL_TET_In_FEM<Scalar, ArrayScalar>::getValues
which creates a temp array ArrayScalar phisCur( scalarBigN , numPts ) and calls into a member basis Basis_HGRAD_TET_Cn_FEM_ORTH<Scalar, ArrayScalar> Phis_,
whose constructor creates more temps.
It then calls Phis_.getValues( phisCur , inputPoints , OPERATOR_VALUE ),
which creates field containers of SFads, e.g. FieldContainer<Sacado::Fad::SFad<Scalar,3> > dZ( z.dimension(0) , z.dimension(1) ); and
FieldContainer<Sacado::Fad::SFad<Scalar,3> > dResult( card , np );
and calls tabulate recursively.
Once this is all done, it multiplies the output by a local coefficient array coeffs_.
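The net effect of that whole chain is small: the nodal basis values at a point are a dense product of the coefficient array with the orthogonal-basis values tabulated at that point. A minimal sketch of that final step (flat arrays; names and layout are assumptions, not the Intrepid2 internals):

```cpp
// Sketch of the last step of the chain: nodal basis values at a point are
//   phi_a(x) = sum_m coeffs(a,m) * P_m(x),
// where P_m are the orthogonal (Dubiner-type) polynomials tabulated by the
// HGRAD_TET_Cn_FEM_ORTH basis. Row-major layout assumed for illustration.
void applyCoeffs(const double* coeffs,  // card x nModal coefficient matrix
                 const double* modal,   // nModal orthogonal values P_m(x)
                 double* nodal,         // card outputs phi_a(x)
                 int card, int nModal) {
  for (int a = 0; a < card; ++a) {
    double s = 0.0;
    for (int m = 0; m < nModal; ++m)
      s += coeffs[a * nModal + m] * modal[m];
    nodal[a] = s;
  }
}
```

None of this needs heap-allocated temporaries for a single evaluation point.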
This chain skips some steps (I think there may be a human sacrifice in the code as well), but clearly this isn't a performant call stack if I just want to know the basis values at point (3.1, 4.5)...
What I need is a shallow call stack for this which doesn't allocate any memory and can be called on GPUs.
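The shape of interface I'm after looks something like this sketch: a free function using only stack variables, so it could be marked KOKKOS_INLINE_FUNCTION and called per-particle on device. Shown here for the trivial linear (P1) HGrad tet basis; this is an illustrative assumption of what the API could look like, not existing Intrepid2 code:

```cpp
// Sketch of an allocation-free, shallow-call-stack basis evaluation.
// Linear (P1) HGrad tet basis values at a reference point (x,y,z); higher
// orders would use fixed-size stack arrays the same way. In practice this
// would be marked KOKKOS_INLINE_FUNCTION for device use.
inline void hgradTetP1Values(double x, double y, double z, double vals[4]) {
  vals[0] = 1.0 - x - y - z;  // vertex 0
  vals[1] = x;                // vertex 1
  vals[2] = y;                // vertex 2
  vals[3] = z;                // vertex 3
}
```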
What I've done in the past is make local versions of this call stack and optimize them with stack variables and other tricks. That gave me a ~10x improvement on CPU and allowed me to run on GPU.