Panzer/Intrepid2/Sacado: getSideNormals seg faults on cuda with sacado dfad hierarchic parallelism
Created by: rppawlo
I'm working on transitioning panzer kernels to use the sacado hierarchic parallelism. Only a handful of panzer tests don't pass on cuda at this point. The first issue is in a call to Intrepid2::CellTools<>::getPhysicalSideNormals in a unit test. Wondering if I could get some help debugging this from sacado and/or intrepid teams?
This test has always passed on normal builds, but now with the build flag enabled:
-DSacado_ENABLE_HIERARCHICAL_DFAD=ON \
it is failing. This is for CUDA build on waterman. This flag really results in changing the layout of Fad Views. To reproduce, you will need the branch "panzer-hierarchic" on my fork of Trilinos (github.com:rppawlo/Trilinos).
Error occurs in allocation of a DynRankView:
0. normals_JacobianType_test2d_UnitTest ... terminate called after throwing an instance of 'std::runtime_error'
what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered ../packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Backtrace is:
#0 0x00007fffea7efaf0 in raise () from /lib64/libc.so.6
#1 0x00007fffea7f1e6c in abort () from /lib64/libc.so.6
#2 0x00007fffeaa90774 in __gnu_cxx::__verbose_terminate_handler () at ../../.././libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007fffeaa8b504 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../.././libstdc++-v3/libsupc++/eh_terminate.cc:47
#4 0x00007fffeaa89928 in __cxa_call_terminate (ue_header=0x1bb64090) at ../../.././libstdc++-v3/libsupc++/eh_call.cc:54
#5 0x00007fffeaa8aaec in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=<optimized out>, exception_class=<optimized out>, ue_header=0x1bb64090, context=0x7fffffff5da0)
at ../../.././libstdc++-v3/libsupc++/eh_personality.cc:676
#6 0x00007fffea9ac084 in _Unwind_RaiseException_Phase2 (exc=0x1bb64090, context=0x7fffffff5da0) at ../.././libgcc/unwind.inc:62
#7 0x00007fffea9acc04 in _Unwind_Resume (exc=0x1bb64090) at ../.././libgcc/unwind.inc:230
#8 0x00000000108d31c8 in Kokkos::Impl::cuda_internal_error_throw (e=cudaErrorIllegalAddress, name=0x10f550e8 "cudaDeviceSynchronize()", file=0x10f550b0 "../packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp",
line=119) at ../packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:124
#9 0x0000000010018d4c in Kokkos::Impl::cuda_internal_safe_call (e=cudaErrorIllegalAddress, name=0x10f550e8 "cudaDeviceSynchronize()", file=0x10f550b0 "../packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp",
line=119) at ../packages/kokkos/core/src/Cuda/Kokkos_Cuda_Error.hpp:58
#10 0x00000000108d2fd0 in Kokkos::Impl::cuda_device_synchronize () at ../packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
#11 0x00000000108d5a80 in Kokkos::Cuda::fence () at ../packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:820
#12 0x00000000108d01fc in Kokkos::CudaUVMSpace::deallocate (this=0x1bb6a018, arg_alloc_ptr=0x7ff340000000) at ../packages/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:220
#13 0x00000000108d0e20 in Kokkos::Impl::SharedAllocationRecord<Kokkos::CudaUVMSpace, void>::~SharedAllocationRecord (this=0x1bb69fd0, __in_chrg=<optimized out>)
at ../packages/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:400
#14 0x000000001007a15c in Kokkos::Impl::SharedAllocationRecord<Kokkos::CudaUVMSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Cuda, double, true> >::~SharedAllocationRecord (this=0x1bb69fd0, __in_chrg=<optimized out>)
at ../packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:214
#15 0x000000001007a1ac in Kokkos::Impl::SharedAllocationRecord<Kokkos::CudaUVMSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Cuda, double, true> >::~SharedAllocationRecord (this=0x1bb69fd0, __in_chrg=<optimized out>)
at ../packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:214
#16 0x000000001007a248 in Kokkos::Impl::(anonymous namespace)::deallocate<Kokkos::CudaUVMSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Cuda, double, true> > (record_ptr=0x1bb69fd0)
at ../packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:198
#17 0x00000000108c65ec in Kokkos::Impl::SharedAllocationRecord<void, void>::decrement (arg_record=0x1bb69fd0) at ../packages/kokkos/core/src/impl/Kokkos_SharedAlloc.cpp:262
#18 0x0000000010295b74 in ~SharedAllocationTracker (this=0x7fffffff73f8, __in_chrg=<optimized out>) at ../packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:358
#19 Kokkos::DynRankView<Sacado::Fad::DFad<double>, Kokkos::Cuda>::~DynRankView (this=0x7fffffff73f8, __in_chrg=<optimized out>) at ../packages/kokkos/containers/src/Kokkos_DynRankView.hpp:923
#20 0x00000000103fabf8 in Intrepid2::CellTools<Kokkos::Cuda>::getPhysicalEdgeTangents<Sacado::Fad::DFad<double>, Kokkos::Cuda, double, Kokkos::LayoutLeft, Kokkos::Cuda> (edgeTangents=..., worksetJacobians=...,
worksetEdgeOrd=1, parentCell=...) at ../packages/intrepid2/src/Cell/Intrepid2_CellToolsDefNodeInfo.hpp:573
#21 0x00000000103f69b0 in Intrepid2::CellTools<Kokkos::Cuda>::getPhysicalSideNormals<Sacado::Fad::DFad<double>, Kokkos::LayoutContiguous<Kokkos::LayoutLeft, 32u>, Kokkos::Cuda, double, Kokkos::LayoutLeft, Kokkos::Cuda> (sideNormals=..., worksetJacobians=..., worksetSideOrd=1, parentCell=...) at ../packages/intrepid2/src/Cell/Intrepid2_CellToolsDefNodeInfo.hpp:673
#22 0x00000000103f3ad0 in panzer::Normals<panzer::Traits::Jacobian, panzer::Traits>::evaluateFields (this=0x1bb65030, workset=...) at ../packages/panzer/disc-fe/src/evaluators/Panzer_Normals_impl.hpp:105
#23 0x00000000100a7f70 in PHX::DagManager<panzer::Traits>::evaluateFields (this=0x1bb684c8, d=...) at ../packages/phalanx/src/Phalanx_DAG_Manager_Def.hpp:450
#24 0x00000000100a404c in PHX::EvaluationContainer<panzer::Traits::Jacobian, panzer::Traits>::evaluateFields (this=0x1bb684c0, d=...) at ../packages/phalanx/src/Phalanx_EvaluationContainer_Def.hpp:193
#25 0x0000000010030158 in PHX::FieldManager<panzer::Traits>::evaluateFields<panzer::Traits::Jacobian> (this=0x1bb674e0, d=...) at ../packages/phalanx/src/Phalanx_FieldManager_Def.hpp:327
#26 0x00000000100208b4 in panzer::normals_test2d_UnitTest<panzer::Traits::Jacobian>::runUnitTestImpl (this=0x1272a330 <panzer::instance_normals_JacobianType_test2d_UnitTest>, out=..., success=@0x7fffffff95b8: true)
at ../packages/panzer/disc-fe/test/evaluator_tests/normals.cpp:172
#27 0x00000000108ae968 in Teuchos::UnitTestBase::runUnitTest (this=0x1272a330 <panzer::instance_normals_JacobianType_test2d_UnitTest>, out=...) at ../packages/teuchos/core/src/Teuchos_UnitTestBase.cpp:62
#28 0x00000000108b1f98 in Teuchos::UnitTestRepository::runUnitTestImpl (unitTest=..., out=...) at ../packages/teuchos/core/src/Teuchos_UnitTestRepository.cpp:539
#29 0x00000000108b0104 in Teuchos::UnitTestRepository::runUnitTests (out=...) at ../packages/teuchos/core/src/Teuchos_UnitTestRepository.cpp:317
#30 0x00000000108b16d4 in Teuchos::UnitTestRepository::runUnitTestsFromMain (argc=1, argv=0x7fffffffa7b8) at ../packages/teuchos/core/src/Teuchos_UnitTestRepository.cpp:423
#31 0x00000000100df8ec in main (argc=1, argv=0x7fffffffa7b8) at ../packages/phalanx/test/Utilities/Phalanx_UnitTestMain.cpp:19
(gdb)
@trilinos/intrepid2 @trilinos/sacado @mperego @kyungjoo-kim @etphipp