MueLu segfaults in DeterminePartitionPlacement for RefMaxwell on serrano for >= 16 nodes
Created by: pwxy
MueLu segfaults in DeterminePartitionPlacement for RefMaxwell on serrano. I have been unable to reproduce the issue on mutrino, and the failure is intermittent. It first appears on 16 nodes, where a run is still more likely to succeed than fail; by 64 nodes a run is more likely to fail than succeed. The test case is the awesome blob medium-size mesh (refinement=1) with 256 MPI ranks. The 16-node case uses 2 OpenMP threads per MPI rank; the 64-node case uses 8 OpenMP threads per MPI rank.
It fails at line 535 of MueLu_RepartitionFactory_def.hpp, in DeterminePartitionPlacement():
// Step 4: Assign unassigned partitions if necessary.
// We do that through random matching for remaining partitions. Not all part numbers are valid, but valid parts are a
// subset of [0, numProcs). The reason it is done this way is that we don't need any extra communication, as we don't
// need to know which parts are valid.
// TODO The cost of this loop is numprocs*log(numprocs), as match is a std::set(). Can this cost be reduced?
if (numPartitions - numMatched > 0) {
  for (int part = 0, matcher = 0; part < numProcs; part++) {
    if (match.count(part) == 0) {
      // Find first non-matched rank that accepts partitions
      while (matchedRanks[matcher] || !procWillAcceptPartition[matcher])  // <-- line 535
        matcher++;
      match[part] = matcher++;
    }
  }
}
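One reading of the loop above, not confirmed in this report: `matcher` is never checked against `numProcs`, and the outer loop visits every part number in [0, numProcs), not only the `numPartitions` valid ones. If fewer unmatched ranks accept partitions than there are unmatched part numbers (plausible with allSubdomainsAcceptPartitions=false, as in frame #0), the `while` would read past the end of `matchedRanks`. The following is a minimal stand-alone sketch of a bounds-checked variant. The function name `assignUnmatchedParts`, the container types, and the decision to throw are assumptions made for illustration; this is not MueLu's code and not a proposed patch.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <vector>

// Hypothetical stand-alone version of the Step 4 loop.
// match:                   part -> rank pairs already decided (taken by value).
// matchedRanks:            true if a rank already owns a part.
// procWillAcceptPartition: true if a rank is willing to receive a part.
std::map<int, int> assignUnmatchedParts(
    std::map<int, int> match,
    const std::vector<bool>& matchedRanks,
    const std::vector<bool>& procWillAcceptPartition) {
  assert(matchedRanks.size() == procWillAcceptPartition.size());
  const int numProcs = static_cast<int>(matchedRanks.size());

  int matcher = 0;
  for (int part = 0; part < numProcs; part++) {
    if (match.count(part) != 0)
      continue;  // this part already has a rank

    // Find the first non-matched rank that accepts partitions,
    // stopping at numProcs instead of running off the end.
    while (matcher < numProcs &&
           (matchedRanks[matcher] || !procWillAcceptPartition[matcher]))
      matcher++;

    if (matcher == numProcs)
      throw std::runtime_error(
          "no unmatched rank left that accepts partitions");

    match[part] = matcher++;
  }
  return match;
}
```

With four ranks, rank 0 already matched to part 0, and rank 1 refusing partitions, the sketch assigns parts 1 and 2 to ranks 2 and 3 and then throws for part 3, which is the point at which the unchecked loop would index out of range.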
Stack trace:
#0 MueLu::RepartitionFactory<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::DeterminePartitionPlacement (this=0x7ffffffe7b18, A=..., decomposition=..., numPartitions=10, willAcceptPartition=248, allSubdomainsAcceptPartitions=false) at ../../packages/muelu/src/Rebalancing/MueLu_RepartitionFactory_def.hpp:535
#1 0x0000000004fc094d in MueLu::RepartitionFactory<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::Build (this=0x7ffffffe7b18, currentLevel=...) at ../../packages/muelu/src/Rebalancing/MueLu_RepartitionFactory_def.hpp:219
#2 0x0000000004620e69 in MueLu::RefMaxwell<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::compute() ()
#3 0x0000000003416562 in MueLu::RefMaxwell<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::RefMaxwell(Teuchos::RCP<Xpetra::Matrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > > const&, Teuchos::ParameterList&, bool) ()
#4 0x000000000346131f in Thyra::MueLuRefMaxwellPreconditionerFactory<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::initializePrec(Teuchos::RCP<Thyra::LinearOpSourceBase const> const&, Thyra::PreconditionerBase*, Thyra::ESupportSolveUse) const ()
#5 0x0000000005e74234 in Thyra::BelosLinearOpWithSolveFactory::initializeOpImpl(Teuchos::RCP<Thyra::LinearOpSourceBase const> const&, Teuchos::RCP<Thyra::LinearOpSourceBase const> const&, Teuchos::RCP<Thyra::PreconditionerBase const> const&, bool, Thyra::LinearOpWithSolveBase*, Thyra::ESupportSolveUse) const ()
#6 0x0000000005e74c23 in Thyra::BelosLinearOpWithSolveFactory::initializeOp(Teuchos::RCP<Thyra::LinearOpSourceBase const> const&, Thyra::LinearOpWithSolveBase*, Thyra::ESupportSolveUse) const ()
#7 0x0000000001e8575d in void Thyra::initializeOp(Thyra::LinearOpWithSolveFactoryBase const&, Teuchos::RCP<Thyra::LinearOpBase const> const&, Teuchos::Ptr<Thyra::LinearOpWithSolveBase > const&, Thyra::ESupportSolveUse) ()