Trilinos issues
https://gitlab.osti.gov/jmwille/Trilinos/-/issues
2019-05-02T17:33:07Z

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/5036
Zoltan2: Tests failing on ATDM cuda 10 build
2019-05-02T17:33:07Z
James Willenbring
*Created by: fryeguy52*
## Bug Report
CC: @trilinos/zoltan2, @kddevin (Trilinos Data Services Product Lead), @bartlettroscoe, @fryeguy52
## Next Action Status
<status-and-or-first-action>
## Description
As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=6&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug&field2=testname&compare2=65&value2=Zoltan2&field3=status&compare3=61&value3=Failed&field4=site&compare4=61&value4=white&field5=buildstarttime&compare5=84&value5=2019-04-29T00%3A00%3A00&field6=buildstarttime&compare6=83&value6=2019-03-30T00%3A00%3A00) the tests:
* Zoltan2_directoryTest_Kokkos_MPI_4
* Zoltan2_directoryTest_KokkosSimple_MPI_4
* Zoltan2_directoryTest_findUniqueGids.cpp_MPI_4
are failing in the build:
* Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug
## Current Status on CDash
[Failing zoltan2 tests for the current testing day](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=6&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug&field2=testname&compare2=65&value2=Zoltan2&field3=status&compare3=61&value3=Failed&field4=site&compare4=61&value4=white&field5=buildstarttime&compare5=84&value5=today&field6=buildstarttime&compare6=83&value6=yesterday)
## Steps to Reproduce
One should be able to reproduce this failure on ride or white as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
More specifically, the commands given for ride or white are provided at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ridewhite
The exact commands to reproduce this issue should be:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Zoltan2=ON \
$TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
```
Initial cleanup of new ATDM builds of Trilinos

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/4289
Zoltan2: Create testing for useDegreeAsWeight in Zoltan2_ImbalanceMetricsUtility.hpp
2019-01-29T13:04:23Z
James Willenbring
*Created by: MicheldeMessieres*
@trilinos/zoltan2
## Current Behavior
None of the zoltan2 tests call the code in Zoltan2_ImbalanceMetricsUtility.hpp for the option useDegreeAsWeight.
## Motivation and Context
A bug was fixed in #4271 but this could have been caught earlier if we had test coverage.
## Definition of Done
A unit test is added which fails if there are problems with useDegreeAsWeight code in Zoltan2_ImbalanceMetricsUtility.hpp. We might extend some current testing in the metric test.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/4111
Zoltan2_AlgPuLP: Closing old PR #1632 "resolved conflict with offset_t"
2018-12-20T19:10:46Z
James Willenbring
*Created by: william76*
@trilinos/zoltan2
Pull Request #1632 "resolved conflict with offset_t" is really old and appears to be stale and/or abandoned. I'm closing that PR and creating this issue to link to it.
I'm doing this because we need to close out some of the old PRs because of GitHub limits on the number of checks allowed per hour. The pull request autotester uses a polling model to check existing pull requests' status flags, etc., and we have occasionally hit that limit, which causes GitHub to reject the queries and can cause the Autotester to fail until the counter resets at the start of the next hour. If this PR needs to be brought back to life, it can easily be reopened on the pull request page.
If this PR is truly dead, please close out this issue ticket.
FYI: @kddevin @mhoemmen @jwillenbring

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/3286
Zoltan2 test zoltanCompare may fail with Tpetra_INST_INT_INT=OFF
2018-08-17T22:45:21Z
James Willenbring
*Created by: kddevin*
@trilinos/zoltan2
## Expectations
## Current Behavior
Zoltan2 test zoltanCompare may fail with Tpetra_INST_INT_INT=OFF due to differences in the way 64-bit IDs are represented in the test's interface to Zoltan and Zoltan2's interface to Zoltan. The failing component tests are PHG. The different representations of 64-bit IDs lead to different hash values and different visit orders and, thus, different partitions in the test. We will need to fix the test to use the same representation as the Zoltan2 interface.
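To illustrate why the representation matters, here is a minimal sketch using a toy hash. It is not Zoltan's hash function, and `toy_hash`, `as_one_entry`, and `as_two_entries` are hypothetical names. The sketch assumes one interface passes a 64-bit ID as a single entry while the other splits it into two 32-bit entries; the issue does not say which representations are actually involved.

```cpp
#include <cstdint>
#include <vector>

// Toy polynomial hash over the entries of an ID (NOT Zoltan's hash).
std::uint64_t toy_hash(const std::vector<std::uint64_t>& entries) {
    std::uint64_t h = 17;
    for (std::uint64_t e : entries) {
        h = h * 31 + e;  // unsigned arithmetic wraps, which is fine for a hash
    }
    return h;
}

// Assumed representation 1: the 64-bit ID is stored as a single entry.
std::vector<std::uint64_t> as_one_entry(std::uint64_t id) {
    return {id};
}

// Assumed representation 2: the ID is split into high and low 32-bit entries.
std::vector<std::uint64_t> as_two_entries(std::uint64_t id) {
    return {id >> 32, id & 0xFFFFFFFFULL};
}
```

Hashing the same logical ID through `as_one_entry` and `as_two_entries` gives different values, so any algorithm that orders or buckets vertices by hash can visit them in a different order and produce a different partition.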
## Motivation and Context
## Definition of Done
## Possible Solution
## Steps to Reproduce
I saw this failure on a Mac with Tpetra_INST_INT_INT=OFF.
## Your Environment
- **Relevant repo SHA1s:**
- **Relevant configure flags or configure script:**
- **Operating system and version:**
- **Compiler and TPL versions:**
## Related Issues
* Related to #3194
## Additional Information

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/2052
Test Zoltan2_orderingTestDriverExample_MPI_1 failing when trying to enable Scotch TPL
2017-12-06T14:19:37Z
James Willenbring
*Created by: bartlettroscoe*
**CC:** @trilinos/zoltan2
## Description:
The test `Zoltan2_orderingTestDriverExample_MPI_1` fails when trying to enable the SEMS Scotch TPL (see #2051). It appears that this test never runs in automated testing on CDash from looking at the query:
* https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2017-12-05&filtercount=2&showfilters=1&filtercombine=and&field1=testname&compare1=65&value1=Zoltan2_orderingTestDriverExample&field2=buildstarttime&compare2=84&value2=now
My guess is that this is because Zoltan2 is never tested with Scotch enabled in any automated builds of Trilinos that post to CDash. Therefore, this failing test is not really a regression, because it is never run.
* Related to: #1400, #2051

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1904
Zoltan2: Use MPI_Alltoallv for all-to-all, instead of hand-rolling it from sends and receives
2017-10-25T21:59:39Z
James Willenbring
*Created by: mhoemmen*
@trilinos/zoltan2
See `Trilinos/packages/zoltan2/src/util/Zoltan2_AlltoAll.[ch]pp`. If necessary, add a Teuchos wrapper for `MPI_Alltoallv`.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1538
Zoltan2: Possible scaling issue with MueLu/Z2multijagged for MueLu coarse level repartitioning?
2018-01-12T16:32:01Z
James Willenbring
*Created by: pwxy*
I observed the following scaling of "MueLu: Zoltan2Interface: Zoltan2 multijagged (sub, total, level=2)"
(the time for Zoltan2::PartitioningProblem->solve())
on the LLNL IBM BG/Q platform for strong scaling for the Drekar Poisson test case.
I started with a 2.4B-row matrix, but Zoltan2 repartitioning is not called until after two levels of MueLu aggregation
(~700x factor reduction). So this is a case with few matrix rows per MPI process
(probably not the standard usage of Z2 in most apps):

| MPI processes | "MueLu: Zoltan2Interface: Zoltan2 multijagged (sub, total, level=2)" time (s) |
|---|---|
| 131072 | 2.10 |
| 262144 | |
| 524288 | 12.25 |
| 1048576 | 26.7 |
| 1572864 | 66.9 |

I built the muelu driver on solo and ran with 256, 512, 1024, 2048, 4096 and 8192 MPI processes
and could see that the Zoltan2 multijagged isn't scaling as well as hoped (but it is definitely easier
to see the problem at much larger scales).
This is strong scaling with "Matrix type: Brick3D" (27 nnz per row) with problem size of 81M rows.
Zoltan2 is not called until after two levels of coarsening (each coarsening reduces the rows by
factor of roughly 27), so for example the 1024 MPI case, the matrix Z2 gets is 118,000 rows.
Times are the max over MPI processes for "MueLu: Zoltan2Interface: Zoltan2 multijagged (sub, total, level=2)"
(this is the time for Z2 MJ to construct the new partitioning; MueLu tells Z2 how many partitions are needed
and MueLu migrates the data afterwards)
for both "mj_migration_type"=0 and "mj_migration_type"=1.
I performed 3 runs of each and report the lowest time below.

| MPI processes | MJ=0 time (s) | MJ=1 time (s) |
|---|---|---|
| 256 | 0.0060 | 0.0060 |
| 512 | 0.0091 | 0.0090 |
| 1024 | 0.0144 | 0.0142 |
| 2048 | 0.0247 | 0.0244 |
| 4096 | 0.0607 | 0.0605 |
| 8192 | 0.1091 | 0.1089 |

So unless I screwed up, there doesn't seem to be much difference between "mj_migration_type"=0 and "mj_migration_type"=1
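The growth in these timings can be quantified as the ratio between consecutive measurements, since each row doubles the MPI process count. The helper below is a small illustrative sketch; `growth_per_doubling` is a hypothetical name and is not part of MueLu or Zoltan2.

```cpp
#include <cstddef>
#include <vector>

// Ratio of each timing to the previous one, i.e. the change in run time per
// doubling of the MPI process count. A ratio above 1 means the time grew
// as processes were added.
std::vector<double> growth_per_doubling(const std::vector<double>& seconds) {
    std::vector<double> ratios;
    for (std::size_t i = 1; i < seconds.size(); ++i) {
        ratios.push_back(seconds[i] / seconds[i - 1]);
    }
    return ratios;
}
```

For the MJ=0 timings reported above, `growth_per_doubling({0.0060, 0.0091, 0.0144, 0.0247, 0.0607, 0.1091})` yields five ratios, all above 1.5, so the time grows with every doubling instead of shrinking.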
On solo the only module change I made was "module swap intel intel/17.0.4.196"
cmake file attached; muelu xml file attached
Here are my input arguments to the muelu driver:
MueLu_Driver.exe --matrixType=Brick3D --nx=433 --ny=433 --nz=433 --mx=${xproc} --my=${yproc} --mz=${zproc} --xml="muelu_scaling.xml"

| MPI processes | xproc | yproc | zproc |
|---|---|---|---|
| 256 | 8 | 8 | 4 |
| 512 | 8 | 8 | 8 |
| 1024 | 16 | 8 | 8 |
| 2048 | 16 | 16 | 8 |
| 4096 | 16 | 16 | 16 |
| 8192 | 32 | 16 | 16 |

[cmake_muelu_kokkos_serial_serrano_icc17.txt](https://github.com/trilinos/Trilinos/files/1178021/cmake_muelu_kokkos_serial_serrano_icc17.txt)
[muelu_scaling.xml-z2mj_mj0_lev2minpp1024-c1000-t_exp-remap_rebpr-1vcyc11.txt](https://github.com/trilinos/Trilinos/files/1178022/muelu_scaling.xml-z2mj_mj0_lev2minpp1024-c1000-t_exp-remap_rebpr-1vcyc11.txt)

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1312
Zoltan2 partitioning metrics return scalar_t; should return double
2019-01-21T23:08:51Z
James Willenbring
*Created by: kddevin*
Zoltan2's partitioning metrics class EvaluatePartition returns, among other things, imbalance values. Imbalance is inherently a float or double. While scalar_t is often float or double, it could be int. When scalar_t is int, returning imbalance as an int is inaccurate.
It appears that currently, all metrics are stored as scalar_t. We could store them as double instead.
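The loss of accuracy is easy to reproduce in isolation. The sketch below assumes the common definition of imbalance as the maximum part weight divided by the average part weight; `imbalance` is a hypothetical helper, not the EvaluatePartition code. It computes the metric entirely in the type `Result`, so the same weights can be evaluated as `int` or as `double`.

```cpp
#include <algorithm>
#include <vector>

// Imbalance = (max part weight) / (average part weight), with all arithmetic
// done in type Result. Assumes a non-empty list of positive weights.
template <typename Result, typename Weight>
Result imbalance(const std::vector<Weight>& part_weights) {
    Result total = 0;
    Result max_weight = 0;
    for (Weight w : part_weights) {
        total += static_cast<Result>(w);
        max_weight = std::max(max_weight, static_cast<Result>(w));
    }
    Result average = total / static_cast<Result>(part_weights.size());
    return max_weight / average;
}
```

With integer part weights {3, 3, 4}, `imbalance<int>` returns 1, which looks like perfect balance, while `imbalance<double>` returns about 1.2.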
@trilinos/zoltan2

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1280
Zoltan2: TPL_Traits::ASSIGN_ARRAY is not unit-tested
2017-05-01T21:53:04Z
James Willenbring
*Created by: kddevin*
Some parts of TPL_Traits are unit-tested, but ASSIGN_ARRAY, DELETE_ARRAY, and SAVE_ARRAYRCP are not.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/1024
Zoltan2 Hypergraph model assumes contiguous GID number from base 1
2017-01-29T04:16:14Z
James Willenbring
*Created by: kddevin*
Zoltan2's hypergraph model assumes global IDs of entities are numbered contiguously from base 1. Variable numGlobalVertices_ is set to the global maximum GID value. When global IDs are not numbered contiguously or from base 1, this computation is incorrect.
Also, the Tpetra::Maps used in building the hypergraph model use indexBase=0. I think they could be more efficient for the non-contiguous case if the minimum ID were computed and used in the Map constructors.
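A serial sketch of the difference, with hypothetical helper names and `gno_t` standing in for Zoltan2's global ordinal type. In the real model the maximum, minimum, and distinct count would each need a global reduction across ranks, which this sketch omits.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

using gno_t = std::int64_t;  // stand-in for Zoltan2's global ordinal type

// Count as described in the issue: the largest GID value.
gno_t count_as_max_gid(const std::vector<gno_t>& gids) {
    return gids.empty() ? 0 : *std::max_element(gids.begin(), gids.end());
}

// Count that holds for any numbering: the number of distinct GIDs.
gno_t count_distinct(const std::vector<gno_t>& gids) {
    return static_cast<gno_t>(std::set<gno_t>(gids.begin(), gids.end()).size());
}

// Smallest GID, a candidate indexBase for the non-contiguous case.
gno_t min_gid(const std::vector<gno_t>& gids) {
    return gids.empty() ? 0 : *std::min_element(gids.begin(), gids.end());
}
```

The two counts agree for GIDs {1, 2, 3}, but for {10, 20, 30} the maximum is 30 while there are only 3 vertices, and for the base-0 numbering {0, 1, 2} the maximum undercounts by one.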
@trilinos/zoltan2

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/885
Zoltan2_ImbalanceMetricsUtility.hpp targetNumParts is comm size
2016-12-14T23:07:18Z
James Willenbring
*Created by: mndevec*
The imbalanceMetrics function in Zoltan2_ImbalanceMetricsUtility.hpp seems to obtain targetNumParts from the communicator size, around line 454.
Is this the way that was decided? Shouldn't it obtain those from Solution using getTargetGlobalNumberOfParts?
This prints invalid imbalances for the mapping problems where I have serial input adapters.
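A small sketch of how the wrong part count distorts the result, assuming imbalance is the maximum part weight divided by the average weight per target part. `imbalance_for_target` is a hypothetical helper, not the code in Zoltan2_ImbalanceMetricsUtility.hpp.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Imbalance relative to a target part count:
// max part weight / (total weight / target_num_parts).
// Assumes a non-empty weight list and target_num_parts > 0.
double imbalance_for_target(const std::vector<double>& part_weights,
                            int target_num_parts) {
    double total = std::accumulate(part_weights.begin(), part_weights.end(), 0.0);
    double max_weight = *std::max_element(part_weights.begin(), part_weights.end());
    return max_weight / (total / target_num_parts);
}
```

For four perfectly balanced parts of weight 25 each, a target of 4 gives 1.0. With a serial input adapter the communicator size is 1, and using that as the target gives 0.25, which is not a valid imbalance.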
@trilinos/zoltan2

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/881
Zoltan2::EvaluatePartition - calculates metric inside the constructor
2016-11-28T19:36:10Z
James Willenbring
*Created by: mndevec*
EvaluatePartition has a shared constructor in which all available metrics are calculated. This created some issues when I derived a new class from EvaluatePartition.
I am currently working on the mapping problem. Many classes in the mapping problem inherit from the classes of the partitioning problem, since a mapping problem is also a partitioning problem with extra considerations, including the underlying network topology. MappingSolution inherits from PartitioningSolution and EvaluateMapping inherits from EvaluatePartition, so that I can add new functionality with minimal code.
I was adding virtual metric calculation functions to EvaluatePartition so that EvaluateMapping could override them. However, since these functions are called within the constructor, virtual dispatch does not work: while the EvaluatePartition constructor runs, the derived class's overrides are not yet in effect.
This is not an issue for the EvaluatePartition class itself, but moving the metric calculation out of the EvaluatePartition constructor would make life much easier for the EvaluateMapping class. This will require adding an evaluate function (the third line below), which requires updates all over the code. I am willing to make this change if you agree.
```c++
typedef Zoltan2::EvaluatePartition<my_adapter_t> quality_t;
RCP<quality_t> metricObject_1 = rcp(new quality_t(input_adapter,problemParams,comm,solution));
metricObject_1->evaluate();
metricObject_1->printMetrics(cout);
```
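The C++ behavior behind this can be shown in isolation. In the sketch below all four classes are hypothetical stand-ins for EvaluatePartition and EvaluateMapping: the "eager" pair computes in the base constructor, and the "lazy" pair defers the computation to a separate `evaluate()` call as proposed above.

```cpp
#include <string>

// Eager variant: the base constructor calls a virtual function.
class EagerBase {
public:
    // During base construction the object is still an EagerBase, so this
    // call always resolves to EagerBase::compute.
    EagerBase() { result_ = compute(); }
    virtual ~EagerBase() = default;
    virtual std::string compute() const { return "partition metrics"; }
    const std::string& result() const { return result_; }
private:
    std::string result_;
};

class EagerDerived : public EagerBase {
public:
    std::string compute() const override { return "mapping metrics"; }
};

// Lazy variant: construction does no work; evaluate() runs afterwards.
class LazyBase {
public:
    virtual ~LazyBase() = default;
    // Called on a fully constructed object, so the override is used.
    void evaluate() { result_ = compute(); }
    const std::string& result() const { return result_; }
protected:
    virtual std::string compute() const { return "partition metrics"; }
private:
    std::string result_;
};

class LazyDerived : public LazyBase {
protected:
    std::string compute() const override { return "mapping metrics"; }
};
```

Constructing an `EagerDerived` leaves `result()` equal to "partition metrics", whereas a `LazyDerived` reports "mapping metrics" after `evaluate()` is called.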
@trilinos/zoltan2

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/595
Zoltan2: Add 2D Matrix partitioning and MatrixPartitioningProblem or MatrixPartitioningSolution
2016-09-07T20:20:13Z
James Willenbring
*Created by: kddevin*
We have demonstrated the effectiveness of 2D matrix partitioning for data-oriented problems
(SC13 Boman, Rajamanickam, Devine), but we do not have that capability in Zoltan2.
This work will include providing capability for Cartesian partitions of the matrix and associated vector partitions.
It will also include matrix distribution.
May include matrix sampling to accelerate hypergraph partitioning and experiments with PuLP.

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/537
Zoltan2 definition of part_t = int causes problems with ETI when GO != int
2016-09-01T16:47:56Z
James Willenbring
*Created by: kddevin*
@trilinos/zoltan2 @trilinos/tpetra
Zoltan2 defines part_t (part numbers) to be int, consistent with MPI ranks.
For task mapping, we map MPI ranks to allocated cores; thus, all entities are labeled with part_t. For some operations, we use Tpetra::Map<part_t, part_t>, Tpetra::CrsMatrix<double, part_t, part_t>, etc.
When ETI is enabled and Tpetra_INST_INT_INT=OFF, the task mapping code will not link, as Tpetra doesn't instantiate Tpetra::Map<part_t, part_t>, Tpetra::Import<part_t, part_t>, etc.
Possible solutions are below; are there others?
1. Always require LocalOrdinal=int and GlobalOrdinal=int in ETI when Zoltan2 is enabled. This solution would not be popular with the folks trying to reduce code size for Sierra. Sierra needs Zoltan2.
2. Template Tpetra classes on lno_t and gno_t, even though part_t would be more natural and self-documenting; cast part_t data to gno_t or lno_t in the task mapping code whenever needed for Tpetra. This solution would be awkward and less descriptive in the code, as it is needed only to satisfy ETI. But it may be necessary.
3. Change part_t to be GlobalOrdinal. This solves half the ETI problem, but it breaks backward compatibility, as Zoltan2's interface returns values of type part_t (e.g., PartitioningSolution.getPartListView()).
Other ideas, anyone? Meanwhile, I'll start looking at the 2nd idea.
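The second idea amounts to converting part numbers at the Tpetra boundary. The sketch below is only an illustration with standard containers: `parts_as_gno` and `gno_as_parts` are hypothetical helpers, and `gno_t` is assumed here to be a 64-bit integer.

```cpp
#include <cstdint>
#include <vector>

using part_t = int;          // as in the issue: part numbers are int
using gno_t = std::int64_t;  // assumed global ordinal type for this sketch

// Copy part numbers into gno_t storage so they can be passed to classes
// that are only instantiated for gno_t.
std::vector<gno_t> parts_as_gno(const std::vector<part_t>& parts) {
    return std::vector<gno_t>(parts.begin(), parts.end());
}

// Convert back so the public interface can keep returning part_t.
// Assumes every value fits in part_t.
std::vector<part_t> gno_as_parts(const std::vector<gno_t>& values) {
    std::vector<part_t> parts;
    parts.reserve(values.size());
    for (gno_t v : values) {
        parts.push_back(static_cast<part_t>(v));
    }
    return parts;
}
```

The cost is an extra copy and less descriptive types inside the task mapping code, which matches the awkwardness noted in option 2.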

https://gitlab.osti.gov/jmwille/Trilinos/-/issues/281
Panzer: Add gid generator for edge and face entities
2017-07-29T16:54:32Z
James Willenbring
*Created by: rppawlo*
Given node gids and element connectivity, add a function to generate edge and face gids. This will allow more flexibility in connecting to less capable mesh databases.
The required functionality has been added to Zoltan; it just needs to be added to the Panzer DOF manager library.
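As a rough serial illustration of the idea, an edge can be identified by the sorted pair of its node gids, and each unique pair is then given its own gid. Everything in this sketch is hypothetical (`GlobalId`, `Edge`, `kTriEdges`, `generate_edge_gids`); it handles only triangles and ignores the parallel agreement between ranks that a real implementation would need.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using GlobalId = std::int64_t;
using Edge = std::pair<GlobalId, GlobalId>;  // node gids, smaller first

// Local edges of a triangle as pairs of local node indices (an assumption;
// other element types need their own table).
constexpr int kTriEdges[3][2] = {{0, 1}, {1, 2}, {2, 0}};

// Assign one gid per unique edge, numbering edges in sorted key order.
std::map<Edge, GlobalId> generate_edge_gids(
    const std::vector<std::array<GlobalId, 3>>& triangles) {
    std::map<Edge, GlobalId> edge_gids;
    for (const auto& tri : triangles) {
        for (const auto& e : kTriEdges) {
            Edge key = std::minmax(tri[e[0]], tri[e[1]]);
            edge_gids.emplace(key, 0);  // placeholder; repeated edges are ignored
        }
    }
    GlobalId next = 0;
    for (auto& entry : edge_gids) {
        entry.second = next++;
    }
    return edge_gids;
}
```

Two triangles with node gids {10, 20, 30} and {20, 30, 40} share the edge (20, 30), so the function returns five edge gids rather than six. Face gids could be generated the same way by keying on the sorted tuple of a face's node gids.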
@trilinos/panzer