The balanced graph partitioning problem asks for a partition of the vertices of a graph into $k$ disjoint blocks of roughly equal weight while minimizing the total weight of cut edges.
This is a fundamental problem in computer science with a wide range of applications.
For example, in parallel processing, one can often model computation and communication of an application as a graph.
A balanced partition of this graph then distributes workload evenly across processing elements (PEs), while a small cut reduces inter-processor communication.
Unfortunately, the graph partitioning problem is NP-hard and even NP-hard to approximate within any constant factor.
At the same time, the graphs that arise in practice and require partitioning are often massive, which makes exact methods infeasible and motivates the use of heuristics.
Among these, the multilevel scheme has proven particularly successful for obtaining high-quality partitions.
It proceeds in three phases.
During coarsening, the algorithm builds a hierarchy of successively coarser graphs, typically by contracting vertex sets.
Next, an initial partition is computed on the coarsest level.
... mehr
Finally, during uncoarsening, the contractions are undone level by level, projecting the coarse partition onto the next finer graph.
On each level, the partition is improved using local search algorithms (also called refinement).
The scheme can partition a graph directly into $k$ blocks (called direct $k$-way partitioning) or compute a $k$-way partition via recursive bipartitioning.
A vast body of literature studies heuristic algorithms for graph partitioning, and a wide range of sequential and parallel partitioners (both shared-memory and distributed-memory) has emerged from this work.
Nevertheless, several challenges remain.
Most of the research targets partitioning into a relatively small number of blocks $k$, typically $k \le 256$.
With modern parallel architectures offering ever more processing elements, substantially larger values of $k$ are increasingly relevant.
Unfortunately, state-of-the-art partitioners often struggle in this regime, producing highly imbalanced and thus infeasible or otherwise low-quality partitions, or suffering from excessive running times.
Moreover, although modern parallel partitioners are fast in practice, they require considerable memory to store the input graph along with expensive auxiliary data structures, limiting the size of graphs that can be handled on a single machine.
In addition, while linear running time is frequently observed empirically, they lack worst-case performance guarantees on arbitrary input graphs.
All of these issues arise in both shared-memory and distributed-memory settings.
In this dissertation, we propose several techniques to address these limitations.
We introduce deep multilevel graph partitioning (Deep MGP), a multilevel scheme that extends the coarsening phase deep into the initial partitioning phase.
Deep MGP coarsens the input graph to a small size independent of $k$ and then computes an initial bipartition.
During uncoarsening, as blocks grow, it applies recursive bipartitioning to create additional blocks, keeping block sizes roughly constant until $k$ blocks are obtained.
In this way, Deep MGP combines the merits of direct $k$-way partitioning and recursive bipartitioning, yielding a stronger, more flexible, and arguably more elegant multilevel framework.
We further present KaMinPar, a shared-memory parallel implementation of Deep MGP that deliberately relies on simple, highly efficient building blocks.
In particular, we use size-constrained label propagation both to compute clusterings during coarsening and as a refinement algorithm during uncoarsening.
Empirically, KaMinPar is at least $2\times$ faster than current state-of-the-art shared-memory partitioners - even for small $k$ - while obtaining competitive partition quality.
Somewhat surprisingly, it can even outperform a single-level partitioner that attains reasonable quality.
For very large $k$, KaMinPar is up to $\approx 8\times$ faster than the fastest competing algorithm based on direct $k$-way partitioning.
For unweighted input graphs, KaMinPar can further guarantee balanced partitions by introducing rebalancing as an explicit component in multilevel partitioning.
During the development of KaMinPar, we found that memory consumption is the primary bottleneck limiting scalability to larger graphs.
We therefore present and evaluate a set of optimizations that substantially reduce its memory footprint.
Concretely, we design parallel label propagation clustering and graph contraction algorithms that use only $O(n)$ auxiliary space rather than $O(np)$, where $n$ is the number of vertices and $p$ is the number of processing elements.
Combined with integrating an existing compressed graph representation, these changes reduce peak memory usage by up to $16\times$ while preserving partition quality and maintaining similar running times.
In particular, our optimizations enable partitioning a randomly generated graph with one trillion edges in under 8 minutes on a single machine, using about 900 GiB of memory.
We call the resulting algorithm TeraPart.
Subsequently, we propose a configuration of KaMinPar that runs in strictly linear time and performs linear work, without making any assumptions about the input graph.
To the best of our knowledge, this is the first multilevel partitioner with worst-case linear running time, while prior algorithms require $O((n + m) \log n)$ work or more.
In addition to using linear time building blocks for the multilevel scheme, this requires the construction of a coarse graph hierarchy whose total size is linear in the size of the input graph.
To this end, we show that the coarsening scheme implemented in KaMinPar guarantees a geometric reduction of the number of vertices from level to level, and use graph sparsification to bound the density of coarse graphs.
By benchmarking on graphs that show a considerable increase in density on coarser levels, we show that sparsification yields a $1.5\times$ average speedup (up to $4\times$ on some graphs) over the non-sparsifying baseline, while reducing average partition quality by only 1%.
Moreover, this configuration outperforms state-of-the-art single-level partitioners, producing high-quality partitions in less time.
Lastly, we consider partitioning in distributed memory and propose dKaMinPar, the distributed-memory counterpart of KaMinPar.
Our distributed partitioner is likewise based on Deep MGP and uses a batch-synchronous implementation of size-constrained label propagation for both coarsening and uncoarsening.
In contrast to previous works, we enforce stricter balance constraints on the cluster weights during coarsening, and use explicit rebalancing to ensure balanced partition outputs.
We further integrate the same compressed graph representation as in TeraPart to obtain xTeraPart.
In addition, we adapt an unconstrained local search algorithm recently proposed for GPU partitioning to the distributed setting.
The resulting approach is effective on mesh-like graphs as well as highly skewed web and social graphs, while computing balanced, high-quality partitions on up to 8192 CPU cores and graphs with up to 16 trillion edges.
Moreover, through unconstrained refinement, it substantially improves partition quality over previous distributed partitioners, narrowing the long-standing quality gap between sequential and shared-memory partitioners on the one hand and distributed-memory partitioners on the other.