The research field of evolutionary bioinformatics describes the evolutionary origins of organisms, primarily using (mathematical) trees.
These trees are based on statistical models that account for variations among the observed genomic data.
While phylogenetic trees describe the shared evolutionary history among distinct species, genealogical trees consider distinct individuals of the same species.
The amount of available genomic data grows exponentially, presenting major challenges for applications in evolutionary bioinformatics.
For instance, as single-core performance plateaus, we require algorithms that are both work-efficient and highly parallel.
However, existing petascale systems experience hardware failures 0.8 to 5.7 times per day, and future exascale systems are expected to fail every $30$ to $60$ minutes.
To avoid losing progress, algorithms should mitigate these failures.
Further, algorithms that yield bit-identical results across different hardware environments strengthen scientific claims.
Finally, software libraries facilitate efficient application development by providing abstractions to manage the growing complexity of scientific codes.
In this thesis, we address four concrete instances of the challenges outlined above:
We address the reproducibility challenge by presenting the first parallel phylogenetic tree inference tool that yields bit-identical results under varying core counts.
To this end, we also present a bit-reproducible operation-agnostic distributed reduction algorithm.
Additionally, we address the algorithmic challenge by proposing to compress sets of genealogical trees into a Directed Acyclic Graph and thus encode shared subtrees only once.
This enables straightforward memoization, thereby yielding substantial runtime reductions over the state-of-the-art.
Moreover, we address the hardware failure challenge by substantially accelerating the failure recovery of a fault-tolerant RAxML-NG version -- a widely-used tool for phylogenetic tree inference.
To this end, we present what is, to the best of our knowledge, the first general-purpose checkpointing solution that allows for in-memory recovery after a fail-stop hardware failure without requiring spare cores (shrinking recovery).
Finally, we address the software complexity challenge by developing high-level C++ bindings for MPI.
Our C++ bindings enable a faster, less error-prone development of distributed applications, including access to MPI's failure-mitigation mechanisms and our bit-reproducible parallel reduce implementation.
Reproducibility Challenge: Maximum-Likelihood phylogenetic tree inference searches for the tree and evolutionary model that best explain the observed genomic data.
The likelihood score calculations at different genomic loci/sites are independent, and thus they are commonly computed in parallel and then reduced into a single tree score.
However, fundamental arithmetic operations on IEEE 754 floating-point numbers, such as addition, inherently introduce rounding errors, causing the order of floating-point operations to affect the result.
Consequently, common parallel reduction algorithms re-associate operations differently under varying core counts, resulting in different round-off errors.
We observe that this effect causes phylogenetic tree searches to diverge under varying degrees of parallelism for 31 % of 10130 empirical datasets.
8 % of these divergent datasets even yield trees that are significantly worse than the best found tree (AU-test, $p<0.05$).
To alleviate this divergence, we present a variant of RAxML-NG that ensures bit-reproducible results under varying core counts, with a slowdown of only $0$ to $6.7$ % (median $0.93$ %) on up to $768$ cores.
To this end, we introduce the general-purpose distributed reduction algorithm ReproRed, which ensures bit-identical results under varying core counts by maintaining a fixed order of operations independent of the communication pattern.
Thus, ReproRed is applicable to all associative reduction operations -- in contrast to competitors, which are confined to summation only.
ReproRed exchanges only the theoretical minimum number of messages, overlaps communication with computation, and utilizes a fast base case for the local reductions.
Further, ReproRed is able to all-reduce (via a subsequent broadcast) $4.1 \times 10^6$ operands across $48$ to $768$ cores in $19.7$ to $48.61$ µs, exhibiting a slowdown of $13$ to $93$ % over a non-reproducible all-reduce algorithm and $8$ to $25$ % over a non-reproducible Reduce-Bcast algorithm.
ReproRed outperforms the state-of-the-art reproducible algorithm (summation only) beyond $10000$ elements per core.
Algorithmic Challenge: In population genetic studies of recombining organisms (e.g., humans), sets of genealogical trees provide a more comprehensive view on evolution than a single tree, as commonly used in phylogenetics.
Here, each of the (highly-correlated) trees describes the evolutionary history of a distinct genomic region.
The current state-of-the-art data structure compresses these trees via edit operations when moving from one tree to the next along the genome.
In contrast, we propose to compress the input trees into a Directed Acyclic Graph, such that subtrees are encoded only once, even if they appear multiple times in the input.
This allows for a straightforward memoization of intermediate results.
Queries on our data structure are $2.1$ to $11.2$ (median $4.0$) times faster than the state-of-the-art for obtaining important population genetic statistics such as Patterson's $f$, Tajima's $D$, pairwise Lowest Common Ancestor, and others, on empirical and simulated datasets.
Hardware-Failure Challenge: When mitigating hardware failures, it is often impractical either to request replacements for failed cores or to maintain spares.
Thus, algorithms must continue with the remaining cores, necessitating them to restore lost data and to redistribute the workload and respective data to the remaining cores.
However, current checkpointing libraries often do not support such a data redistribution.
Moreover, they typically write checkpoints to the Parallel File System instead of the main memory, resulting in slow recoveries due to low disk access speeds and disk/network congestion.
We present ReStore, which supports the data redistribution required by shrinking recovery and backs up the application data in memory.
ReStore reloads lost data in milliseconds on up to $24576$ cores and thereby accelerates the reloading of data during failure recovery for the failure-tolerant version of RAxML-NG by one to two orders of magnitude.
Software Complexity Challenge: The Message Passing Interface (MPI) abstracts over message exchanges in parallel distributed-memory machines.
However, although high-level programming languages such as C++ make software development faster and less error-prone, MPI does not provide C++ bindings.
We therefore propose such bindings along with an extensively tested, open-source implementation, KaMPIng, that covers all abstraction levels from low-level MPI calls to convenient STL-style bindings.
Through a configurable inference of parameter defaults, fine-grained memory allocation control, enhanced safety guarantees, and a flexible plugin system, KaMPIng enables both rapid prototyping and careful fine-tuning of distributed algorithms.
By utilizing template metaprogramming, KaMPIng incurs near-zero runtime overhead.
We also integrate ReproRed as well as abstractions over MPI's failure-mitigation interface into KaMPIng and showcase KaMPIng via multiple application benchmarks.
Various scientific C++ projects already utilize KaMPIng, and we are working with the MPI language binding working group towards the development of novel MPI C++ bindings with KaMPIng as a reference implementation.
Impact: Our work allows researchers to analyze the exponentially growing phylogenetic and population genetic datasets, to perform novel, work-intensive analyses, to publish reproducible results, and to develop complex distributed algorithms through appropriate abstractions.
We demonstrate this by developing what is, to the best of our knowledge, the first bit-reproducible phylogenetic tree inference tool, as well as by substantially accelerating the recovery of the fault-tolerant RAxML-NG version.
Thus, our contributions push the scalability frontier in evolutionary bioinformatics in particular and High Performance Computing in general.