Inferring phylogenies under the Maximum Likelihood (ML) criterion is $\mathcal{NP}$-hard. Therefore, ML tree inference tools deploy heuristics to navigate through the vast tree space. For a given multiple sequence alignment (MSA), phylogenetic inference tools strive to determine the statistical model comprising the tree topology, its branch lengths, and the evolutionary model parameters, that maximize the phylogenetic likelihood function. The shape of the likelihood space, however, varies substantially across datasets. For MSAs with strong phylogenetic signal, the tree space exhibits a single likelihood peak, associated with a unique topology, to which most, if not all, independent ML tree searches rapidly converge. For other datasets, distinct searches may yield contradicting—yet equally likely or statistically indistinguishable—topologies, reflecting a rugged space with multiple local optima.
At the same time, MSA input sequences are inherently noisy. This increases the chances that ML tools may experience overfitting, whereby the best-found topology incorrectly models noise alongside the true phylogenetic signal. However, even if overfitting does not occur, the de facto presence of noise raises the question whether thorough optimization is justified. ... mehrIn many cases, a reasonably "good" and statistically equivalent (to the best-found) solution may suffice, especially since thorough ML inference tools, such as RAxML-NG, often only attain marginal and statistically indistinguishable likelihood improvements during the final optimization stages.
An orthogonal challenge arises from multi-gene alignments with missing data, which occasionally give rise to the phenomenon of so-called stands in the phylogenetic tree space. Stands are sets of trees that are compatible with the incomplete per-gene trees. Under standard tree scoring criteria that are typically used, the trees on a stand have identical scores, and thereby constitute a terrace. Terraces magnify the number of plausible trees to be considered and undermine the robustness of a phylogenetic analysis.
All issues mentioned hitherto reveal the internal complexity of the ML tree inference problem, both from an engineering as well as a biological perspective. On top of this, modern sequencing technologies generate sequence data at a super-exponential rate, thereby imposing an additional, external pressure on ML-based methods, namely, the continuous increase in size of MSAs to be analyzed. Due to the miscellany of internal and external challenges, the need to devise novel, adaptive, data-informed heuristics that can identify equally plausible trees more rapidly has become imperative.
The purpose of this thesis is to offer an improved understanding of the structure of the likelihood-tree space, and of how ML tree inference heuristics operate on it. My first contribution is the development and implementation of an adaptive tree-search heuristic in RAxML-NG, which adjusts the thoroughness of the search as a function of the difficulty score predicted by Pythia. Pythia is a machine learning predictor that accurately estimates the phylogenetic signal in a given input MSA. Adaptive RAxML-NG infers plausible trees in 99.7% of cases, and achieves speedups of up to 16$\times$ compared to RAxML-NG v1.1, particularly on easy- and difficult-to-analyze MSAs.
The second contribution is the integration of the Kishino-Hasegawa (KH) test into RAxML-NG, which is then used as a reliable and fast-to-compute early stopping criterion, to effectively evade excessive and compute-intensive over-optimization. The tree search terminates early when improvements become statistically insignificant, thereby adapting to the convergence behavior of an inference in a dataset-specific manner. Early stopping yields speedups of up to 5$\times$, while preserving statistical plausibility in up to 98% of the cases, compared to RAxML-NG v1.2.
Third, I conduct a systematic investigation of overfitting in ML tree inference tools, specifically focusing on the potential impact of overfitting the tree topology. The results show that ML tools, and, more importantly, the thorough ones, do not exhibit overfitting. Specifically, overfitting trends are detected in less than 1% of the tested datasets. Further, holdout-validation techniques that are commonly used in deep learning offer no practical benefit in the context of ML-based tree inference.
Fourth, I propose the default tree search heuristic for RAxML-NG v2.0, by synthesizing the insights attained during the Adaptive RAxML-NG and early stopping developments, as well as the deeper understanding of heuristic behavior obtained from the overfitting investigation. Alongside the default heuristic, I propose an accelerated fast heuristic alternative that practitioners may invoke when analyzing very large phylogenies comprising thousands of taxa, on which thorough inferences become prohibitive. The proposed fast heuristic achieves substantially higher accuracy than competing tools with comparable runtimes, such as FastTree.
Finally, I provide an efficient parallelization of Gentrius, a branch-and-bound algorithm that enumerates all stand trees of a given partitioned MSA with missing data. The parallelization deploys a thread-pooling mechanism to reduce load imbalance, and thereby achieves high parallel efficiency with linear scalability on up to 16 cores.