
Phylogeny-aware identification and correction of taxonomically mislabeled sequences

Kozlov, Alexey M.; Zhang, Jiajie; Yilmaz, Pelin; Gloeckner, Frank Oliver; Stamatakis, Alexandros



Abstract (English): Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labor-intensive manual curation process. Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences (‘mislabels’) using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity/91.7% precision) as well as correction (94.9% sensitivity/89.9% precision) of mislabels. Furthermore, an analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels. Finally, we use SATIVA to perform an in-depth evaluation of alternative taxonomies for Cyanobacteria. SATIVA is freely available at https://github.com/amkozlov/sativa.
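
The sensitivity and precision figures quoted in the abstract are standard classification metrics. As a reading aid only, the sketch below shows how such values could be computed when the set of truly mislabeled sequences is known, as it would be for simulated data. It is not taken from SATIVA; the function name, the sequence IDs and the toy sets are hypothetical and do not reproduce any result from the paper.

```python
def sensitivity_precision(true_mislabels, flagged):
    """Return (sensitivity, precision) for a set of flagged sequence IDs.

    true_mislabels: IDs known to carry a wrong taxonomic label
    flagged:        IDs a detection method reported as mislabeled
    """
    true_mislabels, flagged = set(true_mislabels), set(flagged)
    tp = len(true_mislabels & flagged)   # mislabels correctly flagged
    fn = len(true_mislabels - flagged)   # mislabels that were missed
    fp = len(flagged - true_mislabels)   # correct labels wrongly flagged
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision


# Hypothetical toy data, chosen only to illustrate the arithmetic.
truth = {"seq1", "seq2", "seq3", "seq4"}
flagged = {"seq1", "seq2", "seq3", "seq9"}
sens, prec = sensitivity_precision(truth, flagged)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

In this toy case three of four true mislabels are flagged and one of four flagged sequences is a false alarm, so both values are 0.75.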


Affiliated institution(s) at KIT Institut für Theoretische Informatik (ITI)
Publication type Journal article
Year 2016
Language English
Identifier DOI: 10.1093/nar/gkw396
ISSN: 0305-1048
URN: urn:nbn:de:swb:90-577369
KITopen ID: 1000057736
Published in Nucleic Acids Research
Volume 44
Issue 11
Pages 5022-5033