Phylogeny-aware identification and correction of taxonomically mislabeled sequences

Kozlov, Alexey M.; Zhang, Jiajie; Yilmaz, Pelin; Gloeckner, Frank Oliver; Stamatakis, Alexandros

doi:10.1093/nar/gkw396

Institut für Theoretische Informatik (ITI), Karlsruher Institut für Technologie (KIT)

Abstract (English):

Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labor-intensive manual curation process. Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences (‘mislabels’) using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity/91.7% precision) as well as correction (94.9% sensitivity/89.9% precision) of mislabels. ...
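
The sensitivity and precision figures quoted in the abstract follow the standard definitions for a detection task. The sketch below only illustrates those definitions for mislabel identification; it is not SATIVA's code, and the function name `identification_metrics` and the toy sequence IDs are hypothetical.

```python
def identification_metrics(true_mislabels, flagged):
    """Return (sensitivity, precision) for a set of flagged sequence IDs.

    true_mislabels: IDs known to be mislabeled (e.g. from a simulation).
    flagged: IDs that a detection method reported as mislabeled.
    Both argument names are illustrative assumptions.
    """
    true_mislabels = set(true_mislabels)
    flagged = set(flagged)
    true_positives = len(true_mislabels & flagged)
    # Sensitivity: fraction of real mislabels that were flagged.
    sensitivity = true_positives / len(true_mislabels) if true_mislabels else 0.0
    # Precision: fraction of flagged sequences that are real mislabels.
    precision = true_positives / len(flagged) if flagged else 0.0
    return sensitivity, precision


# Hypothetical toy data: four simulated mislabels, four flagged sequences.
truth = {"seq1", "seq2", "seq3", "seq4"}
reported = {"seq1", "seq2", "seq3", "seq9"}
sensitivity, precision = identification_metrics(truth, reported)
print(f"sensitivity={sensitivity:.2f}, precision={precision:.2f}")
```

With simulated data the true mislabels are known in advance, which is what makes both quantities computable.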

Full text (KITopen)
DOI: 10.5445/IR/1000057736

External links

Original publication
DOI: 10.1093/nar/gkw396

Scopus
Citations: 70

Web of Science
Citations: 68

Dimensions
Citations: 99

Affiliated institution(s) at KIT	Institut für Theoretische Informatik (ITI)
Publication type	Journal article
Publication year	2016
Language	English
Identifier	ISSN: 0305-1048; urn:nbn:de:swb:90-577369; KITopen-ID: 1000057736
Published in	Nucleic Acids Research
Publisher	Oxford University Press (OUP)
Volume	44
Issue	11
Pages	5022-5033
Indexed in	PubMed, Dimensions, Web of Science, Scopus
