
Classification Methods for 16S rRNA Based Functional Annotation

Kulakowski, Rafal; Lausen, Adi; Low-Decarie, Etienne; Lausen, Berthold

Abstract:
Microbial communities play an essential role in Earth’s ecosystems. The goal of this study was to investigate whether the functional potential of microorganisms forming these diverse communities can be directly identified using a 16S rRNA marker gene with supervised learning methods. The recently developed FAPROTAX database has been used along with the SILVA database to produce a training set where 16S rRNA sequences are linked to a number of metabolic functions. Since gene sequences cannot be explicitly used as feature vectors by most classification algorithms, the present research aimed to investigate possible feature engineering approaches for 16S rRNA. Techniques based on Multiple Sequence Alignment (MSA) and N-grams are proposed and tested. The results showed that the feature representation based on N-grams outperformed the one based on MSA, especially when implemented with large and diverse functional groups. This suggests that a clustering-like alignment procedure results in a biased feature representation of the marker gene. Since classifiers trained using Random Forest and Support Vector Machine techniques were able to accurately detect a range of functional groups, it is concluded that the 16S rRNA gene provides substantial information for the direct identification of functional capabilities.
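To illustrate the N-gram idea mentioned in the abstract, the following is a minimal sketch of how a 16S rRNA sequence might be turned into a fixed-length feature vector. It is not the authors' implementation: the function name `ngram_features`, the choice of overlapping N-grams, the default `n=3`, the normalisation to relative frequencies, and the handling of ambiguous bases (they are simply ignored) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: only the four unambiguous bases are used; N-grams that
# contain any other symbol (e.g. N) are ignored.
ALPHABET = "ACGT"


def ngram_features(sequence, n=3):
    """Return relative frequencies of overlapping N-grams in a fixed order.

    The vector has len(ALPHABET) ** n entries, one per possible N-gram,
    sorted in the order produced by itertools.product.
    """
    # Treat RNA (U) and DNA (T) notation as equivalent.
    seq = sequence.upper().replace("U", "T")
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=n)]
    # Count every overlapping window of length n.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Normalise by the number of valid N-grams only.
    total = sum(counts[gram] for gram in vocabulary)
    return [counts[gram] / total if total else 0.0 for gram in vocabulary]


if __name__ == "__main__":
    # Hypothetical toy fragment, not a real 16S rRNA sequence.
    vector = ngram_features("AGAGUUUGAUCCUGGCUCAG", n=2)
    print(len(vector))
```

Every sequence maps to a vector of the same length regardless of how long the sequence is, so no alignment step is needed before such vectors are passed to a classifier such as a Random Forest or a Support Vector Machine.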

Open Access


Publisher's version
DOI: 10.5445/KSP/1000085951/17
Published on 13 May 2020
Affiliated institution(s) at KIT: Institut für Informationswirtschaft und Marketing (IISM)
Publication type: Journal article
Publication year: 2018
Language: English
Identifier: ISSN: 2363-9881
KITopen-ID: 1000119314
Published in: Archives of Data Science, Series A (Online First)
Volume: 4
Issue: 1
Pages: A17, 23 pp., online