
The Effect of Preprocessing on Short Document Clustering

Koopman, Cynthia; Wilhelm, Adalbert


Natural Language Processing has become a common tool to extract relevant information from unstructured data. Messages in social media, customer reviews, and military messages are all very short and therefore harder to handle than longer texts. Document clustering is essential in gaining insight from these unlabeled texts and is typically performed after some preprocessing steps. Preprocessing often removes words, which can become risky in short texts, where the main message consists of only a few words. The effect of preprocessing and feature extraction on these short documents is therefore analyzed in this paper. Six different levels of text normalization are combined with four different feature extraction methods. These settings are all applied to K-means clustering and tested on three different datasets. The anticipated results could not be confirmed; however, other findings are insightful in terms of the connection between text cleaning and feature extraction.
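The risk that word removal poses for short texts can be illustrated with a minimal sketch. It is not taken from the paper: the stop-word list, the tokenizer, the function names, and both example messages are assumptions chosen only for illustration, and the paper's actual normalization levels are not specified in this abstract.

```python
import re

# Hypothetical stop-word list, for illustration only;
# this is NOT the list used in the paper.
STOP_WORDS = {"the", "is", "at", "on", "in", "a", "an", "to",
              "of", "and", "not", "no", "we", "are", "it"}


def tokenize(text):
    """Lowercase the text and split it on runs of non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def retained_fraction(text):
    """Return the share of tokens that survive stop-word removal."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(remove_stop_words(tokens)) / len(tokens)


# Assumed example messages: one very short, one longer.
short_msg = "It is not on"
long_msg = ("The delivery arrived late and the packaging was damaged, "
            "but customer support replaced the item quickly")

for msg in (short_msg, long_msg):
    print(f"{retained_fraction(msg):.2f}  {remove_stop_words(tokenize(msg))}")
```

With this assumed list, the short message keeps none of its four tokens, while the longer message keeps 12 of 16. A short document can therefore reach the feature extraction step with nothing left to cluster on.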

Publisher's version
DOI: 10.5445/KSP/1000098011/01
Published on 16 July 2020
Affiliated institution(s) at KIT Institut für Wirtschaftsinformatik und Marketing (IISM)
Publication type Journal article
Publication year 2020
Language English
Identifier ISSN: 2363-9881
KITopen-ID: 1000121376
Published in Archives of Data Science, Series A
Volume 6
Issue 1
Pages P01, 16 pp. online