The Effect of Preprocessing on Short Document Clustering

Koopman, Cynthia; Wilhelm, Adalbert

doi:10.5445/KSP/1000098011/01

Abstract:

Natural Language Processing has become a common tool for extracting relevant information from unstructured data. Messages in social media, customer reviews, and military messages are all very short and therefore harder to handle than longer texts. Document clustering is essential for gaining insight from these unlabeled texts and is typically performed after some preprocessing steps. Preprocessing often removes words, which is risky in short texts, where the main message consists of only a few words. This paper therefore analyzes the effect of preprocessing and feature extraction on such short documents. Six levels of text normalization are combined with four feature extraction methods. Each setting is applied to K-means clustering and tested on three datasets. The anticipated results could not be confirmed; however, other findings offer insight into the connection between text cleaning and feature extraction.
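
The risk described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the three normalization levels, the two feature extractors, the stop-word list, and the example messages are all hypothetical stand-ins for the six levels, four methods, and three datasets that the abstract mentions but does not enumerate. The sketch shows how stop-word removal can shrink a short message to a single word and how normalization levels and feature extractors combine into a grid of settings.

```python
from collections import Counter
from itertools import product

# Hypothetical stop-word list; the paper's actual list is not given here.
STOP_WORDS = {"the", "is", "a", "to", "at", "on", "it", "not", "was"}


def level_raw(text):
    """Level 0 (assumed): whitespace tokenization only."""
    return text.split()


def level_lower(text):
    """Level 1 (assumed): lowercase and strip surrounding punctuation."""
    tokens = (tok.strip(".,!?;:") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def level_no_stop(text):
    """Level 2 (assumed): level 1 plus stop-word removal."""
    return [tok for tok in level_lower(text) if tok not in STOP_WORDS]


def counts(tokens, vocab):
    """Term-count vector over a fixed vocabulary."""
    tally = Counter(tokens)
    return [tally[term] for term in vocab]


def binary(tokens, vocab):
    """Presence/absence vector over a fixed vocabulary."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


NORMALIZERS = {"raw": level_raw, "lower": level_lower, "no_stop": level_no_stop}
EXTRACTORS = {"counts": counts, "binary": binary}


def build_features(docs, normalizer, extractor):
    """Normalize every document, build a sorted vocabulary, and vectorize."""
    tokenized = [normalizer(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    return vocab, [extractor(toks, vocab) for toks in tokenized]


# Made-up short messages, for illustration only.
docs = ["It was not good.", "Meet at the bridge!", "Good good service"]

# Every (normalization level, feature extractor) pair is one setting.
for norm_name, ext_name in product(NORMALIZERS, EXTRACTORS):
    vocab, matrix = build_features(
        docs, NORMALIZERS[norm_name], EXTRACTORS[ext_name]
    )
    print(norm_name, ext_name, "vocabulary size:", len(vocab))
```

With the assumed stop-word list, the strictest level reduces "It was not good." to the single token "good", so the negation that carried the message is lost. Each feature matrix produced by the loop would then be the input to a clustering step such as K-means, which is omitted here to keep the sketch short.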

Published on 16.07.2020

Affiliated institution(s) at KIT	Institut für Wirtschaftsinformatik und Marketing (IISM)
Publication type	Journal article
Publication year	2020
Language	English
Identifier	ISSN: 2363-9881; KITopen-ID: 1000121376
Published in	Archives of Data Science, Series A
Volume	6
Issue	1
Pages	P01, 16 pp., online
Indexed in	OpenAlex