Combining Cluster Validation Indices for Detecting Label Noise

Boeva, Veselka; Kohstall, Jan; Lundberg, Lars; Angelova, Milena

doi:10.5445/KSP/1000087327/18

Abstract:

In this paper, we show that cluster validation indices can be used to filter mislabeled instances, or class outliers, prior to training in supervised learning problems. We propose a technique, called Cluster Validation Index (CVI)-based Outlier Filtering, in which mislabeled instances are identified and removed from the training set, and a classification hypothesis is then built from the remaining instances. The proposed approach assigns each instance several cluster validation scores, each representing its potential to be an outlier with respect to the clustering property that the corresponding validation measure assesses. We evaluate CVI-based Outlier Filtering and compare it against the Local Outlier Factor (LOF) detection method on ten data sets from the UCI data repository, using five well-known learning algorithms and three different cluster validation indices. In addition, we study and compare three different approaches for combining the selected cluster validation measures. Our results show that, for most learning algorithms and data sets, the proposed CVI-based outlier filtering algorithm outperforms the baseline method (LOF). The greatest increase in classification accuracy is achieved by combining the cluster validation indices with the union or rank-based median strategy together with global filtering of mislabeled instances.
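
The filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the three validation indices, so the per-instance silhouette score is assumed here as a single stand-in index, class labels are treated as cluster assignments, and the function names and the threshold of 0 are hypothetical choices made for illustration only.

```python
import math


def silhouette_scores(points, labels):
    """Per-instance silhouette score, treating class labels as clusters.

    Assumption: silhouette is used as one example of a cluster
    validation index; the paper's own choice of indices may differ.
    """
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Group distances from p to every other instance by class label.
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        own = dists.pop(lab, [])
        if not own or not dists:
            # Singleton class or only one class present: score 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                            # mean distance to own class
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other class
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def filter_suspects(points, labels, threshold=0.0):
    """Return indices of instances to keep, plus all scores.

    Instances scoring below `threshold` (a hypothetical cut-off) are
    treated as possible label noise and dropped before training.
    """
    scores = silhouette_scores(points, labels)
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    return keep, scores


# Toy data: the last point lies among the class-0 points but is labeled 1.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (0.5, 0.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
keep, scores = filter_suspects(points, labels)
print("kept indices:", keep)
```

On this toy data the last instance gets a negative score and is dropped, while the other seven are kept. A combination strategy such as the union mentioned in the abstract would flag an instance if any one of several such indices marks it as suspect.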


Published on 15.07.2020


Affiliated institution(s) at KIT	Department of Economics and Management – Institute of Information Systems and Marketing (IISM)
Publication type	Journal article
Publication year	2018
Language	English
Identifier	ISSN: 2363-9881, KITopen-ID: 1000121286
Published in	Archives of Data Science, Series A (Online First)
Volume	5
Issue	1
Pages	A18, 16 pp., online
Indexed in	OpenAlex

Repository KITopen
