On the Various Semantics of Similarity in Word Embedding Models

Elekes, Ábel; Schäler, Martin; Böhm, Klemens

Abstract:

Finding similar words with the help of word embedding models has yielded meaningful results in many cases. However, the notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we analyze the statistical distribution of similarity values systematically, in two series of experiments. The first one examines how the distribution of similarity values depends on the different embedding-model algorithms and parameters. The second one starts by showing that intuitive similarity thresholds do not exist. We then propose a method stating which similarity values actually are meaningful for a given embedding model. In more abstract terms, our insights should give way to a better understanding of the notion of similarity in embedding models and to more reliable evaluations of such models.
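To make the idea of a "distribution of similarity values" concrete, the following is a minimal sketch, not the authors' method. It assumes cosine similarity as the similarity measure (the abstract does not name one) and uses random Gaussian vectors as a stand-in for real word embeddings. The function names cosine_similarity, pairwise_similarities and summarize are hypothetical helpers introduced only for this illustration.

```python
import math
import random
import statistics


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(vectors):
    """Similarity values for all unordered pairs of distinct vectors."""
    return [
        cosine_similarity(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]


def summarize(values):
    """Mean, standard deviation and selected percentiles of the values."""
    cuts = statistics.quantiles(values, n=100)  # 99 percentile cut points
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Assumption: random Gaussian vectors stand in for a trained embedding model.
# With a real model, each vector would be the embedding of one vocabulary word.
rng = random.Random(0)
toy_vectors = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(200)]

sims = pairwise_similarities(toy_vectors)
summary = summarize(sims)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Reading a single similarity value against such a summary (for example, whether it lies above the 99th percentile of all pairwise values) is one way to judge it relative to a specific model rather than against a fixed threshold. Which statistic the report actually uses is not stated in the abstract.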


DOI: 10.5445/IR/1000065330
Affiliated institution(s) at KIT: Institut für Programmstrukturen und Datenorganisation (IPD)
Publication type: Research report/Preprint
Publication year: 2017
Language: English
Identifiers: ISSN: 2190-4782
urn:nbn:de:swb:90-653309
KITopen-ID: 1000065330
Publisher: Karlsruher Institut für Technologie (KIT)
Series: Karlsruhe Reports in Informatics ; 2017,3
Keywords: Word embedding models; similarity values; semantic similarity