On the Various Semantics of Similarity in Word Embedding Models

Elekes, Ábel; Schäler, Martin; Böhm, Klemens

Abstract: Finding similar words with the help of word embedding models has yielded meaningful results in many cases. However, the notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we systematically analyze the statistical distribution of similarity values in two series of experiments. The first one examines how the distribution of similarity values depends on the different embedding-model algorithms and parameters. The second one starts by showing that intuitive similarity thresholds do not exist. We then propose a method stating which similarity values actually are meaningful for a given embedding model. In more abstract terms, our insights should pave the way for a better understanding of the notion of similarity in embedding models and for more reliable evaluations of such models.

Affiliated institution(s) at KIT: Institut für Programmstrukturen und Datenorganisation (IPD)
Publication type: Research report
Year: 2017
Language: English
Identifier: DOI(KIT): 10.5445/IR/1000065330
ISSN: 2190-4782
URN: urn:nbn:de:swb:90-653309
KITopen ID: 1000065330
Publisher: Karlsruhe
Series: Karlsruhe Reports in Informatics ; 2017,3
Keywords: Word embedding models; similarity values; semantic similarity