The challenges of German archival document categorization on insufficient labeled data

Hoppe, Fabian; Tietz, Tabea; Dessì, Danilo; Meyer, Nils; Sprau, Mirjam; Alam, Mehwish; Sack, Harald

Abstract:
Document exploration in archives is often challenging because records are not organized into topic-based categories. Moreover, archival records provide only short texts, which are often insufficient for capturing their semantics. This paper proposes and explores a dataless categorization approach that uses word embeddings and TF-IDF to categorize archival documents. It also introduces a visual approach, built on top of the word embeddings, to support data exploration. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.
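
The abstract does not spell out the pipeline, so the following is only a minimal sketch of how dataless categorization with word embeddings and TF-IDF might look, not the authors' implementation. It assumes each document is represented as the TF-IDF-weighted average of its word vectors and is assigned to the category whose label embedding is closest by cosine similarity. The 3-dimensional vectors, the example tokens, and the names `EMBEDDINGS`, `idf_weights`, `doc_vector`, and `categorize` are all hypothetical; a real setup would load pre-trained German embeddings instead.

```python
import math
from collections import Counter

# Toy 3-dimensional "embeddings"; the values are made up for illustration.
EMBEDDINGS = {
    "steuer":   [0.9, 0.1, 0.0],
    "finanzen": [1.0, 0.0, 0.1],
    "haushalt": [0.8, 0.2, 0.1],
    "schule":   [0.0, 0.9, 0.1],
    "bildung":  [0.1, 1.0, 0.0],
    "lehrer":   [0.1, 0.8, 0.2],
    "akte":     [0.4, 0.4, 0.4],
}


def idf_weights(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}


def doc_vector(doc, idf):
    """TF-IDF-weighted average of the embeddings of the known tokens."""
    dim = len(next(iter(EMBEDDINGS.values())))
    vec = [0.0] * dim
    total_weight = 0.0
    for tok, count in Counter(doc).items():
        if tok not in EMBEDDINGS:
            continue  # out-of-vocabulary tokens are skipped
        weight = count * idf.get(tok, 1.0)
        for i, x in enumerate(EMBEDDINGS[tok]):
            vec[i] += weight * x
        total_weight += weight
    return [v / total_weight for v in vec] if total_weight else vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(doc, labels, idf):
    """Pick the label whose embedding is most similar to the document vector."""
    vec = doc_vector(doc, idf)
    scores = {label: cosine(vec, EMBEDDINGS[label]) for label in labels}
    return max(scores, key=scores.get), scores


# Two short, already tokenized "records"; no labeled training data is used.
docs = [
    ["akte", "steuer", "haushalt"],
    ["akte", "schule", "lehrer"],
]
labels = ["finanzen", "bildung"]
idf = idf_weights(docs)
predictions = [categorize(doc, labels, idf)[0] for doc in docs]
```

In this toy setup the word "akte" occurs in both records and therefore receives the lowest IDF weight, so the more specific words dominate each document vector. With texts as short as archival records, a single missing or ambiguous word can shift the result, which is in line with the limitation the abstract reports.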

Publisher's version
DOI: 10.5445/IR/1000126469
Published on 20 November 2020
Affiliated institution(s) at KIT: Institut für Angewandte Informatik und Formale Beschreibungsverfahren (AIFB)
Publication type: Proceedings paper
Publication year: 2020
Language: English
Identifier: ISSN: 1613-0073
KITopen-ID: 1000126469
Published in: WHiSe 2020: Workshop on Humanities in the Semantic Web 2020 - Proceedings of the Third Workshop on Humanities in the Semantic Web (WHiSe 2020), co-located with 15th Extended Semantic Web Conference (ESWC 2020), Heraklion, Greece, June 2, 2020 (online). Ed.: A. Adamou
Event: 3rd Workshop on Humanities in the Semantic Web (WHiSe 2020), Online, 2 June 2020
Publisher: CEUR-WS
Pages: 15-20
Series: CEUR Workshop Proceedings ; 2695
Publication note: The event was held online because of the coronavirus pandemic
Keywords: Dataless Categorization, Text Categorization, Document Exploration, Cultural Heritage
Indexed in: Scopus