Sequence Labeling for Citation Field Extraction from Cyrillic Script References

Shapiro, Igor ¹; Saier, Tarek ¹; Färber, Michael

¹ Institut für Angewandte Informatik und Formale Beschreibungsverfahren (AIFB), Karlsruher Institut für Technologie (KIT)

Abstract:

Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than English and scripts other than the Latin alphabet. A significant portion of the scientific literature that is thereby excluded consists of publications written in Cyrillic-script languages. To address this problem, we introduce a new multilingual and multidisciplinary data set of over 100,000 labeled reference strings. The data set covers multiple Cyrillic-script languages and contains over 700 manually labeled references, while the remaining references are generated synthetically. With random samples of varying size drawn from this data, we train multiple well-performing sequence labeling BERT models and thus show the usability of our proposed data set. To this end, we showcase an implementation of a multilingual BERT model trained on the synthetic data and evaluated on the manually labeled references. Our model achieves an F1 score of 0.93 and thereby significantly outperforms a state-of-the-art model that we retrain and evaluate on our data.
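The abstract does not state the label set or how exactly the F1 score is computed, so the following is only a minimal sketch of how citation field extraction can be framed as sequence labeling and scored. The example reference, the label names (AUTHOR, TITLE, PLACE, YEAR, O), the helper micro_f1, and the choice of token-level micro-averaging are all assumptions made for illustration, not details taken from the paper.

```python
# Illustrative only: labels, example reference, and the scoring scheme
# (token-level micro-averaged F1) are assumptions, not the paper's setup.

def micro_f1(gold, pred, ignore="O"):
    """Token-level micro-averaged F1 over all labels except `ignore`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p and g != ignore:
            tp += 1              # correct field label
        else:
            if p != ignore:
                fp += 1          # predicted a field label that is wrong
            if g != ignore:
                fn += 1          # missed the true field label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A hypothetical tokenized Cyrillic reference string with one label per token.
tokens = ["Иванов", "И.", "И.", "Основы", "физики", ".", "Москва", ",", "2010", "."]
gold = ["AUTHOR", "AUTHOR", "AUTHOR", "TITLE", "TITLE", "O", "PLACE", "O", "YEAR", "O"]
# A hypothetical model output that mislabels the second initial as TITLE.
pred = ["AUTHOR", "AUTHOR", "TITLE", "TITLE", "TITLE", "O", "PLACE", "O", "YEAR", "O"]

score = micro_f1(gold, pred)
print(f"token-level micro F1: {score:.3f}")
```

In this toy case six of the seven field tokens are labeled correctly and one is assigned the wrong field, so precision and recall are both 6/7.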

Publisher's version

DOI: 10.5445/IR/1000149455

Published on August 4, 2022

Affiliated institution(s) at KIT	Institut für Angewandte Informatik und Formale Beschreibungsverfahren (AIFB)
Publication type	Proceedings paper
Publication year	2022
Language	English
Identifier	ISSN: 1613-0073; KITopen-ID: 1000149455
Published in	SDU 2022: Scientific Document Understanding 2022 ; Proceedings of the Workshop on Scientific Document Understanding ; co-located with 36th AAAI Conference on Artificial Intelligence (AAAI 2022) ; Remote, March 1, 2022. Ed.: A. P. Ben Veyseh
Event	Workshop on Scientific Document Understanding (SDU 2022), Online, March 1, 2022
Publisher	CEUR-WS.org
Series	CEUR Workshop Proceedings ; 3164
External relations	Abstract/Full text
Keywords	reference extraction, reference parsing, sequence labeling, Cyrillic script
Indexed in	Scopus
