Automatic Layout Analysis and Visual Exploration of Multidimensional Datasets with Applications in the Digital Humanities

Chandna, Swati

doi:10.5445/IR/1000089239

Abstract:

The rapid developments of recent years in storage capacity, computing power, and complex algorithms are used by scientists in almost all disciplines to extract information from their scientific data. The digital humanities, which apply computer-aided methods in humanities disciplines, likewise have a growing number of handwritten historical documents available for analysis and thereby for gaining insights.

Document layout analysis identifies the physical regions in images of a document and uses them to determine precise information about these regions. Traditional methods, however, are restricted to a limited set of document structures, produce proprietary data formats, and offer no way to explore the identified physical regions and derive information from them. The subject of this dissertation is therefore the research and development of a generic method that can be applied to a wide variety of documents, produces reproducible and deterministic results, and enables humanities researchers to explore their data and derive valuable insights.

The first component of the method is a generic and fully automatic approach for identifying physical regions, such as text and picture regions, in document images and for extracting a variety of layout features of these regions. Owing to the characteristics of the approach, the results are both deterministic and reproducible, and they are stored in a standard document representation format that provides information about the properties of the document image, the layout structure, and the page content. The evaluation against ground truth data shows that the quality of the presented approach is comparable to that of traditional methods.

The second component is the application of the layout analysis and feature extraction to the large and heterogeneous dataset of the “Virtual Scriptorium St. Matthias” with 150,000 handwritten manuscript pages. Its application to printed Spanish magazines, PDF documents, Aristoteles documents, the Parzival, and documents of the Saint Gall database demonstrates the transferability and generality of the approach.

The third component of the method is a generic design strategy that enables developers to efficiently select and combine information visualization techniques tailored to the respective use case. In this thesis, the strategy is used to derive suitable information visualization techniques for multidimensional text document data.

The fourth component is the information visualization design developed in this work, whose diverse elements are coordinated with one another and influence each other. This component enables researchers to explore their data and derive valuable information, to grasp the external structure of numerous documents at a glance, and to determine correlations, outliers, clusters, and value ranges. The qualitative evaluation and the feedback from the humanities researchers show that the visualization design enables the examination of heterogeneous information in the handwritten historical documents and can provide valuable information for a more precise physical layout analysis.

In summary, this dissertation enables domain experts in the digital humanities to explore the identified physical regions and information, to derive novel insights, and to discover previously hidden relationships in their data.

Abstract (English):

Rapid developments in computer technology have led to advances in almost every research discipline. Researchers from various disciplines rely on computers, whether for computing power, storage capacity, or advanced algorithms, to extract information from their scientific data. Digital humanities, which apply computer-aided methods in humanities disciplines such as literary studies and history, are gaining in importance due to the rapidly growing number of handwritten historical document images available for analysis and for gaining insights.

Document layout analysis is essential for identifying the physical regions enclosed in document images. It is used to determine precise information about these physical regions. Previous research has focused on various methods for identifying different physical regions of such document images, and these methods provide significant improvements in speed and accuracy. However, traditional methods are limited to a specific set of document layout structures, produce results in proprietary data formats, and do not allow exploration of the identified physical regions and the derived information.

The scope of this thesis is the research and development of a generic method that can be applied to a variety of documents with overlapping layout, generates reproducible and deterministic results, and enables humanities researchers to explore their data and gain valuable insights.

The first component of this method is a generic and fully automated approach for identifying physical regions, such as text regions and picture regions, enclosed in the document images. This approach is also capable of extracting various layout features of the identified physical regions. Due to its fully automatic nature, the results produced by this approach are deterministic and reproducible, and they adhere to a standard document representation format that records information about the document image characteristics, layout structure, and page content. Moreover, the ground truth evaluation shows that the results produced by the approach in this thesis are comparable to the results produced by traditional methods or tools.

The second component of this method is the application of the proposed layout analysis approach to a large and heterogeneous set of document images in order to identify the physical regions enclosed in them and to extract their corresponding layout features. The proposed approach is applied to 150,000 handwritten document images digitized within the scope of the project “Virtual Scriptorium St. Matthias”. Its generality is demonstrated by applying the layout analysis approach to printed Spanish magazines, PDF documents, Aristoteles documents, Parzival, and the Saint Gall database.

The third component of this thesis is a generic design strategy that helps information visualization designers efficiently choose suitable information visualization techniques, and combinations of them, for a particular application. In this thesis, it is applied to multidimensional and text document data.

The fourth component is the multiple-coordinated information visualization design. This component enables researchers in the domain of digital humanities, first, to explore their data to determine valuable information; second, to view the complete physical structure of multiple documents at a single glance; and third, to determine correlations, outliers, clusters, and ranges of values. The qualitative evaluation of the visualization design and the feedback from the humanities researchers show that the design is capable of exploring different kinds of information in handwritten historical document images and of providing beneficial information that may contribute to a more precise physical layout analysis.

As a result, this research work enables domain experts in the field of digital humanities to explore the identified physical regions and their corresponding layout features more engagingly, to gain better insights, and to discover hidden knowledge in their data.

Affiliated institution(s) at KIT	Institut für Prozessdatenverarbeitung und Elektronik (IPE); Scientific Computing Center (SCC)
Publication type	University thesis
Publication year	2019
Language	English
Identifier	urn:nbn:de:swb:90-892390; KITopen-ID: 1000089239
HGF program	46.12.02 (POF III, LK 01) Data Activities
Edition	1
Publisher	Karlsruher Institut für Technologie (KIT)
Extent	IX, 189 pp.
Type of thesis	Dissertation
Faculty	Fakultät für Informatik (INFORMATIK)
Institute	Scientific Computing Center (SCC)
Examination date	16.05.2018
Project information	eCodicology (BMBF, 01UG1350C); SFB 980/2 (DFG, DFG KOORD, SFB 980/2 2016)
Keywords	Digital Humanities, Generic Automatic Layout Analysis, Handwritten Document Images, Feature Extraction, Information Visualization, Generic Design Strategy, Multiple-coordinated Information Visualization Design
Indexed in	OpenAlex
Referee/Supervisor	Dachsbacher, C.

Repository KITopen
