
D5.2 Development of Best Practices for Large-scale Data Management Infrastructure

Stadtmüller, Steffen; Mühleisen, Hannes; Bizer, Chris; Kersten, Martin; Rijke, Arjen de; Groffen, Fabian; Zhang, Ying; Ladwig, Günter; Harth, Andreas; Trampus, Mitja

The amount of data available for processing is constantly increasing and becoming more diverse. We collect our experiences with deploying large-scale data management tools on local-area clusters or cloud infrastructures and provide guidance on using these computing and storage infrastructures. In particular, we describe Apache Hadoop, one of the most widely used software libraries for performing large-scale data analysis tasks in parallel on clusters of computers, and provide guidance on how to achieve optimal execution time when analysing large-scale data. Furthermore, we report on our experiences with three projects that provide valuable insights into the deployment and use of large-scale data management tools: the Web Data Commons project, for which we extracted all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus currently available to the public; SciLens, a local-area cluster machine built to facilitate research on data management issues for data-intensive scientific applications; and CumulusRDF, an RDF store on cloud-based architectures. We investigat ...

Affiliated institution(s) at KIT Institut für Angewandte Informatik und Formale Beschreibungsverfahren (AIFB)
Publication type Research report
Year 2012
Language English
Identifier KITopen-ID: 1000094892
Publisher KIT, Karlsruhe