ChemScanner: extraction and re-use(ability) of chemical information from common scientific documents containing ChemDraw files

Nguyen, An 1; Huang, Yu-Chieh 1; Tremouilhac, Pierre 1; Jung, Nicole 1,2; Bräse, Stefan 1,2
1 Institut für Toxikologie und Genetik (ITG), Karlsruher Institut für Technologie (KIT)
2 Institut für Organische Chemie (IOC), Karlsruher Institut für Technologie (KIT)


We developed ChemScanner, a software tool that extracts chemical information from ChemDraw binary (CDX) or ChemDraw XML-based (CDXML) files and retrieves ChemDraw schemes from DOC, DOCX, or XML documents. This facilitates the reuse of chemical information embedded in the diverse documents that serve as standard storage and communication instruments in the chemical sciences (e.g. student theses, PhD theses, or publications). The extracted information is processed into reactions, molecules, and additional text and values, and can be accessed via the ChemScanner UI. ChemScanner supports export to Excel and CML, direct import of the extracted data into the open-source ELN Chemotion, and the reuse of selected information via “copy and paste”. The software was designed with a focus on processing documents that embed molecular structure information as CDX or CDXML, as these are the most common file formats for chemical drawings. The project aims to support chemists in their efforts to re-use chemistry research data by providing them with the missing tools for automated assembly of reaction data.
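
To illustrate what retrieving an embedded scheme from a DOCX document can involve, here is a minimal Python sketch. It is not ChemScanner's own code. It relies on the fact that a DOCX file is a ZIP archive whose embedded objects are usually stored under word/embeddings/. The function name scan_docx_embeddings and the signature check are assumptions made for this example: the check only tests whether the CDX header bytes "VjCD0100" appear somewhere in the raw embedded object, which is a rough heuristic and not a proper parse of the OLE container.

```python
import io
import zipfile

# CDX files start with this 8-byte header. Searching for it inside an
# embedded OLE object is only a heuristic (assumption for this sketch).
CDX_SIGNATURE = b"VjCD0100"

# Folder in which Word usually stores embedded objects inside the archive.
EMBEDDINGS_PREFIX = "word/embeddings/"


def scan_docx_embeddings(docx_bytes: bytes) -> dict[str, bool]:
    """Map each embedded object path to whether it seems to contain CDX data.

    Hypothetical helper for illustration; not part of ChemScanner.
    """
    found: dict[str, bool] = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            # Skip everything outside the embeddings folder and directory entries.
            if not name.startswith(EMBEDDINGS_PREFIX) or name.endswith("/"):
                continue
            data = archive.read(name)
            found[name] = CDX_SIGNATURE in data
    return found
```

Calling scan_docx_embeddings(open("thesis.docx", "rb").read()) on a hypothetical file would return a dictionary of embedded object paths, each flagged with whether the CDX header was found. A real extractor would still have to unpack the OLE container and interpret the CDX or CDXML content.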

Publication type Journal article
Publication year 2019
Language English
Identifier ISSN: 1758-2946
KITopen-ID: 1000117758
HGF program 47.01.01 (POF III, LK 01) Biol.Netzwerke u.Synth.Regulat. ITG+ITC
Published in Journal of Cheminformatics
Publisher SpringerOpen
Volume 11
Issue 1
Pages Art. no.: 77
Publication note Funded by the KIT Publication Fund
Published online in advance on 11.12.2019
Keywords Data mining, Chemical data extraction, CDX, CDXML, Molecule recognition
Indexed in Web of Science

DOI: 10.5445/IR/1000117758
Published on 17.03.2020
DOI: 10.1186/s13321-019-0400-5
