For our Linked Open Citation Database, we develop new approaches to extract reference data from reference lists. One step in this process is the segmentation of such lists into individual references, i.e., a bounding box is determined for each reference.
For training and evaluation purposes, we labeled 515 pages containing references from books and chapters. On each page, we manually annotated a box for every reference, resulting in a total of 10,722 boxes whose coordinates are stored in XML files, e.g.
where the first box is saved as:
See here for the complete XML of this page file with all boxes.
The complete data set can be downloaded from MADATA together with the bibliographic information to enable you to create data citations: https://doi.org/10.7801/268
A first dump of the citation data we generate in our project is available today: dump_20180118_0910.json (4.17 MB).
It is not in RDF format, but serves as an example showing the quality of the citation links. It contains 985 bibliographic resources with a total of 10,905 references, of which 951 are resolved.
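To put these figures in perspective, here is a quick back-of-the-envelope calculation. Only the three counts come from the description of the dump above; the variable names are illustrative, and nothing here reads the JSON file itself.

```python
# Counts quoted above for dump_20180118_0910.json.
resources = 985        # bibliographic resources in the dump
references = 10_905    # references across all resources
resolved = 951         # references already resolved to a target

# Share of references that are resolved, and average list length.
resolved_share = resolved / references
refs_per_resource = references / resources

print(f"Resolved references: {resolved_share:.1%}")
print(f"References per resource: {refs_per_resource:.1f}")
```

In other words, a bit under one tenth of the references in this early dump are resolved, and each resource carries roughly eleven references on average.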
Anne Lauscher will present the LOC-DB project at the symposium “75 Jahre Zukunft: Bibliotheks- und Informationsmanagement im Wandel” in Stuttgart, in a talk titled ‘Linked Open Citation Database’.
Registration for the first Linked Open Citation Database workshop on November 7, 2017, in Mannheim is open.
Prof. Dr. Kai Eckert, Anne Lauscher, HDM Stuttgart and Akansha Bhardwaj, DFKI will present the motivation and challenges of the LOC-DB project at the EXCITE Workshop 2017 on ‘Challenges in Extracting and Managing References’, to be held in Cologne, Germany, on March 30–31, 2017. The presentation is titled ‘LOC-DB: A Linked Open Citation Database provided by Libraries. Motivation and Challenges’.