Second Dataset: Labeled Reference Lists for Image Segmentation

For our Linked Open Citation Database, we develop new approaches to extract reference data from reference lists. One step in this process is segmenting such lists into individual references: for each reference, a bounding box is determined.

For further training and evaluation purposes, we labeled 2,402 additional pages containing references from books and chapters.

The coordinates for the first box are:

<xmin>194</xmin>
<ymin>700</ymin>
<xmax>1758</xmax>
<ymax>800</ymax>

See here for the complete XML of this page file with all boxes.
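As a rough illustration of how such a box could be read programmatically, the following Python sketch parses the four coordinate elements shown above and derives the box width and height. The enclosing `<box>` element and the function names `read_box` and `box_size` are assumptions made for this example only; the actual page files may nest the coordinates differently, and the values are presumably pixel coordinates.

```python
import xml.etree.ElementTree as ET

# The four coordinate elements from the text, wrapped in a hypothetical
# <box> root element so that the fragment parses as well-formed XML.
FRAGMENT = """
<box>
  <xmin>194</xmin>
  <ymin>700</ymin>
  <xmax>1758</xmax>
  <ymax>800</ymax>
</box>
"""


def read_box(xml_text):
    """Return (xmin, ymin, xmax, ymax) as integers."""
    root = ET.fromstring(xml_text)
    return tuple(int(root.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))


def box_size(box):
    """Return (width, height) of a box given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return xmax - xmin, ymax - ymin


box = read_box(FRAGMENT)
width, height = box_size(box)
print(box, width, height)
```

For the first box, this gives a width of 1564 and a height of 100, which is consistent with a single reference spanning most of the page width.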

The complete dataset can be downloaded from MADATA, together with the bibliographic information needed to create data citations: https://doi.org/10.7801/283