Open Access
Extracting and segmenting high-variance references from PDF documents with BERT
(2021) Evci, Hasan
The extraction and segmentation of references from scientific articles is a core task of modern digital libraries. Once references are extracted and segmented, the bibliographic information can be made publicly available and linked, enabling efficient literature study. However, references often vary in structure and content, which makes their extraction and segmentation a challenging but valuable task. The purpose of this thesis is to investigate whether Bidirectional Encoder Representations from Transformers (BERT) is suitable for the extraction and segmentation of bibliographic references. To this end, we follow a deep learning approach for extracting and segmenting references from PDF documents. We use a neural network architecture based on BERT, a deep language representation model that has significantly increased performance on many natural language processing tasks. On top of the BERT output, we place a linear-chain Conditional Random Field (CRF). We experiment with different BERT models and input formats, and we examine two approaches to reference extraction and segmentation. The experiments are evaluated on a challenging dataset that contains both English and German social science publications with highly varying references. Our results show that the best-performing BERT models were pre-trained on data similar to the data we used to fine-tune them for reference extraction and reference segmentation. Moreover, our findings show that long, context-based input sequences yield the best results. The extraction model identifies and extracts references with an average F1-score of 81.9%. References are segmented with an average F1-score of 93.6%. We show that our models compare well with one previously published approach. We conclude that BERT is a suitable choice for reference extraction and reference segmentation.
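To illustrate what a linear-chain Conditional Random Field over per-token scores contributes at prediction time, the following is a minimal sketch of Viterbi decoding in plain Python. It is not the thesis implementation: the label names (`B-REF`, `I-REF`, `O`), the emission scores standing in for BERT outputs, and the transition scores are all assumptions made up for this example.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[t][y] is the per-token score for label y at position t
    (a stand-in for the scores a BERT token classifier would produce).
    transitions[(p, y)] is the score for moving from label p to label y;
    missing pairs default to 0.0.
    """
    if not emissions:
        return []

    # best[t][y]: score of the best path that ends in label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []

    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            # Pick the previous label that maximizes path score plus transition
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, y), 0.0),
            )
            scores[y] = (
                best[-1][prev]
                + transitions.get((prev, y), 0.0)
                + emissions[t][y]
            )
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final label
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical labels and scores, chosen only to show the effect
LABELS = ["B-REF", "I-REF", "O"]
TRANSITIONS = {("O", "I-REF"): -10.0}  # discourage I-REF directly after O
EMISSIONS = [
    {"B-REF": 0.0, "I-REF": 0.0, "O": 2.0},
    {"B-REF": 1.0, "I-REF": 1.5, "O": 0.0},
    {"B-REF": 0.0, "I-REF": 2.0, "O": 0.0},
]

decoded = viterbi_decode(EMISSIONS, TRANSITIONS, LABELS)
greedy = [max(LABELS, key=lambda y: e[y]) for e in EMISSIONS]
```

In this toy input, choosing the best label per token independently gives `O, I-REF, I-REF`, which starts a reference with an inside tag. Decoding jointly with the transition penalty gives `O, B-REF, I-REF` instead, which is the kind of sequence-level consistency a CRF layer is meant to encourage.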