Please use this identifier to cite or link to this item:
http://dx.doi.org/10.18419/opus-12455
Authors: | Sangolli, Suhas Devendrakeerti |
Title: | Deep learning in stream entity resolution |
Issue Date: | 2022 |
Document Type: | Master's thesis |
Pages: | 82 |
URI: | http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-124741 http://elib.uni-stuttgart.de/handle/11682/12474 http://dx.doi.org/10.18419/opus-12455 |
Abstract: | Entity Resolution (ER) determines which virtual representations of entities map to the same real-world entity. Most current ER research in big-data scenarios focuses on volume and variety. However, with increased digitization, data is generated not only in bulk but also continuously, so velocity must also be addressed in the ER domain. Another major issue in deep learning-based ER is data labelling: pre-labelled training data is hard to find, and labelling becomes even more difficult when new data is streamed continuously. In this thesis, we aim to address these issues by developing a deep learning-based classification function that processes continuously streaming entity pairs and classifies them as match or non-match. The end-to-end system has two main layers: one for training and one for prediction. In the training layer, we use a pre-trained language model (DistilBERT) as a base and train it iteratively as newer entity pairs arrive. The labelled data needed for training are obtained through active learning. The prediction layer uses the most recently trained model to classify the streaming entity pairs as match or non-match. The training and prediction layers run in parallel and independently of each other. We evaluate the proposed system on several benchmark datasets that vary in size, skewness and origin domain. As evaluation metrics, we use F1 score, loss, time and number of iterations. Our iterative model performs similarly to non-iterative models, achieving a match-class F1 score of 0.97 on benchmark datasets. |
Appears in Collections: | 05 Fakultät Informatik, Elektrotechnik und Informationstechnik |
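
The two-layer loop described in the abstract can be illustrated with a small toy sketch. It is not the thesis implementation. DistilBERT is replaced by a one-feature logistic model over token Jaccard similarity. The active-learning strategy is assumed to be uncertainty sampling, which the abstract does not specify. The two layers run one after the other rather than in parallel, so the output is deterministic. All names (`ToyMatcher`, `select_uncertain`, `run_stream`, `STREAM`) and the example records are hypothetical.

```python
import math


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two entity descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class ToyMatcher:
    """Stand-in for the fine-tuned language model (assumption: one feature)."""

    def __init__(self) -> None:
        self.w = 0.0
        self.b = 0.0

    def prob(self, pair) -> float:
        """Estimated probability that the pair is a match."""
        z = self.w * jaccard(*pair) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, pair) -> str:
        return "match" if self.prob(pair) >= 0.5 else "non-match"

    def train_step(self, labelled, lr: float = 1.0, epochs: int = 200) -> None:
        """Continue training from the current weights on newly labelled pairs."""
        for _ in range(epochs):
            for pair, y in labelled:
                grad = self.prob(pair) - y  # gradient of log loss w.r.t. the logit
                self.w -= lr * grad * jaccard(*pair)
                self.b -= lr * grad


def select_uncertain(model: ToyMatcher, pairs, k: int):
    """Uncertainty sampling: the k pairs whose probability is closest to 0.5."""
    return sorted(pairs, key=lambda p: abs(model.prob(p) - 0.5))[:k]


def run_stream(stream, k: int = 2):
    """Predict each batch with the latest model, then train on k queried labels."""
    model = ToyMatcher()
    all_predictions = []
    for batch in stream:
        truth = dict(batch)  # simulated annotator (oracle) for this batch
        pairs = list(truth)
        # Prediction layer: classify with the most recently trained model.
        all_predictions.append([model.predict(p) for p in pairs])
        # Training layer: query labels for the most uncertain pairs, then update.
        queried = select_uncertain(model, pairs, k)
        model.train_step([(p, truth[p]) for p in queried])
    return all_predictions


# Hypothetical stream: two batches of (entity pair, label), 1 = match.
STREAM = [
    [(("apple iphone 12 64gb", "iphone 12 apple 64gb"), 1),
     (("apple iphone 12 64gb", "samsung galaxy s21"), 0),
     (("sony wh-1000xm4 headphones", "sony wh-1000xm4 wireless headphones"), 1),
     (("dell xps 13 laptop", "hp spectre x360 laptop"), 0)],
    [(("canon eos r5 camera", "canon eos r5 body camera"), 1),
     (("lg oled tv 55", "lg washing machine"), 0),
     (("bose soundlink mini", "bose soundlink mini speaker"), 1),
     (("nikon d850", "canon eos r5"), 0)],
]

if __name__ == "__main__":
    for i, preds in enumerate(run_stream(STREAM), start=1):
        print(f"batch {i}: {preds}")
```

In this sketch the first batch is classified by an untrained model, and only two labels per batch are requested from the simulated annotator. The second batch is then classified by the model updated on those two labels. This mirrors, in a much simplified form, the idea of training iteratively as new pairs arrive while always predicting with the latest model.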
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
DLStreamER.pdf | | 5.66 MB | Adobe PDF | View/Open |
Items in OPUS are protected by copyright, with all rights reserved, unless otherwise indicated.