Please use this identifier to cite or link to this item: http://dx.doi.org/10.18419/opus-12783
Author(s): Sihag, Nidhi
Title: Generating TEI-based XML for literary texts
Issue date: 2022
Document type: Thesis (Master)
Pages: 74
URI: http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-128028
http://elib.uni-stuttgart.de/handle/11682/12802
http://dx.doi.org/10.18419/opus-12783
Abstract: Generating TEI-based XML files for literary texts is a long-standing problem in Natural Language Processing: it requires a system that annotates text with the relevant TEI tags. We address the challenge of enriching plain text with learned XML elements. We deal with theatre plays (i.e. dramatic texts) and letters, both encoded in XML. Such XML files currently exist for a few hundred plays and letters, but creating this annotation manually is laborious, and newly digitized plays are initially available only as (OCR'd) plain text. We therefore build an automatic process: treating the XML elements as annotations, we predict them essentially as a sequence labeling task. This thesis takes its starting point from recent advances in Natural Language Processing built on the Transformer model. One significant recent development was the release of BERT, a deep bidirectional encoder that set several new state-of-the-art results upon its release. BERT uses transfer learning to improve the modelling of language dependencies in text. While BERT is applied to many different Natural Language Processing tasks, this thesis looks at Named Entity Recognition, often framed as sequence labeling. The purpose of this thesis is to investigate whether Bidirectional Encoder Representations from Transformers (BERT) is suitable for the automatic annotation of plain text. To this end, we follow a deep learning approach, extracting the plain text together with its tags from the XML files. We use a neural network architecture based on BERT, a deep language representation model that has significantly increased performance on many natural language processing tasks. We experiment with different BERT models and input formats. The experiments are evaluated on a challenging dataset that contains letters in English and plays in multiple languages.
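
To make the sequence-labeling framing concrete, here is a minimal sketch (not taken from the thesis; the TEI fragment, element names, and BIO scheme are illustrative assumptions) of how TEI-encoded text can be flattened into token/label pairs:

    import xml.etree.ElementTree as ET

    # Illustrative TEI-style fragment: one speech from a play.
    TEI_FRAGMENT = """<sp>
      <speaker>HAMLET</speaker>
      <l>To be, or not to be, that is the question.</l>
    </sp>"""

    def tei_to_bio(xml_string):
        """Label each whitespace-separated token with the BIO-encoded
        name of its enclosing TEI element (B-speaker, I-l, ...)."""
        root = ET.fromstring(xml_string)
        pairs = []
        for elem in root.iter():
            if elem is root or not (elem.text and elem.text.strip()):
                continue
            for i, tok in enumerate(elem.text.split()):
                pairs.append((tok, ("B-" if i == 0 else "I-") + elem.tag))
        return pairs

    for token, label in tei_to_bio(TEI_FRAGMENT):
        print(token, label)

Under this framing, predicting the labels back from plain text recovers exactly the XML annotation the thesis aims to generate.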
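
The prediction side could then look like the following hedged sketch, assuming the HuggingFace transformers library; the thesis does not name its exact checkpoints, so bert-base-multilingual-cased and the toy label set below are assumptions:

    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Toy label set matching the sketch above; a real inventory would
    # cover every TEI element to be predicted.
    LABELS = ["O", "B-speaker", "I-speaker", "B-l", "I-l"]

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=len(LABELS))

    words = ["HAMLET", "To", "be,", "or", "not", "to", "be."]
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        pred = model(**enc).logits.argmax(-1)[0]

    # Map subword predictions back to whole words via word_ids().
    prev = None
    for idx, word_id in enumerate(enc.word_ids()):
        if word_id is not None and word_id != prev:
            print(words[word_id], LABELS[pred[idx]])
        prev = word_id

Since the classification head here is freshly initialised, the printed labels are arbitrary; fine-tuning on the annotated corpus is what makes the predictions meaningful.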
Appears in collections: 05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Files in this item:
File                               Description  Size       Format
Master's_Thesis_Nidhi_Sihag.pdf                 806.91 kB  Adobe PDF  View/Open


All resources in this repository are protected by copyright.