Please use this identifier to cite or link to this item: http://dx.doi.org/10.18419/opus-3323
Author(s): Thiele, Gregor
Title: Graphical error mining for linguistic annotated corpora
Issue date: 2013
Document type: Thesis (Diplom)
URI: http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-93863
http://elib.uni-stuttgart.de/handle/11682/3340
http://dx.doi.org/10.18419/opus-3323
Abstract: Corpora contain linguistically annotated data. Producing these annotations is a complex process that easily leads to inconsistencies within the annotation. Since corpora are used to evaluate automatic language processing systems, the evaluation may suffer when the data contain too many errors. This thesis focuses on finding erroneous annotations within corpora. To detect sequence annotation errors within part-of-speech tags, we implemented the algorithm introduced by Dickinson and Meurers (2003). Additionally, for structured annotations we chose the approach shown in Boyd et al. (2008), which targets inconsistency within dependency structures. We designed and built a graphical user interface (GUI) that is easy to handle and user-friendly. Combining state-of-the-art error detection algorithms with a user-friendly interface widens their range of application, because the algorithms can be used by a wider audience without deeper knowledge of computers. The system gives even non-expert users the capability to find inconsistent POS tags and dependency structures within a corpus. We evaluate the system using the German TIGER corpus and the English Penn Treebank. For the TIGER corpus we also perform a manual evaluation in which we sample 115 6-grams and check manually whether they contain errors. We find that 94.96% are erroneous and that a human can easily decide the correct tag. For 4.20% we can say that they are errors, but determining the correct tag is very difficult. In total we detect errors with a precision of 99.16%. Only one case (0.84%) is not caused by inconsistency but constitutes genuine ambiguity.
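The part-of-speech error detection summarized above builds on the idea that the same word sequence annotated in different ways is a candidate inconsistency. The following Python sketch illustrates that idea in a simplified, brute-force form: it collects every word n-gram and reports those that occur with more than one tag sequence. It is not the thesis implementation and not the full algorithm of Dickinson and Meurers (2003); the function name variation_ngrams, the data layout (lists of word/tag pairs), and the toy sentences are assumptions made purely for illustration.

```python
from collections import defaultdict


def variation_ngrams(sentences, n):
    """Return word n-grams that occur with more than one tag sequence.

    sentences: list of sentences, each a list of (word, tag) pairs.
    Hypothetical helper for illustration only.
    """
    seen = defaultdict(set)  # word n-gram -> set of tag sequences observed
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            words = tuple(word for word, _ in window)
            tags = tuple(tag for _, tag in window)
            seen[words].add(tags)
    # Keep only n-grams whose identical words carry differing annotations.
    return {words: tags for words, tags in seen.items() if len(tags) > 1}


# Invented toy corpus: "can" is tagged NN once and MD once in the same context.
toy_corpus = [
    [("the", "DT"), ("can", "NN"), ("fell", "VBD")],
    [("the", "DT"), ("can", "MD"), ("fell", "VBD")],
    [("we", "PRP"), ("can", "MD"), ("go", "VB")],
]

candidates = variation_ngrams(toy_corpus, 3)
```

Here candidates contains the single trigram "the can fell", because it appears with two different tag sequences. Longer contexts make it more likely that such variation is an annotation error rather than genuine ambiguity, which is why the manual evaluation described in the abstract looks at 6-grams.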
Appears in collections: 05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Files in this item:
File          Description  Size     Format
DIP_3568.pdf               1.71 MB  Adobe PDF


All items in this repository are protected by copyright.