A modular framework for coreference resolution

Kobdani, Hamidreza

Files

dissertation.pdf (1.3 MB)

Date

2012

Authors

Kobdani, Hamidreza

Abstract

Coreference resolution is playing an increasingly important role in a wide range of disciplines such as theoretical, corpus and computational linguistics. It has been shown to be beneficial in a number of natural language processing applications, including machine translation, automatic abstracting, information extraction and question answering. As a result, it has enjoyed increased interest in recent years. First, this thesis introduces a modular supervised system for coreference resolution. It is composed of separate, interchangeable components with clear, well-defined logical boundaries between them, which improves maintainability. This system has been used successfully in two international shared tasks on coreference resolution, where it achieved good performance, demonstrating the general validity of our design. In addition, this thesis presents a new framework for feature engineering in natural language processing that is based on a relational data model of text. It includes fast and flexible methods for implementing and extracting new features, thereby reducing the effort of creating an NLP system for a particular task. This thesis presents an instantiation and evaluation of the framework for the problem of coreference resolution in multiple languages. Competitive results were obtained in a short implementation period, which demonstrates the potential of this framework for feature engineering. This thesis also presents an unsupervised framework that bootstraps a complete coreference resolution system from word associations mined from a large unlabeled corpus. I will show that word associations are useful for coreference resolution; for example, the strong association between Obama and President is an indicator of likely coreference. Association information has so far not been used in coreference resolution because it is sparse and difficult to learn from small labeled corpora. Since unlabeled text is readily available, the unsupervised approach proposed here addresses the sparseness problem. In a self-training framework, I train a decision tree on a corpus that is automatically labeled using word associations. I will show that this unsupervised system has better coreference resolution performance than other learning approaches that do not use manually labeled data.

In recent years, coreference resolution has played an increasingly important role in various disciplines. Applying coreference resolution has been shown to benefit many tasks, such as machine translation, automatic summarization and information extraction. In this dissertation, we first present a modular system for supervised coreference resolution. It consists of distinct, interchangeable components separated by clearly defined logical boundaries that improve maintainability. This system achieved good results in two international coreference resolution competitions, which demonstrates the advantages of our design. We also present a new framework for feature engineering in natural language processing that is based on a relational data model. It comprises a fast and flexible method for implementing and extracting new features and thereby reduces the effort of building a concrete NLP system. This thesis presents an instantiation and evaluation of the framework for the problem of coreference resolution in multiple languages. We obtained good results within a short development time, which underlines the potential of our framework for feature engineering. We also present an unsupervised system in which a complete coreference resolution system is bootstrapped from word associations extracted from a large unannotated corpus. We show that word associations can be used successfully in coreference resolution. For example, the strong association between Merkel and Chancellor is an indicator of a higher probability of coreference. Until now, the large-scale application of coreference resolution has failed for lack of sufficiently large annotated datasets. Our unsupervised approach requires only unannotated text, which is available in sufficient quantity for most languages. Using the association information, we can annotate a text corpus automatically, then iteratively train our supervised system on it and re-annotate the corpus. We show that the resulting unsupervised system performs better than all unsupervised learning methods known to us.

URI

http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-71324
http://elib.uni-stuttgart.de/handle/11682/2825
http://dx.doi.org/10.18419/opus-2808

Collections

05 Fakultät Informatik, Elektrotechnik und Informationstechnik
