Morphological processing of compounds for statistical machine translation

Cap, Fabienne

Files

fabienne_cap_dissertation.pdf (1.62 MB)

Date

2014

Authors

Cap, Fabienne

Abstract

Machine translation denotes the translation of a text written in one language into another language, performed by a computer program. In the age of the internet and globalisation, the need for machine translation has grown steadily. Consider the European Union, with its 24 official languages, into which each official document must be translated. Without computer-aided translation systems, translating these documents would be far less manageable and far less affordable.

Most state-of-the-art machine translation systems are based on statistical models. These are trained on a bilingual text collection to “learn” translational correspondences between words (and phrases) of the two languages. The underlying text collection must be parallel, i.e. the content of each line must correspond exactly to the translation of that line in the other language. Once trained, the statistical models can be used to translate new texts. However, one drawback of Statistical Machine Translation (SMT) is that it can only translate words that have occurred in the training texts.

This applies in particular to SMT systems designed for translating from and into German. German is widely known for its productive word formation processes. Speakers of German can put together existing words to form new words, called compounds. An example is the German “Apfel + Baum = Apfelbaum” (= “apple + tree = apple tree”). Theoretically, there is no limit to the length of a German compound. Whereas “Apfelbaum” (= “apple tree”) is a rather common German compound, “Apfelbaumholzpalettenabtransport” (= “apple|tree|wood|pallet|removal”) is a spontaneous new creation, which (probably) has not occurred in any text collection yet. The productivity of German compounds leads to a large number of distinct compound types, many of which occur only with low frequency in a text collection, if they occur at all. This makes German compounds a challenge for SMT systems, as only words that have occurred in the parallel training data can later be translated by the systems. Splitting compounds into their component words can solve this problem. For example, once “Apfelbaumholzpalettenabtransport” is split into its component words, it becomes intuitively clear that “Apfel” (= “apple”), “Baum” (= “tree”), “Holz” (= “wood”), “Palette” (= “pallet”) and “Abtransport” (= “removal”) are all common German words, which should have occurred much more often in any text collection than the compound as a whole. Splitting compounds thus potentially makes them translatable part by part.
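The idea of splitting a compound into known component words can be illustrated with a minimal sketch. This is a simple vocabulary-based splitter, not the morphologically aware method developed in the thesis. The names `VOCAB`, `LINKERS` and `split_compound` are hypothetical, the toy vocabulary stands in for words seen in training data, and the short list of linking elements is an assumption made for illustration.

```python
from functools import lru_cache

# Toy vocabulary: a hypothetical stand-in for words seen in the training data.
VOCAB = {"apfel", "baum", "holz", "palette", "abtransport"}

# A few German linking elements (assumed, not exhaustive); "" means no linker.
LINKERS = ("", "s", "n", "en", "es")


def split_compound(word, vocab=VOCAB, linkers=LINKERS):
    """Split `word` into known parts; return None if no complete split exists.

    Among all complete splits, the one with the fewest parts is preferred.
    """
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        # Best split of word[start:], as a tuple of parts, or None.
        if start == len(word):
            return ()
        best = None
        for end in range(start + 1, len(word) + 1):
            part = word[start:end]
            if part not in vocab:
                continue
            for link in linkers:
                if not word.startswith(link, end):
                    continue
                nxt = end + len(link)
                # A linking element must be followed by another part.
                if link and nxt == len(word):
                    continue
                rest = solve(nxt)
                if rest is not None:
                    candidate = (part,) + rest
                    if best is None or len(candidate) < len(best):
                        best = candidate
        return best

    result = solve(0)
    return list(result) if result is not None else None


if __name__ == "__main__":
    print(split_compound("Apfelbaumholzpalettenabtransport"))
    print(split_compound("Birnbaum"))  # "birn" is not in the toy vocabulary
```

With this toy vocabulary, the long compound is decomposed into five known words, while a word with an unknown part yields no split. A purely vocabulary-based splitter like this one can produce linguistically implausible splits on real data, which is one motivation for bringing in morphological knowledge.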

This thesis investigates whether morphologically aware compound splitting improves translation performance compared to previous approaches to compound splitting for SMT. To answer this question, we examine both translation directions of the language pair German and English. Several approaches to compound splitting have been proposed for SMT systems translating from German to English. However, the problem has mostly been ignored for the opposite translation direction, from English to German. Note that this direction is the more challenging one: compounds must be split prior to training and translation, and after translation they must be accurately reassembled. Moreover, German has a rich inflectional morphology. For example, it requires agreement among all morphologically marked components of a noun phrase. In this thesis, we introduce a compound processing procedure for SMT that can put together new compounds that have not occurred in the parallel training data and inflect these compounds correctly, in accordance with their context. Our work is the first to take syntactic information derived from the source-language sentence (here: English) into consideration when deciding which simple words to merge into compounds.

We evaluate the quality of our morphological compound splitting approach using manual evaluations. We measure the impact of our compound processing approach on the translation performance of a state-of-the-art, freely available SMT system. We investigate both translation directions of the language pair German and English. Whenever possible, we compare our results to previous approaches to compound processing, most of which work without morphological knowledge.

The term machine translation describes translation from one natural language into another with the aid of a computer or computer program. In the age of the internet and increasing globalisation, machine translation systems have become ubiquitous. Consider the European Union, with its 24 official languages, in which every official EU document must be available. Without computer-aided systems, translating official documents would hardly be manageable and, above all, would be unaffordable. Today's machine translation systems are mostly based on statistical models. These are trained on a bilingual text collection in order to “learn” word correspondences between the two languages. The underlying text collection, consisting of millions of sentences, must be available in parallel form, i.e. the content of each line must correspond exactly to the translation of that line in the other language. Once the statistical models have been trained, they can be applied to the translation of new texts. A decisive drawback of Statistical Machine Translation (SMT) is that only words and constructions that have previously occurred in the large training text collection can be translated.

This applies in particular to SMT systems designed for translating from and into German. German is widely known for its productive word formation processes. Speakers of German can form new words at any time by combining existing words; these new words are called compounds. An example is “Apfel + Baum = Apfelbaum”. German compounds can in theory become infinitely long. Whereas “Apfelbaum” is a fairly common and therefore frequently occurring compound, “Apfelbaumholzpalettenabtransport” is a spontaneous new formation for which there is (presumably) no attestation yet. The productivity of German compounds leads to a very large number of distinct compound types, many of which have occurred only rarely (or not at all) in texts. This makes German compounds problematic for SMT systems, since only words that have occurred in the training texts can be translated by the systems. Splitting compounds into their component words can remedy this. If, for example, “Apfelbaumholzpalettenabtransport” is split into its components, it quickly becomes clear that “Apfel”, “Baum”, “Holz”, “Palette” and “Abtransport” are all ordinary German words that are more likely to have occurred in the training texts than the compound itself. Splitting compounds thus potentially makes them translatable word by word.

This dissertation addresses the question of whether splitting German compounds with the help of morphological knowledge can improve the translation quality of an SMT system, compared to earlier methods of compound splitting. To this end, we investigate both translation directions of the language pair German and English. Whereas several approaches to compound splitting are already available for SMT systems translating from German to English, the problem has so far been largely ignored for the opposite direction, from English to German. For one thing, when translating from English into German, the German compounds must not only be split before translation but must also be correctly reassembled afterwards. For another, German has a rich inflectional morphology, which requires, for example, agreement of all morphologically marked features within a noun phrase. In this dissertation, we present for the first time a tool for compound processing in SMT that can, where needed, assemble compounds that have not occurred in the training texts and can also give these compounds a correct inflectional ending, depending on their immediate context. For the first time, the decision as to which simple words should be merged into compounds after translation is made using syntactic information derived from the source-language sentence to be translated (in this case: English).

We assess the quality of our morphological approach to compound splitting by means of manual evaluations, and we also measure the impact of our compound processing on the translation quality of a current, freely available SMT system. We investigate both translation directions of the language pair German and English. Wherever possible, we compare our results with earlier approaches to compound processing, most of which work without morphological knowledge.

URI

http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-97682
http://elib.uni-stuttgart.de/handle/11682/3491
http://dx.doi.org/10.18419/opus-3474

Collections

05 Fakultät Informatik, Elektrotechnik und Informationstechnik
