Browsing by Author "Zeller, Tom"
Now showing 1 - 2 of 2
Item (Open Access): Detecting ambiguity in statutory texts (2018), Zeller, Tom

Ambiguity is ever-present in natural language production. A human typically has no difficulty selecting the right interpretation of an ambiguous expression by drawing on lexical and pragmatic knowledge. While the inclusion of broad semantic knowledge poses a challenge for general disambiguation systems and parsers, it may be a feasible approach to disambiguation in a restricted context. A domain that is particularly sensitive to ambiguity is the legal domain, especially the wording of statutory texts. Some parsing systems deal with ambiguous input by specifying all possible interpretations without explicitly choosing one, or by returning multiple parses along with their respective probabilities. This work serves two purposes. First, an application is created that takes statutory texts or single text excerpts as input and detects structural ambiguities, in the form of prepositional-phrase attachment and coordination ambiguities, as well as semantic ambiguity in the form of scopal ambiguity. Second, the detected ambiguities are filtered by including subcategorization information and by utilizing domain-specific semantic knowledge, encoded as a legal domain ontology and as selectional preferences for common legal expressions. The filtering capability and the effect of including the semantic knowledge are evaluated on the DUBLIN3 Regulation. (A minimal sketch of flagging prepositional-phrase attachment ambiguity follows this listing.)

Item (Open Access): Exploring the effects of enriched English language input on language model efficiency (2024), Zeller, Tom

Recent years have seen the advent of large-scale language modeling, as exemplified by transformer-based models like GPT or variants of the BERT architecture. These models, which are trained on massive datasets using amounts of compute unattainable for actors smaller than the biggest tech companies, have shown impressive feats of syntactic and semantic understanding. Naturally, interest has grown in making these models more efficient, in terms of compute as well as data requirements. Research in this area is primarily motivated by two factors: lowering the barrier for smaller actors, such as research institutes or end consumers, to train and run state-of-the-art models, and reducing the carbon footprint of these models. To achieve this goal, model compression techniques such as quantization, pruning, or distillation are employed. This work explores a different, less model-centric and more data-centric approach: modifying the training and inference data by enriching it with syntactic and semantic information. To this end, a lexical resource is created that maps English words to a form in which individual characters represent the values of a range of semantic and syntactic features, providing lexical information that is accessible to any model type that operates on tokens at the sub-word or character level. Different features and methods of representation are discussed, and their effect on model performance is evaluated by pretraining a small GPT-family model and fine-tuning it on downstream tasks of the SuperGLUE benchmark. Given a fixed amount of data and compute, the experiments show a performance advantage for a character-level model trained on the enriched data. (A toy sketch of the word-to-feature-code enrichment also follows this listing.)
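Relating to the first item: the sketch below shows one way prepositional-phrase attachment ambiguity could be flagged with an off-the-shelf dependency parser. It is an illustrative assumption, not the thesis's actual pipeline; it assumes spaCy with its en_core_web_sm model is installed, and the helper flag_pp_attachment_candidates and its left-context heuristic are hypothetical.

```python
# Minimal sketch: flag potential PP-attachment ambiguities in a sentence.
# Assumes spaCy + en_core_web_sm; the heuristic is illustrative only and
# does not reproduce the thesis's ontology- and preference-based filtering.
import spacy

nlp = spacy.load("en_core_web_sm")

def flag_pp_attachment_candidates(text):
    """Return prepositions that have both a noun and a verb to their left,
    i.e. cases where the PP could plausibly attach to either head."""
    doc = nlp(text)
    findings = []
    for tok in doc:
        if tok.dep_ == "prep":
            left = doc[:tok.i]
            has_noun = any(t.pos_ in ("NOUN", "PROPN") for t in left)
            has_verb = any(t.pos_ == "VERB" for t in left)
            if has_noun and has_verb:
                # Record the preposition and the head the parser chose.
                findings.append((tok.text, tok.head.text))
    return findings

print(flag_pp_attachment_candidates(
    "The Member State shall examine the application with supporting documents."))
```

In the example sentence, "with supporting documents" could attach to "examine" or to "the application"; surfacing such cases before domain knowledge filters out implausible readings is the kind of detection the first item describes.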
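Relating to the second item: a toy sketch of the general idea of appending a per-word feature code to the input text, where each character of the code stands for one feature value. The feature inventory, the code characters, the separator, and the tiny lexicon below are all made-up assumptions for illustration; the thesis's actual lexical resource and encoding are not reproduced here.

```python
# Toy sketch: enrich raw text with fixed-width feature codes per word.
# Feature set, codes, and lexicon are invented for illustration only.
FEATURES = ("pos", "number", "concreteness")

LEXICON = {
    "court":  {"pos": "N", "number": "S", "concreteness": "C"},
    "courts": {"pos": "N", "number": "P", "concreteness": "C"},
    "decide": {"pos": "V", "number": "-", "concreteness": "-"},
    "fairly": {"pos": "A", "number": "-", "concreteness": "-"},
}

def enrich(text, sep="|"):
    """Append a feature code to every known word, e.g. 'courts' -> 'courts|NPC'.
    Unknown words are passed through unchanged."""
    out = []
    for word in text.split():
        entry = LEXICON.get(word.lower())
        if entry is None:
            out.append(word)
        else:
            code = "".join(entry[f] for f in FEATURES)
            out.append(f"{word}{sep}{code}")
    return " ".join(out)

print(enrich("Courts decide fairly"))
# -> 'Courts|NPC decide|V-- fairly|A--'
```

A character-level or sub-word model sees the code characters as ordinary input, so the lexical information travels with each word without requiring any change to the model architecture or tokenizer.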