Exploring the effects of enriched English language input on language model efficiency
Abstract
Recent years have seen the advent of large-scale language modeling, as exemplified by transformer-based models like GPT or variants of the BERT architecture. These models, trained on massive datasets using compute unattainable by actors smaller than the biggest tech companies, have shown impressive feats of syntactic and semantic understanding. Naturally, interest has risen in making these models more efficient, both in terms of compute and of data requirements. Research in this area is primarily motivated by two factors: lowering the barrier for smaller actors such as research institutes or end consumers to train and run state-of-the-art models, and reducing the carbon footprint of these models. To achieve this goal, model compression techniques such as quantization, pruning, or distillation are typically employed. This work explores a different, less model-centric and more data-centric approach: modifying the training and inference data by enriching it with syntactic and semantic information. To this end, a lexical resource is created that maps English words to a form in which individual characters represent the values of a range of semantic and syntactic features, providing lexical information that is accessible to any model type operating on sub-word or character-level tokens. Different features and methods of representation are discussed, and their effect on model performance is evaluated by pretraining a small GPT-family model and fine-tuning it on downstream tasks from the SuperGLUE benchmark. Given a fixed amount of data and compute, the experiments show a performance advantage for a character-level model trained on the enriched data.
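As a rough illustration of the data-centric enrichment described in the abstract, the sketch below maps words to a form where individual characters encode feature values. The lexicon, feature inventories, character codes, and the `enrich` helper are all illustrative assumptions for this example, not the thesis's actual resource or encoding scheme.

```python
# Minimal sketch of character-coded lexical enrichment (hypothetical).
# The feature codes and the toy lexicon below are invented for illustration.

POS_CODES = {"NOUN": "N", "VERB": "V", "ADJ": "A"}            # syntactic feature
SEM_CODES = {"animal": "a", "motion": "m", "size": "s"}       # semantic feature

# Hypothetical lexical resource: word -> values of each feature
LEXICON = {
    "dog":   {"pos": "NOUN", "sem": "animal"},
    "runs":  {"pos": "VERB", "sem": "motion"},
    "small": {"pos": "ADJ",  "sem": "size"},
}

def enrich(word: str) -> str:
    """Append one character per feature value to the surface form."""
    entry = LEXICON.get(word.lower())
    if entry is None:
        return word  # leave out-of-vocabulary words unchanged
    code = POS_CODES[entry["pos"]] + SEM_CODES[entry["sem"]]
    return f"{word}|{code}"

if __name__ == "__main__":
    sentence = "The small dog runs"
    print(" ".join(enrich(w) for w in sentence.split()))
    # -> The small|As dog|Na runs|Vm
```

Because the added characters sit directly in the token stream, a sub-word or character-level model can, in principle, pick up the lexical information without any architectural changes, which is the property the abstract highlights.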