Browsing by Author "Klinger, Roman (Prof. Dr.)"
Now showing 1 - 2 of 2
Item Open Access
When few-shot fails: low-resource, domain-specific text classification with transformers (2024) Wertz, Lukas; Klinger, Roman (Prof. Dr.)
Text classification (TC) is a foundational technique in natural language processing (NLP). The ability to automatically classify texts into predetermined categories serves a critical role in numerous applications. With recent advances in NLP driven by powerful pre-trained language models, there is also rapidly growing interest in using TC for difficult, real-world problems in the business and industrial sectors as well as the scientific community. Using pre-trained language models, modern TC systems achieve state-of-the-art accuracy on benchmark datasets with only a handful of training examples for fine-tuning. However, when faced with challenging, low-resource datasets from specialised language domains, relying on small labeled datasets to train classification models is often not effective. When few-shot classification fails, we need to employ more traditional NLP techniques that increase the amount of training data in order to train accurate models. We find that existing approaches for expanding the training data are often unsuitable for deep, transformer-based classification networks. In addition, these approaches are usually tested on standard benchmark datasets, which do not properly reflect the complexity of real-world classification tasks. Consequently, there is a need for effective data augmentation or selection techniques that allow TC systems to handle complex tasks relevant to modern industrial or business applications. Our primary goal in this work is to design effective data collection and augmentation systems for TC on low-resource datasets from technical or otherwise non-standard language domains that more closely resemble real-world applications.
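The few-shot setup the abstract describes can be caricatured with a tiny stdlib-only sketch. A bag-of-words prototype classifier stands in here for a transformer fine-tuned on a handful of labeled examples; the texts, labels, and function names are invented for illustration and are not from the thesis:

```python
from collections import Counter

def featurize(text):
    # Toy bag-of-words features; real few-shot systems use contextual
    # transformer representations instead (simplification for illustration).
    return Counter(text.lower().split())

def train_prototypes(examples):
    # "Fine-tune" on a handful of (text, label) pairs by merging each
    # class's features into a single prototype, one per label.
    protos = {}
    for text, label in examples:
        protos.setdefault(label, Counter()).update(featurize(text))
    return protos

def classify(text, protos):
    # Assign the label whose prototype shares the most word mass
    # with the input text.
    feats = featurize(text)
    return max(protos, key=lambda lbl: sum(feats[w] * protos[lbl][w] for w in feats))

# A "few-shot" training set: two labeled examples in total.
few_shot = [
    ("the engine misfires on cold starts", "mechanical"),
    ("error code shown on the dashboard display", "electrical"),
]
protos = train_prototypes(few_shot)
print(classify("engine stalls when cold", protos))  # prints "mechanical"
```

With overlapping vocabulary the toy classifier works, which mirrors the benchmark setting; on specialised domain language with little lexical overlap it degrades quickly, which is the failure mode the thesis targets.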
As a consequence, our experiments also demonstrate the limits of existing approaches and strongly motivate the need for more complex, domain-specific benchmark datasets. First, we investigate the use of generative language models for data augmentation. We find that simple language edits are smoothed out by the language model, and fine-tuning on the small training data proves unstable. As such, we propose a simple generation scheme, which uses specific model prompts built from the data. Second, we employ a variety of existing selection strategies for active learning. Since we find that no strategy consistently outperforms a random selection across datasets, we design an approach that combines the strategies via reinforcement learning. This allows learning which information source for data selection is most valuable and greatly improves classification performance in the early stages of the active learning process. Overall, we find that TC is still a challenge in NLP, in particular when systems have to be designed with modern application contexts in mind. Our experiments with various baselines show that existing augmentation techniques and AL strategies cannot easily be transferred to current architectures, increasingly complex tasks, or domain-specific language. Consequently, the approaches presented in this work are an important step towards TC for sparse, complex datasets and real-world challenges.
Item Open Access
Where are emotions in text? A human-based and computational investigation of emotion recognition and generation (2023) Troiano, Enrica; Klinger, Roman (Prof. Dr.)
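The prompt-built generation scheme mentioned in the "When few-shot fails" abstract above can be illustrated with a minimal, stdlib-only sketch: sample a few labeled in-domain texts and embed them in a prompt asking a language model for a new example of that class. The template wording, function name, and example data are assumptions for illustration, not the thesis's actual scheme:

```python
import random

def build_prompt(labeled_pool, target_label, k=2, seed=0):
    # Sample up to k labeled in-domain examples of the target class
    # and embed them in a generation prompt for a language model.
    rng = random.Random(seed)
    candidates = [text for text, label in labeled_pool if label == target_label]
    shots = rng.sample(candidates, min(k, len(candidates)))
    lines = [f"Examples of class '{target_label}':"]
    lines += [f"- {text}" for text in shots]
    lines.append(f"Write one new example of class '{target_label}':")
    return "\n".join(lines)

# Hypothetical small labeled pool from a technical domain.
pool = [
    ("battery drains overnight", "electrical"),
    ("coolant leak under the hood", "mechanical"),
    ("fuse box keeps tripping", "electrical"),
]
prompt = build_prompt(pool, "electrical")
print(prompt)
```

Because the prompt is assembled from the task's own labeled data rather than relying on fine-tuning, the language model stays frozen, which sidesteps the instability of fine-tuning on very small training sets that the abstract reports.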