05 Fakultät Informatik, Elektrotechnik und Informationstechnik
Permanent URI for this collection: https://elib.uni-stuttgart.de/handle/11682/6
Item Open Access
Just-in-time pruning of large prompted language models (2024)
Bareiß, Patrick
Prompting of transformer-based autoregressive large language models (LLMs) is a powerful and ergonomic approach to solving language processing tasks (sentiment analysis, question answering, ...) using little data. However, for each individual task/prompt it is also wasteful: only a fraction of the generality in the underlying model is ever used. This raises the question: is this also reflected in the model in the form of irrelevant model parts (for instance, layers) that do not affect task-specific accuracy when pruned away? Previous work does not address this question at a level that preserves the natural affordances of prompting: (1) low data requirements and (2) the ability to reuse the same deployed model for multiple tasks. We propose a new approach that can identify and remove irrelevant model parts if they exist. It requires no additional data (we generate it instead) and removes irrelevant parts only just before a prompt is passed as input to the model, i.e. just-in-time. After the prompt has been processed, we re-add the removed parts to the model. In this way, we can reuse the model and remove (potentially different) irrelevant model parts for another prompt. We identify a class of model parts for which pruning and re-adding is efficient, which in turn allows for efficient just-in-time pruning. In our experiments we find that irrelevant just-in-time prunable model parts do exist for many prompts (for Mistral-7B and GPT-2-XL) and can be removed to a substantial degree, reducing FLOPs by up to 46% while preserving the accuracy of the original model.
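
To make the prune/re-add cycle concrete, the following is a minimal sketch, not the thesis implementation (which also covers how irrelevant parts are identified): it assumes a HuggingFace-style decoder-only model whose transformer blocks live in `model.model.layers` (as in Mistral-7B), and the pruned layer indices in the usage line are purely hypothetical.

```python
# Minimal sketch of just-in-time layer pruning: temporarily drop a set of
# decoder blocks for one prompt, then restore them so the same deployed model
# can serve the next prompt with a (potentially different) set of blocks removed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed model; other architectures may
tokenizer = AutoTokenizer.from_pretrained(model_name)  # expose their blocks elsewhere
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def next_token_with_jit_pruning(prompt: str, prune_idx: set[int]) -> str:
    """Score one prompt with the given decoder blocks removed, then re-add them."""
    full_layers = model.model.layers              # keep a handle on the original blocks
    model.model.layers = torch.nn.ModuleList(     # prune just-in-time for this prompt
        [layer for i, layer in enumerate(full_layers) if i not in prune_idx]
    )
    try:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**inputs, use_cache=False).logits
        return tokenizer.decode([int(logits[0, -1].argmax())])
    finally:
        model.model.layers = full_layers          # re-add the removed parts afterwards

# Hypothetical usage: suppose layers 20-23 were found irrelevant for this prompt.
print(next_token_with_jit_pruning("Review: great movie! Sentiment:", {20, 21, 22, 23}))
```

Because only the ModuleList reference is swapped and no weights are copied or moved, pruning and re-adding at this granularity is cheap enough to repeat for every prompt, which is the property the abstract describes for its efficiently prunable class of model parts.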