Noun compound compositionality in scientific English

Date

2025

Abstract

This work investigates noun compound compositionality in English scientific text. I examine the contributions and characteristics of different features for predicting compound compositionality with regard to both constituents, focusing on their diachronic development. To this end, I create a dataset of noun-noun compounds from English scientific text and have each compound annotated for compositionality with regard to its head and its modifier. I extract different types of diachronic features for these compounds from the Royal Society Corpus, a large diachronic corpus of scientific text, among them frequency-based, information-theoretic and context-vector-based features. Compositionality prediction is framed as binary classification between a high- and a low-compositionality class, using a variety of feature settings. These experiments are also extended to a general-domain corpus to investigate domain and data sparsity issues. The results show that a variety of features, including frequency, cosine similarity and dispersion features, are predictive of compound-constituent compositionality, but that no feature is reliably the best across constituents and settings. Combining different types of features is usually not helpful. I find substantial differences between the two constituent tasks. Using or including general-domain data raises problems of data sparsity. Data availability and target set selection play important roles.
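
To make the context-vector features concrete, the following is a minimal sketch of how a cosine similarity between a compound and each of its constituents might feed a high/low compositionality decision. It is an illustration only, not the method used in this work: the function names, the toy vectors and the 0.5 threshold are all assumptions, and a fixed threshold stands in for whatever classifier was actually trained.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def constituent_similarities(compound_vec, modifier_vec, head_vec):
    """One similarity feature per constituent task (hypothetical feature names)."""
    return {
        "sim_modifier": cosine_similarity(compound_vec, modifier_vec),
        "sim_head": cosine_similarity(compound_vec, head_vec),
    }


def classify(similarity, threshold=0.5):
    """Binary high/low decision; the threshold value is an arbitrary assumption."""
    return "high" if similarity >= threshold else "low"


# Toy context vectors, invented purely for illustration.
compound = [1.0, 0.0, 1.0]
modifier = [0.0, 1.0, 0.0]
head = [1.0, 0.0, 1.0]

features = constituent_similarities(compound, modifier, head)
labels = {name: classify(value) for name, value in features.items()}
```

In this toy case the compound shares its contexts with the head but not with the modifier, so the two constituent tasks receive different labels, which mirrors the idea of treating head and modifier compositionality as separate prediction problems.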
