Can BERT generate Quebec English?: exploring and improving the acceptability of regional variation in pretrained language models

Date

2024

Abstract

While pretrained language models have transformed natural language processing by generating contextually rich language representations, their performance can vary widely across different language varieties. Monolingual models, trained on large datasets in a single language, may struggle with varieties that incorporate elements from multiple languages. This thesis examines the ability of pretrained language models, particularly BERT, to handle Quebec English, a regionally influenced variety of English shaped by contact with French. By comparing three BERT models (one monolingual, one multilingual, and one fine-tuned on Quebec English-specific data), this study evaluates their effectiveness at generating Quebec English target words and English synonyms within a masked language modeling framework. Results suggest that fine-tuning improves performance on Quebec English-specific target words, with the fine-tuned model outperforming the standard pretrained models. Findings further indicate that tokenization, sentence context, and pretraining data substantially affect prediction accuracy, with all models struggling most on infrequent, region-specific expressions. This work contributes to the broader goal of developing natural language processing tools that inclusively represent diverse linguistic communities, underscoring the importance of fine-tuning in adapting language models to regional and minority language varieties.
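The masked language modeling evaluation described above can be illustrated with a minimal sketch using the Hugging Face fill-mask pipeline. The checkpoints, example sentence, and target words below are illustrative assumptions for exposition only, not the models or data actually used in the thesis.

```python
from transformers import pipeline

# Hypothetical checkpoints standing in for the three compared models:
# a monolingual English BERT, a multilingual BERT, and (assumed path) a fine-tuned variant.
checkpoints = {
    "monolingual": "bert-base-uncased",
    "multilingual": "bert-base-multilingual-cased",
    # "quebec-finetuned": "path/to/quebec-english-finetuned",  # assumed fine-tuned checkpoint
}

# Illustrative sentence with a masked slot where either a Quebec English borrowing
# ("dépanneur", a convenience store) or a standard English synonym could appear.
sentence = "I stopped at the [MASK] to buy milk on my way home."
targets = ["dépanneur", "store"]  # Quebec English target word vs. English synonym

for name, ckpt in checkpoints.items():
    fill_mask = pipeline("fill-mask", model=ckpt)
    # Restrict scoring to the candidate words and compare their probabilities.
    for pred in fill_mask(sentence, targets=targets):
        print(f"{name}: {pred['token_str']!r} score={pred['score']:.4f}")
```

In a setup like this, a higher score for the region-specific word under the fine-tuned model than under the off-the-shelf models would reflect the kind of improvement the abstract reports; out-of-vocabulary target words may be reduced to their first subword token, which is one way tokenization can influence the scores.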
