Gender identity in language models : an inclusive approach to data creation and probing

Knupleš, Urban

dc.contributor.author	Knupleš, Urban
dc.date.accessioned	2024-10-24T10:48:44Z
dc.date.available	2024-10-24T10:48:44Z
dc.date.issued	2024	de
dc.description.abstract	Gender identity encompasses a broad spectrum that goes beyond traditional cisnormative views. In applications of pre-trained language models (PLMs), such as identity verification systems, cisnormative practices can harm individuals, for instance, misinterpreting non-binary identities as non-human (Dev et al., 2021). Considering the black-box nature of PLMs, such harmful classification raises questions about the encoded information in the model’s representations. While cisgender identity information is encoded in these representations (Lauscher et al., 2022), the (potentially biased) encoding for transgender and non-binary individuals remains unknown. In this work, we examine the encoding of gender identity information in the representations of PLMs for transgender and non-binary individuals. We first propose a corpus creation pipeline that results in the TRANsCRIPT corpus, containing text from transgender, cisgender, and non-binary individuals. We continue with a sociolinguistic analysis to investigate the differences in language use of the gender identity groups in TRANsCRIPT. Furthermore, we use TRANsCRIPT to explore the encoding of gender identity information in the representations of PLMs by applying probing techniques on their (1) frozen and (2) topic-controlled frozen representations. Finally, we fine-tune the PLMs on an explicit signal. Our findings reveal that gender identity information is encoded in the representations of PLMs for transgender, cisgender and non-binary individuals. We find that the encodings are intrinsically gender-biased. During fine-tuning, this is further amplified into gender-biased predictions. These findings highlight the harmful effects that biased representations in downstream tasks can have on transgender and non-binary individuals. Ultimately, this work highlights the importance of considering transgender and non-binary individuals in the context of developing and assessing language technologies.	en
dc.identifier.other	190693312X
dc.identifier.uri	http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-151638	de
dc.identifier.uri	http://elib.uni-stuttgart.de/handle/11682/15163
dc.identifier.uri	http://dx.doi.org/10.18419/opus-15144
dc.language.iso	en	de
dc.rights	info:eu-repo/semantics/openAccess	de
dc.subject.ddc	004	de
dc.subject.ddc	400	de
dc.title	Gender identity in language models : an inclusive approach to data creation and probing	en
dc.type	masterThesis	de
ubs.fakultaet	Informatik, Elektrotechnik und Informationstechnik	de
ubs.institut	Institut für Maschinelle Sprachverarbeitung	de
ubs.publikation.seiten	104	de
ubs.publikation.typ	Abschlussarbeit (Master)	de

Files

Original bundle

Name: Master_thesis_Knuples.pdf
Size: 2.99 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 3.3 KB
Format: Item-specific license agreed to upon submission

Collections

05 Fakultät Informatik, Elektrotechnik und Informationstechnik