Gender identity in language models : an inclusive approach to data creation and probing

dc.contributor.author: Knupleš, Urban
dc.date.accessioned: 2024-10-24T10:48:44Z
dc.date.available: 2024-10-24T10:48:44Z
dc.date.issued: 2024 [de]
dc.description.abstract: Gender identity encompasses a broad spectrum that goes beyond traditional cisnormative views. In applications of pre-trained language models (PLMs), such as identity verification systems, cisnormative practices can harm individuals, for instance, misinterpreting non-binary identities as non-human (Dev et al., 2021). Considering the black-box nature of PLMs, such harmful classification raises questions about the encoded information in the model’s representations. While cisgender identity information is encoded in these representations (Lauscher et al., 2022), the (potentially biased) encoding for transgender and non-binary individuals remains unknown. In this work, we examine the encoding of gender identity information in the representations of PLMs for transgender and non-binary individuals. We first propose a corpus creation pipeline that results in the TRANsCRIPT corpus, containing text from transgender, cisgender, and non-binary individuals. We continue with a sociolinguistic analysis to investigate the differences in language use of the gender identity groups in TRANsCRIPT. Furthermore, we use TRANsCRIPT to explore the encoding of gender identity information in the representations of PLMs by applying probing techniques on their (1) frozen and (2) topic-controlled frozen representations. Finally, we fine-tune the PLMs on an explicit signal. Our findings reveal that gender identity information is encoded in the representations of PLMs for transgender, cisgender and non-binary individuals. We find that the encodings are intrinsically gender-biased. During fine-tuning, this is further amplified into gender-biased predictions. These findings highlight the harmful effects that biased representations in downstream tasks can have on transgender and non-binary individuals. Ultimately, this work highlights the importance of considering transgender and non-binary individuals in the context of developing and assessing language technologies. [en]
dc.identifier.other: 190693312X
dc.identifier.uri: http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-151638 [de]
dc.identifier.uri: http://elib.uni-stuttgart.de/handle/11682/15163
dc.identifier.uri: http://dx.doi.org/10.18419/opus-15144
dc.language.iso: en [de]
dc.rights: info:eu-repo/semantics/openAccess [de]
dc.subject.ddc: 004 [de]
dc.subject.ddc: 400 [de]
dc.title: Gender identity in language models : an inclusive approach to data creation and probing [en]
dc.type: masterThesis [de]
ubs.fakultaet: Informatik, Elektrotechnik und Informationstechnik [de]
ubs.institut: Institut für Maschinelle Sprachverarbeitung [de]
ubs.publikation.seiten: 104 [de]
ubs.publikation.typ: Abschlussarbeit (Master) [de]

Files

Original bundle

Name: Master_thesis_Knuples.pdf
Size: 2.99 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 3.3 KB
Format: Item-specific license agreed to upon submission