Neural-based NLP systems for code-switched Arabic-English speech
Abstract
In the ever-evolving language landscape, code-switching has emerged as an interesting linguistic phenomenon in which people seamlessly alternate between multiple languages within the same discourse. The global prevalence of this phenomenon has created a need for language technologies that can handle it proficiently, with the aim of providing user-friendly solutions. Despite this need, progress in language technologies, and in NLP research in general, still lags behind in code-switching contexts compared to the remarkable strides achieved for monolingual settings. This disparity is also evident in diglossic languages such as Arabic, where language technologies are better supported for the formal variant of the language than for the dialects. This motivates our work, in which we focus on the under-researched code-switched Egyptian Arabic-English language pair. This language pair offers an interesting set of challenges: the complexity posed by code-switching is further compounded by challenges introduced by the primary language, including scarce resources, morphological richness, and unstandardized orthography. Under this language setup, we tackle challenges in three dimensions: data collection, modeling, and evaluation. With regard to data collection, we collect a code-switched Egyptian Arabic-English speech translation corpus. The corpus consists of 12 hours of spontaneous speech gathered from bilingual speakers and contains a considerable amount of code-switching. As part of this work, we develop transcription and translation guidelines. Our ArzEn-ST corpus can be used for speech recognition, machine translation, speech translation, and linguistic analyses. We make the corpus publicly available to enable and facilitate further research on this language pair. With regard to modeling, we explore challenges and solutions in building machine translation and automatic speech recognition systems.
Firstly, we compare two widely used architectures for building speech recognition systems, namely hybrid and end-to-end architectures. We present a thorough comparison of both systems with regard to their multilingual and crosslingual knowledge transfer abilities and their tolerance of unstandardized orthography. We show that the two systems provide comparable yet complementary performance, and we therefore show that combining their hypotheses improves recognition. Secondly, we tackle the issue of data sparsity through segmentation, investigating the best segmentation approach for code-switched machine translation under different levels of low-resource settings. Thirdly, we present a comprehensive study of data augmentation, examining the relation between the quality of synthesized code-switched data and the improvements achieved in downstream tasks. Our experiments involve a wide range of techniques, covering lexical replacements, linguistic theories, and back-translation. As part of our contribution, we examine the effectiveness of using a code-switching predictive model that identifies plausible code-switching segments and augments the data accordingly. We also propose several steps for boosting the amount of code-switched data generated through back-translation, an amount that is usually restricted by the limited availability of code-switched data. With regard to evaluation, we focus on the question of robust and fair evaluation metrics for speech recognition when dealing with code-switched and orthographically unstandardized languages, both of which are challenges present in the language pair we study. We conduct an extensive study comparing the performance of a wide range of metrics against human judgment. Through our metrics, we overcome cross-transcription and unstandardized orthography issues by bringing the hypotheses and references into one shared space of orthography, phonology, or semantics.
Through the proposed techniques, we achieve higher correlation with human judgment, outperforming the metrics currently in wide use. Finally, we believe that the findings of this thesis, the linguistic analyses presented, and the corpora collected can help in reaching a better understanding of this code-switched language pair and can contribute towards advancing language technologies to better accommodate code-switching, with the overall aim of providing better-performing language technologies with more human-like communication.
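As an illustration of the evaluation idea described in the abstract (bringing hypotheses and references into a shared orthographic space before scoring), the following is a minimal Python sketch. It is not the metric proposed in the thesis: the normalization table, the function names `normalize`, `edit_distance`, and `wer`, and the example sentences are all assumptions chosen for illustration. It only shows how collapsing common Arabic spelling variants and case differences can change a word error rate computed on code-switched text.

```python
import re

# Hypothetical normalization table (an assumption, not taken from the thesis):
# collapse a few common Arabic spelling variants into a single form.
_CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> yaa
    "\u0629": "\u0647",  # taa marbuta           -> haa
})
# Arabic short-vowel and related diacritics (fathatan through sukun).
_DIACRITICS = re.compile("[\u064B-\u0652]")


def normalize(text: str) -> list[str]:
    """Map text into a shared orthographic space and split it into words."""
    text = _DIACRITICS.sub("", text).translate(_CHAR_MAP).lower()
    return text.split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str, normalized: bool = True) -> float:
    """Word error rate, optionally computed after orthographic normalization."""
    r = normalize(ref) if normalized else ref.split()
    h = normalize(hyp) if normalized else hyp.split()
    if not r:
        raise ValueError("reference must contain at least one word")
    return edit_distance(r, h) / len(r)


if __name__ == "__main__":
    # Invented example: same utterance, different but acceptable spellings.
    ref = "أنا عملت ال presentation في المكتبة"
    hyp = "انا عملت ال Presentation في المكتبه"
    print(wer(ref, hyp, normalized=False))
    print(wer(ref, hyp, normalized=True))
```

In this invented example the hypothesis differs from the reference only in spelling conventions, so the raw score penalizes three of six words while the normalized score penalizes none. A phonological or semantic shared space, as mentioned in the abstract, would require transliteration or embedding models that this sketch does not attempt.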