Please use this identifier to cite or link to this item: http://dx.doi.org/10.18419/opus-13842
Full metadata record
DC Field  Value  Language
dc.contributor.author  Zhou, Zhenliang  -
dc.date.accessioned  2023-12-19T15:08:25Z  -
dc.date.available  2023-12-19T15:08:25Z  -
dc.date.issued  2023  de
dc.identifier.other  1876993537  -
dc.identifier.uri  http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-138616  de
dc.identifier.uri  http://elib.uni-stuttgart.de/handle/11682/13861  -
dc.identifier.uri  http://dx.doi.org/10.18419/opus-13842  -
dc.description.abstract  Transfer learning is widely used as an important machine learning method in natural language processing. To complete a specific task, a developer would normally have to train a task-specific model on the relevant datasets, which consumes a large amount of data and computing resources. Transfer learning addresses this problem: the developer first pre-trains a multi-task model on large datasets, after which only a small amount of data is needed to adapt the model to the specific task. In natural language processing, four major transfer learning methods have been proposed: adapter, BitFit, diff pruning, and full fine-tuning, which use less fine-tuning data and less training time to achieve results comparable to a single-task model. We apply these four transfer learning methods in the text-to-speech domain. We pre-train FastSpeech2 on a multi-speaker dataset so that it learns the speech characteristics of these speakers. A single-speaker training dataset is then used to fine-tune the pre-trained model to imitate that speaker's speech characteristics. After generating speech audio with the four transfer models, we compare the generated audio with the speaker's original speech and score the speech signals through objective and subjective evaluation to assess each method's performance. We find that BitFit performs best in the transfer learning experiment trained on a low-resource dataset (VCTK), while full fine-tuning suffers from overfitting, which heavily distorts the audio duration information. Moreover, the audio generated by the diff pruning model is pure noise, which indicates that diff pruning is completely unsuitable for transfer on low-resource datasets. In a comparative experiment, we use the LJSpeech (high-resource) dataset for fine-tuning; there, the adapter and full fine-tuning models achieve the best speech reconstruction. Although the voice quality of BitFit and diff pruning is inferior to that of adapter and full fine-tuning, the audio quality is not significantly reduced.  en
dc.language.iso  en  de
dc.rights  info:eu-repo/semantics/openAccess  de
dc.subject.ddc  004  de
dc.title  Evaluation of transfer learning methods in text-to-speech  en
dc.type  masterThesis  de
ubs.fakultaet  Informatik, Elektrotechnik und Informationstechnik  de
ubs.institut  Institut für Maschinelle Sprachverarbeitung  de
ubs.publikation.seiten  79  de
ubs.publikation.typ  Abschlussarbeit (Master)  de
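Of the four methods the abstract compares, BitFit is the simplest to state: all pre-trained weights are frozen and only the bias terms are updated during fine-tuning. As a rough illustration of that idea (a generic PyTorch sketch with a hypothetical placeholder model, not the thesis's FastSpeech2 code):

```python
# Minimal sketch of the BitFit idea: freeze every parameter of a
# pre-trained network except the bias terms, so only a tiny fraction
# of the parameters is updated during fine-tuning.
import torch.nn as nn

# Hypothetical stand-in for a pre-trained model.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# BitFit selection: gradients flow only to parameters named "...bias".
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(trainable, total)  # 20 bias parameters out of 212 in total
```

An optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` would then touch only those biases, which is why BitFit needs so little fine-tuning data relative to full fine-tuning.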
Appears in Collections: 05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Files in this item:
File  Description  Size  Format
master_thesis_Zhenliang Zhou.pdf    11.57 MB  Adobe PDF  View/Open


All items in this repository are protected by copyright.