Content-aware text-to-speech with prompt-based prosody control

dc.contributor.author: Bott, Thomas
dc.date.accessioned: 2024-03-27T11:18:07Z
dc.date.available: 2024-03-27T11:18:07Z
dc.date.issued [de]: 2023
dc.description.abstract [en]: This thesis proposes a text-to-speech system that is conditioned on sentence embeddings extracted from natural language prompts in order to make the prosodic parameters of generated speech controllable in an intuitive and effective way. The system builds on a transformer-based TTS architecture and provides benefits regarding speed, data efficiency, robustness and controllability. The proposed integration scheme essentially concatenates speaker and sentence embeddings, modeling interdependencies between them before feeding the joint representation into the model. Furthermore, a training strategy is developed that operates on merged emotional speech and text datasets and varies prompts in each iteration, increasing the generalization capabilities of the model and reducing the risk of overfitting. Extensive objective and subjective evaluations on utterances generated from sentences of emotional text datasets demonstrate the prompting capabilities of the conditioned TTS system. It achieves high prosodic controllability, whereby the emotional content of provided prompts is transferred accurately to generated speech. At the same time, the system maintains precise tractability of speaker identities as well as overall high speech quality and intelligibility. Besides a high correlation between prompts and speech prosody, fine-tuning the sentence embedding extractor has been found to be crucial. The proposed TTS system is limited with regard to modeling unseen speakers, intensities and multiple languages.
dc.identifier.other: 1884955134
dc.identifier.uri [de]: http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-141531
dc.identifier.uri: http://elib.uni-stuttgart.de/handle/11682/14153
dc.identifier.uri: http://dx.doi.org/10.18419/opus-14134
dc.language.iso [de]: en
dc.rights [de]: info:eu-repo/semantics/openAccess
dc.subject.ddc [de]: 004
dc.subject.ddc [de]: 400
dc.title [en]: Content-aware text-to-speech with prompt-based prosody control
dc.type [de]: masterThesis
ubs.fakultaet [de]: Informatik, Elektrotechnik und Informationstechnik
ubs.institut [de]: Institut für Maschinelle Sprachverarbeitung
ubs.publikation.seiten [de]: 128
ubs.publikation.typ [de]: Abschlussarbeit (Master)

Files

Original bundle

Name: thesis_bott.pdf
Size: 4.77 MB
Format: Adobe Portable Document Format
