Controllable text-to-speech system : speaking style control using hierarchical variational autoencoder

dc.contributor.author: Yang, Yung-Ching
dc.date.accessioned: 2024-10-25T14:28:09Z
dc.date.available: 2024-10-25T14:28:09Z
dc.date.issued: 2024
dc.description.abstract: This research proposes an utterance embedding model that provides disentangled and scalable control over latent attributes in human speech. The model is formulated as a hierarchical generative model based on the Variational Autoencoder (VAE) framework and integrated with the FastSpeech2 Text-to-Speech (TTS) system. The work demonstrates that generative networks designed for hierarchical pattern learning in images can be adapted to model complex distributions in speaking styles and prosody. It merges advances in VAE research, particularly work addressing critical statistical challenges such as posterior collapse and unbounded KL divergence, with recent studies on structural enhancements of VAE architectures. We introduce a hierarchical structure into the latent variable model and augment the learning objective with hierarchical information to ensure that the latent variables at each level are hierarchically factorized. This approach learns a smooth latent prosody space and deepens our understanding of the relationship between the hierarchical nature of prosody and neural network architecture. Through a customized control mechanism integrated into the latent spaces at different levels, the model can manipulate prosodic elements, allowing both independent and scalable adjustments. By incorporating these techniques, the model captures a wide range of prosodic variations, offering a refined level of control and expressiveness in speech synthesis in an unsupervised learning setting.
dc.identifier.other: 1906963231
dc.identifier.uri: http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-151699
dc.identifier.uri: http://elib.uni-stuttgart.de/handle/11682/15169
dc.identifier.uri: http://dx.doi.org/10.18419/opus-15150
dc.language.iso: en
dc.rights: info:eu-repo/semantics/openAccess
dc.subject.ddc: 004
dc.subject.ddc: 400
dc.title: Controllable text-to-speech system : speaking style control using hierarchical variational autoencoder
dc.type: masterThesis
ubs.fakultaet: Informatik, Elektrotechnik und Informationstechnik
ubs.institut: Institut für Maschinelle Sprachverarbeitung
ubs.publikation.seiten: 53
ubs.publikation.typ: Abschlussarbeit (Master)
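
As a concrete illustration of the hierarchical latent modeling described in the abstract above, the following is a minimal, hypothetical PyTorch sketch of a two-level latent module with a conditional prior and a per-level KL term. All module names, dimensions, and the point of integration with FastSpeech2 are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch (assumptions, not the thesis implementation): a two-level
# hierarchical latent module in the spirit of a hierarchical VAE for prosody.
# The top-level latent z1 conditions the prior of the lower-level latent z2,
# and each level contributes its own KL term to the training objective.
import torch
import torch.nn as nn


class HierarchicalLatentModule(nn.Module):
    def __init__(self, feat_dim=256, z1_dim=16, z2_dim=16):
        super().__init__()
        # Posterior q(z1 | x): inferred from an utterance-level feature vector.
        self.q_z1 = nn.Linear(feat_dim, 2 * z1_dim)
        # Posterior q(z2 | x, z1): the lower level sees both the features and z1.
        self.q_z2 = nn.Linear(feat_dim + z1_dim, 2 * z2_dim)
        # Conditional prior p(z2 | z1): ties the two levels into a hierarchy.
        self.p_z2 = nn.Linear(z1_dim, 2 * z2_dim)

    @staticmethod
    def reparameterize(mu, logvar):
        # Sample with the reparameterization trick so gradients can flow.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    @staticmethod
    def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
        # KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over latent dimensions.
        return 0.5 * torch.sum(
            logvar_p - logvar_q
            + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
            - 1.0,
            dim=-1,
        )

    def forward(self, x):
        # Top level: standard-normal prior.
        mu1, logvar1 = self.q_z1(x).chunk(2, dim=-1)
        z1 = self.reparameterize(mu1, logvar1)
        kl1 = self.gaussian_kl(
            mu1, logvar1, torch.zeros_like(mu1), torch.zeros_like(logvar1)
        )

        # Lower level: prior conditioned on z1, posterior conditioned on x and z1.
        mu2_p, logvar2_p = self.p_z2(z1).chunk(2, dim=-1)
        mu2, logvar2 = self.q_z2(torch.cat([x, z1], dim=-1)).chunk(2, dim=-1)
        z2 = self.reparameterize(mu2, logvar2)
        kl2 = self.gaussian_kl(mu2, logvar2, mu2_p, logvar2_p)

        # z1 and z2 would be injected into the TTS decoder (for example as extra
        # inputs to a FastSpeech2-style variance adaptor), and the per-level KL
        # terms would be added to the reconstruction loss.
        return z1, z2, kl1.mean() + kl2.mean()
```

In a setup like this, fixing or rescaling z1 while resampling z2 (or vice versa) is one way to obtain the kind of independent, scalable prosody adjustments the abstract describes.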

Files

Original bundle
Name: master_thesis_YungChing_Yang.pdf
Size: 4.95 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 3.3 KB
Format: Item-specific license agreed upon to submission