This page is the demo of
“Unified Source-Filter GAN: Unified Source-Filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN” [paper] [code]
Abstract
We propose a unified approach to data-driven source-filter modeling that uses a single neural network to build a neural vocoder capable of generating high-quality synthetic speech waveforms while retaining the flexibility of the source-filter model to control voice characteristics. The proposed network, called the unified source-filter generative adversarial network (uSFGAN), is developed by factorizing the quasi-periodic parallel WaveGAN (QPPWG), a neural vocoder based on a single neural network, into a source excitation generation network and a vocal tract resonance filtering network by additionally applying a regularization loss. Moreover, inspired by the neural source-filter (NSF) model, only a sinusoidal waveform is additionally used as the simplest cue for generating a periodic source excitation waveform while minimizing the effect of approximations in the source-filter model. The experimental results demonstrate that uSFGAN outperforms conventional neural vocoders, such as QPPWG and NSF, in both speech quality and pitch controllability.
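The sinusoidal input mentioned in the abstract can be illustrated with a minimal sketch. It assumes a frame-level F0 contour with 0 marking unvoiced frames, and builds the sine by phase accumulation. The function name `sine_from_f0`, the 16 kHz sampling rate, the 80-sample hop, and the zeroing of unvoiced regions are illustrative assumptions, not details taken from the paper or its code. The `f0_scale` argument mimics the 0.5×, 1.0×, and 2.0× F0 conditions used in the demo sounds.

```python
import numpy as np

def sine_from_f0(f0_frames, fs=16000, hop_size=80, f0_scale=1.0):
    """Synthesize a sine wave that follows a frame-level F0 contour.

    fs, hop_size, and the handling of unvoiced frames are assumptions
    made for this sketch only.
    """
    # Scale F0 (e.g. 0.5, 1.0, 2.0) and hold each frame value for hop_size samples.
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64) * f0_scale, hop_size)
    # Accumulate the instantaneous phase so the sine stays continuous when F0 changes.
    phase = 2.0 * np.pi * np.cumsum(f0 / fs)
    sine = np.sin(phase)
    # Unvoiced frames (F0 == 0) carry no periodic component in this sketch.
    sine[f0 <= 0] = 0.0
    return sine

if __name__ == "__main__":
    contour = [0.0, 100.0, 110.0, 120.0]  # hypothetical F0 values in Hz
    for scale in (0.5, 1.0, 2.0):
        wave = sine_from_f0(contour, f0_scale=scale)
        print(scale, wave.shape)
```

Doubling `f0_scale` halves the period of the generated sine, which is the kind of cue a vocoder can follow when the requested pitch differs from the natural one.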
Corpus and references:
CMU-ARCTIC
NSF
NSF_demo
PWG
QPPWG
QPPWG_demo
Architecture of uSFGAN
Generator of uSFGAN
Pitch-dependent dilated convolution (see QPPWG for details)
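As a rough illustration of pitch-dependent dilation, the sketch below assumes the formulation usually attributed to QPPWG: a per-sample dilation factor E_t = Fs / (F0_t × a), with a "dense factor" a, multiplied by the layer's base dilation. The function names, the default values, the choice of factor 1 for unvoiced samples, the rounding to integers, and the edge clipping are all assumptions of this sketch rather than details confirmed by this page.

```python
import numpy as np

def pitch_dependent_dilation(f0, fs=16000, dense_factor=4, base_dilation=1):
    """Per-sample dilation size d'_t = E_t * d, with E_t = fs / (f0_t * dense_factor).

    Unvoiced samples (f0 == 0) fall back to E_t = 1 (an assumption).
    """
    f0 = np.asarray(f0, dtype=np.float64)
    factor = np.ones_like(f0)
    voiced = f0 > 0
    factor[voiced] = fs / (f0[voiced] * dense_factor)
    # Round to whole samples and never go below a dilation of 1.
    return np.maximum(1, np.rint(factor * base_dilation)).astype(np.int64)

def gather_taps(x, dilation):
    """Pick the past, current, and future samples a 3-tap non-causal
    dilated convolution would see at each time step (edges are clipped)."""
    x = np.asarray(x)
    t = np.arange(len(x))
    past = np.clip(t - dilation, 0, len(x) - 1)
    future = np.clip(t + dilation, 0, len(x) - 1)
    return x[past], x[t], x[future]

if __name__ == "__main__":
    f0 = np.array([0.0, 100.0, 200.0])  # hypothetical sample-level F0 in Hz
    print(pitch_dependent_dilation(f0))
```

With these assumed defaults, a lower F0 gives a larger dilation, so the receptive field of an adaptive block stretches to cover roughly the same number of pitch cycles regardless of pitch.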
Demo Sounds
- Conditioned on 1.0×F0
Vocoder | Female (clb) | Male (bdl) |
---|---|---|
Natural | ||
WORLD *1 | ||
NSF *2 | ||
QPPWG_20 *3 | ||
uSFGAN_60 *4 | ||
*1. WORLD: WORLD vocoder (Baseline I)
*2. NSF: Neural Source-Filter vocoder of hn-sinc-nsf9 (Baseline II)
*3. QPPWG_20: QPPWG vocoder with 10 adaptive blocks + 10 fixed blocks (Baseline III)
*4. uSFGAN_60: uSFGAN vocoder with a source network of 30 adaptive blocks + a filter network of 30 fixed blocks
- Conditioned on 0.5×F0
Vocoder | Female (clb) | Male (bdl) |
---|---|---|
WORLD | ||
NSF | ||
QPPWG_20 | ||
uSFGAN_60 | ||
- Conditioned on 2.0×F0
Vocoder | Female (clb) | Male (bdl) |
---|---|---|
WORLD | ||
NSF | ||
QPPWG_20 | ||
uSFGAN_60 | ||
Ablation Study
- Comparison of waveforms and spectrograms of output source signals with (left) and without (right) the spectral envelope regularization loss
![](res/figure/ablation.jpg)
- Comparison of the sound of the output source signals with (w/ Lr) and without (w/o Lr) the regularization loss
Condition | Female (clb) | Male (bdl) |
---|---|---|
1.0×F0 w/ Lr | ||
0.5×F0 w/ Lr | ||
2.0×F0 w/ Lr | ||
1.0×F0 w/o Lr | ||
0.5×F0 w/o Lr | ||
2.0×F0 w/o Lr | ||
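To make the idea of a spectral envelope regularization more concrete, the sketch below shows one plausible way to score how far a source signal's spectral envelope is from flat. It estimates a smooth log envelope per frame by cepstral liftering and penalizes its variation across frequency. This is an assumption-laden illustration, not the loss defined in the paper: the function names, frame sizes, lifter length, and the mean-removal step are all hypothetical choices.

```python
import numpy as np

def log_envelope(frames, n_fft=512, n_lifter=20):
    """Smooth log-magnitude envelope of each frame via cepstral liftering."""
    window = np.hanning(frames.shape[-1])
    mag = np.abs(np.fft.rfft(frames * window, n_fft, axis=-1))
    log_mag = np.log(np.maximum(mag, 1e-7))
    # Real cepstrum of the log-magnitude spectrum.
    cep = np.fft.irfft(log_mag, n_fft, axis=-1)
    # Keep only low quefrencies (and their mirrored counterparts).
    lifter = np.zeros(n_fft)
    lifter[:n_lifter] = 1.0
    lifter[-(n_lifter - 1):] = 1.0
    return np.fft.rfft(cep * lifter, n_fft, axis=-1).real

def envelope_flatness_penalty(signal, frame_len=400, hop=80):
    """Mean squared deviation of the log envelope from its per-frame mean.

    A signal with a flat spectral envelope scores close to zero.
    """
    signal = np.asarray(signal, dtype=np.float64)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] for s in starts])
    env = log_envelope(frames)
    env = env - env.mean(axis=-1, keepdims=True)  # ignore overall gain
    return float(np.mean(env ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    white = rng.standard_normal(16000)                       # flat envelope
    colored = np.convolve(white, np.ones(8) / 8, mode="same")  # low-pass shaped
    print(envelope_flatness_penalty(white), envelope_flatness_penalty(colored))
```

In this sketch white noise receives a much lower penalty than the low-pass filtered noise, which mirrors the intent of pushing the source network's output toward a spectrally flat excitation and leaving resonance shaping to the filter network.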
Subjective Results
- MOS results of speech quality
![](res/figure/MOS.jpg)
- XAB results of pitch accuracy
![](res/figure/XAB.jpg)