Triple M: A Practical Text-to-speech System With Multi-guidance Attention And Multi-band Multi-time LPCNet
Note: The test sentences do not appear in the training set; they are converted into the system's input (Chinese Pinyin with tone and pause marks) by the pre-trained front-end model.
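The paper's pre-trained front-end model is not public, so the sketch below only illustrates the *shape* of its output (Pinyin-with-tone tokens plus a pause mark); the toy lexicon, the `#1` pause symbol, and the pause placement are all assumptions, not the authors' design.

```python
# Hypothetical toy lexicon standing in for the pre-trained front-end model;
# a real front-end also predicts prosodic pauses from context.
PINYIN = {"你": "ni3", "好": "hao3", "世": "shi4", "界": "jie4"}

def front_end(text: str, pause: str = "#1") -> str:
    """Convert raw Chinese text to Pinyin-with-tone tokens and append a
    sentence-final pause marker (placement here is an assumption)."""
    tokens = [PINYIN[ch] for ch in text if ch in PINYIN]
    return " ".join(tokens + [pause])

print(front_end("你好"))  # -> ni3 hao3 #1
```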
Part1: Audio samples synthesized by the base text-to-feature model and the multi-guidance text-to-feature model (both using the original LPCNet as the vocoder).
Base text2feature model | Multi-guidance text2feature model
(audio sample 1) | (audio sample 1)
(audio sample 2) | (audio sample 2)
(audio sample 3) | (audio sample 3)
(audio sample 4) | (audio sample 4)
Part2: Long sentences synthesized by the base text-to-feature model (failure cases), a single GMM-based attention model (for comparison), and the multi-guidance text-to-feature model (using the original LPCNet as the vocoder).
Base text2feature model | Single GMM-based attention model | Multi-guidance text2feature model
(audio sample 1) | (audio sample 1) | (audio sample 1)
(audio sample 2) | (audio sample 2) | (audio sample 2)
(audio sample 3) | (audio sample 3) | (audio sample 3)
(audio sample 4) | (audio sample 4) | (audio sample 4)
(audio sample 5) | (audio sample 5) | (audio sample 5)
Part3: Audio samples synthesized by the original LPCNet and the multi-band multi-time LPCNet (both using the multi-guidance text-to-feature model).
Original LPCNet | Multi-band multi-time LPCNet
(audio sample 1) | (audio sample 1)
(audio sample 2) | (audio sample 2)
(audio sample 3) | (audio sample 3)
(audio sample 4) | (audio sample 4)
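A back-of-envelope sketch of why a multi-band multi-time vocoder is faster than the original sample-by-sample LPCNet: with B subbands each running at 1/B of the sample rate, and T samples predicted per network step, the number of network invocations per second of audio drops by roughly B*T. The concrete values below (16 kHz, B = T = 2) are illustrative assumptions, not the paper's configuration.

```python
# Assumed values for illustration only.
fs = 16000        # output sample rate (Hz)
bands = 2         # multi-band: subbands sampled at fs / bands
times = 2         # multi-time: samples predicted per network step

steps_original = fs                    # original LPCNet: one step per sample
steps_mbmt = fs // (bands * times)     # multi-band multi-time steps per second

print(steps_original // steps_mbmt)    # -> 4  (i.e. 4x fewer network steps)
```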