How to Use ESPnet: Text-to-Speech with Tacotron 2 and FastSpeech

Date: 2024-04-12 17:21:58

Text-to-speech (TTS), as the name suggests, reads text aloud: it takes written words as input and converts them into audio. TTS can help anyone who doesn't want to put in the effort to read a book, blog, or article. In this article, we will see how to build a TTS engine, assuming we know nothing about TTS to begin with.

Text-To-Speech Architecture

Figure: Our TTS architecture.

The above diagram is a simplified representation of the architecture we are going to follow. We will look at each component in detail, using the ESPnet framework for the implementation.

Front-end

Figure: Our front-end.

It has three main components:

  1. POS Tagger: performs part-of-speech (POS) tagging on the input text.

  2. Tokenizer: splits a sentence into word tokens.

  3. Pronunciation: breaks the input text into phonemes based on pronunciation, e.g. "Hello, how are you" → HH AH0 L OW, HH AW1 AA1 R Y UW1. This is done by a grapheme-to-phoneme (G2P) converter; here we use a pre-trained neural G2P model designed to convert English graphemes (spelling) into phonemes (pronunciation). Put simply, the model first looks a word up in a pronunciation dictionary, and if the word is not in the dictionary it falls back to a TensorFlow-based seq2seq model to predict the phonemes (a short usage sketch follows this list).

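A minimal sketch of this step, assuming the open-source `g2p_en` package (a pre-trained English G2P model that combines a CMUdict lookup with a neural fallback); the package and the printed output are illustrative and not necessarily the exact model used in the original notebook:

```python
# pip install g2p-en
from g2p_en import G2p

g2p = G2p()
phonemes = g2p("Hello, how are you")
print(phonemes)
# Roughly: ['HH', 'AH0', 'L', 'OW1', ' ', ',', ' ', 'HH', 'AW1', ' ', 'AA1', 'R', ' ', 'Y', 'UW1']
```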

Sequence-to-Sequence Regressor

Figure: Sequence-to-sequence regressor.

We will be using a pre-trained sequence-to-sequence regressor that takes linguistic features (phonemes) as input and outputs acoustic features (a mel-spectrogram). Here we will use Tacotron 2 (from Google) and FastSpeech (from Facebook) for this step, so let's quickly look at both of them:

Tacotron 2


Tacotron is an AI-powered speech synthesis system that converts text to speech. Tacotron 2's neural network architecture synthesizes speech directly from text, combining a convolutional neural network (CNN) with recurrent neural networks (RNNs).

FastSpeech

Figure: FastSpeech architecture — (a) feed-forward Transformer, (b) FFT block, (c) length regulator, (d) duration predictor.

(a), (b) Feed-Forward Transformer:

FastSpeech adopts a novel feed-forward Transformer structure, discarding the conventional encoder-attention-decoder framework, as shown in the figure above. The main component of the feed-forward Transformer is the FFT block (Figure (b)), which consists of self-attention and 1D convolution. FFT blocks perform the conversion from the phoneme sequence to the mel-spectrogram sequence, with N stacked blocks on the phoneme side and on the mel-spectrogram side, respectively. In between sits a length regulator, which bridges the length mismatch between the phoneme and mel-spectrogram sequences. (Note: phonemes are the small, distinct sounds of speech.)

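To make the FFT block concrete, here is a minimal PyTorch sketch (multi-head self-attention followed by 1D convolutions, each wrapped with a residual connection and layer normalization). The hyperparameters roughly follow the FastSpeech paper's defaults; this is a sketch, not ESPnet's actual implementation:

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Feed-Forward Transformer block: self-attention + 1D convolutions (sketch)."""
    def __init__(self, d_model=384, n_heads=2, d_conv=1536, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Two 1D convolutions over the time axis replace the usual position-wise FFN.
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_conv, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(d_conv, d_model, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                    # x: (batch, time, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)                          # residual + layer norm
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + conv_out)                       # residual + layer norm

# N such blocks are stacked on the phoneme side and on the mel-spectrogram side.
phoneme_hidden = torch.randn(1, 12, 384)                      # 12 phonemes, hidden size 384
print(FFTBlock()(phoneme_hidden).shape)                       # torch.Size([1, 12, 384])
```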

(c) Length Regulator:

The model's length regulator is shown in the figure above. Because the phoneme sequence is shorter than the mel-spectrogram sequence, one phoneme corresponds to several mel-spectrogram frames. The number of frames aligned to a phoneme is called the phoneme duration. The length regulator expands the hidden phoneme sequence according to these durations so that it matches the length of the mel-spectrogram sequence. We can increase or decrease the phoneme durations proportionally to adjust the voice speed, and we can also change the duration of blank tokens to adjust the breaks between words, controlling part of the prosody.

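At its core, the length regulator just repeats each phoneme's hidden vector according to its (optionally scaled) duration. A minimal sketch, assuming hard integer durations:

```python
import torch

def length_regulate(hidden, durations, speed=1.0):
    """Expand phoneme hidden states to mel-spectrogram length (sketch).

    hidden:    (num_phonemes, d_model) output of the phoneme-side FFT blocks
    durations: (num_phonemes,) number of mel frames aligned to each phoneme
    speed:     > 1.0 speeds speech up (shorter durations), < 1.0 slows it down
    """
    scaled = torch.clamp(torch.round(durations.float() / speed).long(), min=0)
    # Repeat each phoneme's hidden vector `scaled[i]` times along the time axis.
    return torch.repeat_interleave(hidden, scaled, dim=0)

# 3 phonemes with durations [2, 3, 1] expand to 2 + 3 + 1 = 6 mel frames.
h = torch.randn(3, 384)
d = torch.tensor([2, 3, 1])
print(length_regulate(h, d).shape)             # torch.Size([6, 384])
print(length_regulate(h, d, speed=0.5).shape)  # half speed -> torch.Size([12, 384])
```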

(d) Duration Predictor:

The duration predictor is critical: it is what allows the length regulator to determine the duration of each phoneme. As shown in the figure above, it consists of a two-layer 1D convolution followed by a linear layer that predicts the duration. The duration predictor sits on top of the FFT blocks on the phoneme side and is trained jointly with FastSpeech using a mean squared error (MSE) loss. The ground-truth phoneme durations are extracted from the encoder-decoder attention alignment of an autoregressive teacher model.

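A minimal sketch of such a duration predictor (two 1D convolutions plus a linear output layer, trained with an MSE loss against the teacher-derived durations). Layer sizes are illustrative, not ESPnet's exact configuration:

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Two 1D convolution layers + a linear layer, predicting one duration per phoneme."""
    def __init__(self, d_model=384, d_hidden=256, kernel_size=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(d_hidden, 1)

    def forward(self, x):                                    # x: (batch, time, d_model)
        h = torch.relu(self.conv1(x.transpose(1, 2)))         # conv over the time axis
        h = self.dropout(self.norm1(h.transpose(1, 2)))
        h = torch.relu(self.conv2(h.transpose(1, 2)))
        h = self.dropout(self.norm2(h.transpose(1, 2)))
        return self.linear(h).squeeze(-1)                     # (batch, time) durations

# Trained jointly with FastSpeech via MSE against durations extracted from the
# teacher model's encoder-decoder attention alignment.
predictor = DurationPredictor()
durations = predictor(torch.randn(1, 12, 384))                # one value per phoneme
print(durations.shape)                                        # torch.Size([1, 12])
```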

Waveform Generator / Vocoder

Figure: Vocoder.

We will be using a pre-trained model that takes acoustic features (a mel-spectrogram) as input and outputs a waveform (audio). Here we will use the Parallel WaveGAN vocoder, in which a generative adversarial network (GAN) architecture is used to generate waveforms from mel-spectrograms; more about this architecture can be found here.

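A hedged usage sketch, assuming the open-source `parallel_wavegan` package and a pre-trained checkpoint; the checkpoint file name is hypothetical, and in practice the input mel-spectrogram must be normalized with the same statistics used during vocoder training:

```python
# pip install parallel_wavegan
import torch
from parallel_wavegan.utils import load_model

# Load a pre-trained Parallel WaveGAN generator (checkpoint path is hypothetical).
vocoder = load_model("checkpoint-400000steps.pkl")
vocoder.remove_weight_norm()
vocoder.eval()

# Mel-spectrogram from Tacotron 2 / FastSpeech: (frames, mel bins).
mel = torch.randn(100, 80)

with torch.no_grad():
    wav = vocoder.inference(mel).view(-1)   # flattened waveform samples
print(wav.shape)
```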

Implementation

We have implemented the above architecture using the ESPnet framework, which provides a convenient structure for loading all of the above pre-trained models and integrating them. Here is the notebook with the complete text-to-speech implementation.

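For reference, a minimal end-to-end sketch with the ESPnet2 inference API; the model tag is an example from espnet_model_zoo and may differ from the models used in the linked notebook:

```python
# pip install espnet espnet_model_zoo soundfile
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# Example model tag; a neural vocoder (e.g. Parallel WaveGAN) can also be attached
# via the `vocoder_tag` argument instead of the default Griffin-Lim reconstruction.
tts = Text2Speech.from_pretrained(model_tag="kan-bayashi/ljspeech_fastspeech2")

out = tts("Text to speech reads written words aloud.")
sf.write("out.wav", out["wav"].numpy(), tts.fs)
```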

Conclusion

We have implemented a neural TTS system using various pre-trained models, including Tacotron 2, FastSpeech, and Parallel WaveGAN. We can further try out other models that might produce even better results.

Translated from: https://towardsdatascience.com/text-to-speech-with-tacotron-2-and-fastspeech-using-espnet-3a711131e0fa
