Keras Deep Learning in Practice (41): Speech Recognition

0. Preface

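Automatic speech recognition (ASR), also called speech-to-text transcription, makes sound "readable": it lets a computer "understand" human speech and act on it, and it is one of the key technologies behind human-computer interaction in artificial intelligence. In the section "Image Caption Generation" we learned how to transcribe images of handwritten text into text. In this section we use a similar end-to-end model to transcribe audio files into text.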

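1. Model and Dataset Analysis

1.1 Dataset Analysis

To build the speech-to-text model, we use a dataset containing roughly 29,000 audio files and their corresponding transcripts. The dataset can be downloaded from openslr. After downloading and extracting it, the train-clean-100 folder contains a number of subdirectories, each holding several audio files and the matching transcript data.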


1.2 Model Analysis

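Before implementing speech-to-text, here is a brief outline of the transcription strategy the model follows:

Download a dataset containing audio files and their corresponding transcripts (the ground truth)
Specify the sampling rate when reading the audio files:
- With a sampling rate of 16000, each second of audio yields 16000 samples
Extract the fast Fourier transform (FFT) of the audio sequence:
- Using the FFT lets us keep only the most important features of the signal
- For a real-valued input of n samples, the FFT yields about n / 2 values; np.fft.rfft, used below, returns n / 2 + 1, that is 161 values for a 320-sample window
Sample the FFT features from windows of 320 samples each; that is, each window covers 20 ms of audio (320/16000 = 1/50 second)
In addition, a 20 ms window is taken every 10 ms
To reduce model complexity, and for demonstration purposes, this section uses only recordings shorter than 10 seconds
Store each 20 ms window in an array:
- A 20 ms window is taken every 10 ms
- Thus a one-second clip gives about 100 x 320 samples, and a 10-second clip gives about 1000 x 320 = 320000 samples
Initialize an empty array of 160000 samples (10 seconds at 16000 Hz), copy the audio into it, and compute the FFT of each window; as noted above, each window's FFT has roughly half as many values as the window has samples
For each array of about 1000 x 320 samples, store the corresponding transcript
Assign an index to each character, then convert the output into a list of indices
Also store the input length as the predefined number of time steps, and the transcript length as the actual number of characters in the output
Define the CTC loss from the actual output, the predicted output, the number of time steps (input length), and the transcript length (number of output characters)
Define the model, which combines Conv1D and GRU layers and uses batch normalization to normalize the data and help avoid vanishing gradients
Train the model on mini-batches: at each step, randomly sample a batch and feed it to the model to minimize the CTC loss
Finally, decode the model's predictions on test samples with the ctc_decode method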
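
The framing arithmetic in the steps above can be checked with a short NumPy sketch. It uses a synthetic 10-second signal (random noise) rather than a real recording, and the names `SAMPLE_RATE`, `WINDOW`, `HOP`, and `frame_fft` are illustrative, not part of the original code.

```python
import numpy as np

SAMPLE_RATE = 16000   # samples per second
WINDOW = 320          # 20 ms window (320 / 16000 s)
HOP = 160             # a new window starts every 10 ms

def frame_fft(signal):
    """Split a 1D signal into overlapping windows and return FFT magnitudes."""
    n_frames = len(signal) // HOP - 1
    frames = np.stack([signal[i * HOP : i * HOP + WINDOW] for i in range(n_frames)])
    # rfft of a real window of n samples returns n // 2 + 1 frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))

# Synthetic stand-in for a 10-second recording (assumption: random noise).
signal = np.random.default_rng(0).standard_normal(10 * SAMPLE_RATE)
features = frame_fft(signal)
print(features.shape)  # (999, 161)
```

The resulting shape, 999 frames of 161 values each, is the input shape the model in section 2.2 expects.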

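2. Speech Recognition Model

Next, we implement the speech recognition model discussed in the previous subsection.

2.1 Data Loading and Preprocessing

(1) First, import the required packages, then walk through all audio files in the dataset and their corresponding transcripts, storing them in lists: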

import librosa
import numpy as np
import os
import re
import random
from matplotlib import pyplot as plt

org_path = 'train-clean-100/LibriSpeech/train-clean-100/'
count = 0
inp = []
k = 0
audio_name = []
audio_trans = []
for dir1 in os.listdir(org_path):
    dir2_path = org_path + dir1 + '/'
    for dir2 in os.listdir(dir2_path):
        dir3_path = dir2_path + dir2 + '/'
        for audio in os.listdir(dir3_path):
            if audio.endswith('.txt'):
                k += 1
                file_path = dir3_path + audio
                with open(file_path) as f:
                    lines = f.readlines()
                    for line in lines:
                        audio_name.append(dir3_path + line.split()[0] + '.flac')
                        words2 = line.split()[1:]
                        words3 = ' '.join(words2)
                        audio_trans.append(words3)
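
As an alternative sketch, the same traversal can be written with `os.walk` and `os.path.join`, which does not depend on the directory depth or on trailing slashes. It assumes, as the loop above does, that each line of a `.txt` file starts with an utterance ID followed by the transcript. The function name `collect_transcripts` is hypothetical.

```python
import os

def collect_transcripts(root):
    """Return parallel lists of audio paths and transcripts found under root."""
    audio_paths, transcripts = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith('.txt'):
                continue
            with open(os.path.join(dirpath, filename), encoding='utf-8') as f:
                for line in f:
                    parts = line.split()
                    if not parts:
                        continue  # skip blank lines
                    # First token is the utterance ID, the rest is the text.
                    audio_paths.append(os.path.join(dirpath, parts[0] + '.flac'))
                    transcripts.append(' '.join(parts[1:]))
    return audio_paths, transcripts
```

Calling `collect_transcripts(org_path)` would then return lists equivalent to `audio_name` and `audio_trans`, although possibly in a different order.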

(2) Store the transcript lengths in a list so that we can find the maximum transcript length:

len_audio_name = []
for i in range(len(audio_name)):
    tt = re.sub(' ','-',audio_trans[i])
    len_audio_name.append(len(tt))

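(3) So that the model can be trained on a single GPU, we use only audio files whose transcript is shorter than 100 characters (for a better-performing model, longer transcripts can be included if GPU memory allows, which enlarges the training set):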

final_audio_name = []
final_audio_trans = []
for i in range(len(audio_name)):
    if(len_audio_name[i]<100):
        final_audio_name.append(audio_name[i])
        final_audio_trans.append(audio_trans[i])

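The code above keeps the audio file names and transcripts only for recordings whose transcript is shorter than 100 characters.

(4) Store the input as a 2D array, and store the corresponding output only for audio files shorter than 10 seconds: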

inp = []
inp2 = []
op = []
op2 = []

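for j in range(len(final_audio_name)):
    # librosa.load returns (samples, sampling_rate); resample to 16 kHz mono
    t = librosa.load(final_audio_name[j], sr=16000, mono=True)
    # Keep only recordings shorter than 10 seconds (160000 samples)
    if(t[0].shape[0]<160000):
        t = np.array(t[0])
        # Zero-pad every recording to exactly 10 seconds
        t2 = np.zeros(160000)
        t2[:len(t)] = t
        inp = []
        # 999 windows of 320 samples, one every 160 samples
        for i in range(t2.shape[0]//160-1):
            k = t2[(i*160):((i*160)+320)]
            fft = np.fft.rfft(k)
            # Magnitude spectrum: 161 values per window
            inp.append(np.abs(fft))
        inp2.append(inp)
        op2.append(final_audio_trans[j])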

(5) Create an index for each distinct character in the data:

import itertools
list2d = op2
charList = list(set(list(itertools.chain(*list2d))))
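
One thing to keep in mind is that the order of `list(set(...))` is not guaranteed to be the same across Python runs, so indices learned in one session may not line up with a list rebuilt in another. A minimal sketch of a deterministic alternative, using a sorted character list, is shown below. The function names and the sample transcripts are illustrative and not part of the original code.

```python
import itertools

def build_vocab(transcripts):
    """Return a sorted list of the distinct characters in the transcripts."""
    return sorted(set(itertools.chain.from_iterable(transcripts)))

def encode(text, vocab):
    """Map each character to its index in the vocabulary."""
    char_to_index = {ch: i for i, ch in enumerate(vocab)}
    return [char_to_index[ch] for ch in text]

def decode(indices, vocab):
    """Map indices back to characters."""
    return ''.join(vocab[i] for i in indices)

sample = ['HELLO WORLD', "DON'T GO"]   # hypothetical transcripts
vocab = build_vocab(sample)
ids = encode('HELLO', vocab)
print(vocab)
print(ids, decode(ids, vocab))
```

Sorting makes the mapping reproducible; saving the list to disk alongside the model weights would serve the same purpose.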

(6) Create NumPy arrays to store the input lengths and transcript lengths. The input length is set to 243, so the model built later will also output 243 time steps:

num_audio = len(op2)
y2 = []
input_lengths = np.ones((num_audio,1))*243
label_lengths = np.zeros((num_audio,1))
for i in range(num_audio):
    val = list(map(lambda x: charList.index(x), op2[i]))
    while len(val)<243:
        val.append(len(charList)+1)
    y2.append(val)
    label_lengths[i] = len(op2[i])
    input_lengths[i] = 243
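
The value 243 follows from the two strided convolutions in the model defined in section 2.2 (kernel size 11, stride 2, 'valid' padding) applied to 999 input frames. The small helper below, whose name is illustrative, reproduces that arithmetic.

```python
def conv1d_valid_length(length, kernel_size, stride):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

frames = 999
after_conv1 = conv1d_valid_length(frames, kernel_size=11, stride=2)
after_conv2 = conv1d_valid_length(after_conv1, kernel_size=11, stride=2)
print(after_conv1, after_conv2)  # 495 243
```

If the kernel size, stride, or number of input frames changes, the input length passed to the CTC loss has to change with it.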

2.2 Model Construction and Training

(1) Define the CTC loss function:

import keras.backend as K
def ctc_loss(args):
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)
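
To give an idea of what the CTC loss measures, the sketch below computes the negative log-likelihood of a label sequence with the standard CTC forward recursion in plain NumPy. It is an illustration for a single short sequence, not the implementation Keras uses: it works in probability space, so it would underflow on long inputs, and it assumes the blank symbol is the last class. The function name is hypothetical.

```python
import numpy as np

def ctc_negative_log_likelihood(probs, label, blank):
    """CTC loss for one sequence.

    probs: array of shape (T, C) with per-time-step class probabilities.
    label: list of class indices (without blanks).
    blank: index of the blank class.
    """
    T = probs.shape[0]
    # Extended label: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S = len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # Skipping the blank between two labels is allowed only when
            # the two labels differ.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]

    total = alpha[T - 1, S - 1]
    if S > 1:
        total += alpha[T - 1, S - 2]
    return -np.log(total)

# Two time steps, three classes (index 2 is the blank), uniform probabilities.
probs = np.full((2, 3), 1.0 / 3.0)
loss = ctc_negative_log_likelihood(probs, [0], blank=2)
print(loss)
```

In this toy case three of the nine possible paths (class 0 twice, class 0 then blank, blank then class 0) collapse to the label, so the loss equals ln 3.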

(2) Define the speech recognition model:

from keras.layers import Input, BatchNormalization, Conv1D, GRU, concatenate
from keras.layers import TimeDistributed, Dense, Activation, Lambda
from keras.models import Model

input_data = Input(name='the_input', shape = (999,161), dtype='float32')
inp = BatchNormalization(name="inp")(input_data)
conv= Conv1D(filters=220, kernel_size = 11,strides = 2, padding='valid',activation='relu')(inp)
conv = BatchNormalization(name="Normal0")(conv)
conv1= Conv1D(filters=220, kernel_size = 11,strides = 2, padding='valid',activation='relu')(conv)
conv1 = BatchNormalization(name="Normal1")(conv1)
gru_3 = GRU(512, return_sequences = True, name = 'gru_3')(conv1)
gru_4 = GRU(512, return_sequences = True, go_backwards = True, name = 'gru_4')(conv1)
merged = concatenate([gru_3, gru_4])
normalized = BatchNormalization(name="Normal")(merged)
dense = TimeDistributed(Dense(30))(normalized)
y_pred = TimeDistributed(Activation('softmax', name='softmax'))(dense)
Model(inputs = input_data, outputs = y_pred).summary()

(3) Define the optimizer and the input and output arguments of the CTC loss function:

from keras.optimizers import Adam
optimizer = Adam(learning_rate = 0.001)  # older Keras versions name this argument lr
labels = Input(name = 'the_labels', shape=[243], dtype='float32')
input_length = Input(name='input_length', shape=[1],dtype='int64')
label_length = Input(name='label_length',shape=[1],dtype='int64')
output = Lambda(ctc_loss, output_shape=(1,),name='ctc')([y_pred, labels, input_length, label_length])

(4) Build and compile the model:

model = Model(inputs = [input_data, labels, input_length, label_length], outputs= output)
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer = optimizer, metrics = ['acc'])

A summary of the model architecture is printed as follows:

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
the_input (InputLayer)          [(None, 999, 161)]   0                                            
__________________________________________________________________________________________________
inp (BatchNormalization)        (None, 999, 161)     644         the_input[0][0]                  
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 495, 220)     389840      inp[0][0]                        
__________________________________________________________________________________________________
Normal0 (BatchNormalization)    (None, 495, 220)     880         conv1d[0][0]                     
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 243, 220)     532620      Normal0[0][0]                    
__________________________________________________________________________________________________
Normal1 (BatchNormalization)    (None, 243, 220)     880         conv1d_1[0][0]                   
__________________________________________________________________________________________________
gru_3 (GRU)                     (None, 243, 512)     1127424     Normal1[0][0]                    
__________________________________________________________________________________________________
gru_4 (GRU)                     (None, 243, 512)     1127424     Normal1[0][0]                    
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 243, 1024)    0           gru_3[0][0]                      
                                                                 gru_4[0][0]                      
__________________________________________________________________________________________________
Normal (BatchNormalization)     (None, 243, 1024)    4096        concatenate[0][0]                
__________________________________________________________________________________________________
time_distributed (TimeDistribut (None, 243, 30)      30750       Normal[0][0]                     
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, 243, 30)      0           time_distributed[0][0]           
==================================================================================================
Total params: 3,214,558
Trainable params: 3,211,308
Non-trainable params: 3,250
__________________________________________________________________________________________________
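
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU variant with two bias vectors per gate (`reset_after=True`) and counts four parameters per channel for batch normalization, two of which (moving mean and variance) are non-trainable. The helper names are illustrative.

```python
def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of size kernel_size x in_channels per filter, plus a bias
    return kernel_size * in_channels * filters + filters

def gru_params(in_dim, units):
    # Three gates, each with input weights, recurrent weights, and two biases
    # (assumes reset_after=True).
    return 3 * (units * (in_dim + units) + 2 * units)

def dense_params(in_dim, out_dim):
    return in_dim * out_dim + out_dim

def batchnorm_params(channels):
    # gamma, beta, moving mean, moving variance
    return 4 * channels

bn_channels = [161, 220, 220, 1024]
total = (
    sum(batchnorm_params(c) for c in bn_channels)
    + conv1d_params(161, 220, 11)
    + conv1d_params(220, 220, 11)
    + 2 * gru_params(220, 512)
    + dense_params(1024, 30)
)
non_trainable = sum(2 * c for c in bn_channels)
print(total, total - non_trainable, non_trainable)  # 3214558 3211308 3250
```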

(5) Train on one mini-batch sampled from the input data at a time. Repeating this for 20000 mini-batches, we normalize the input data and fit the model:

x = np.asarray(inp2)
y2 = np.asarray(y2)
x_max = np.max(x)  # normalization constant, computed once
l_train = []
for i in range(20000):
    # Sample 32 indices; the last 25 recordings are never used for training
    samp=random.sample(range(len(inp2)-25),32)
    batch_input=[inp2[j] for j in samp]
    batch_input = np.array(batch_input)
    batch_input = batch_input / x_max
    batch_output = [y2[j] for j in samp]
    batch_output = np.array(batch_output)
    input_lengths2 = [input_lengths[j] for j in samp]
    label_lengths2 = [label_lengths[j] for j in samp]
    input_lengths2 = np.array(input_lengths2)
    label_lengths2 = np.array(label_lengths2)
    inputs = {'the_input': batch_input,
            'the_labels': batch_output,
            'input_length': input_lengths2,
            'label_length': label_lengths2}
    # Dummy targets: the 'ctc' layer already outputs the loss value
    outputs = {'ctc': np.zeros([32])}
    history = model.fit(inputs, outputs, batch_size = 32, epochs=2, verbose =1)
    # Record the loss once every 100 mini-batches
    if i % 100 == 0:
        l_train.append(history.history['loss'][0])

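Note that the CTC loss decreases slowly for this combination of dataset and model, so training takes a long time.

(6) Use the trained model to predict a test audio clip. Define model2, feed it the test array, and obtain the model's prediction at each of the 243 time steps: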

model2 = Model(inputs = input_data, outputs = y_pred)

k=-12
pred= model2.predict(np.array(inp2[k]).reshape(1,999,161)/np.max(inp2))

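The code above takes the 12th sample from the end of the input array and predicts it with the trained model. The input is preprocessed in the same way as during training before being passed to the model.

(7) Define a function that decodes the model's predictions for the test sample using the ctc_decode method. Finally, call the function to decode the prediction and print the result: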

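def decoder(pred):
    # Decode the single input sample into a list of class indices
    pred_ints = (K.eval(K.ctc_decode(pred,[243])[0][0])).flatten().tolist()
    out = ""
    for i in range(len(pred_ints)):
        # Keep valid character indices only (skips any -1 padding value)
        if 0 <= pred_ints[i] < len(charList):
            out = out+charList[pred_ints[i]]
    print(out)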

decoder(pred)

The predicted output is:

AI YOUN MARN KON MAN SOME FHRTATI NER AUHER
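
For intuition, best-path (greedy) CTC decoding can be sketched in a few lines of NumPy: take the most likely class at each time step, merge consecutive repeats, and drop the blank symbol. This is an illustration of the idea rather than the code behind `ctc_decode`. The function name, the three-class setup, and the hand-built probabilities are assumptions.

```python
import numpy as np

def greedy_ctc_decode(probs, blank):
    """Best-path CTC decoding for one sequence of shape (T, C)."""
    best = np.argmax(probs, axis=-1)
    decoded = []
    previous = None
    for idx in best:
        # Emit a symbol only when it differs from the previous time step
        # and is not the blank.
        if idx != previous and idx != blank:
            decoded.append(int(idx))
        previous = idx
    return decoded

vocab = ['A', 'B']          # hypothetical two-character vocabulary
blank = 2                   # blank is the last class
path = [0, 0, 2, 0, 1, 1]   # most likely class at each of 6 time steps
probs = np.full((6, 3), 0.1)
probs[np.arange(6), path] = 0.8

indices = greedy_ctc_decode(probs, blank)
print(''.join(vocab[i] for i in indices))  # AAB
```

The blank between the second and third "A" frames is what allows the doubled letter to survive the merge step.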

Although this output looks garbled, it does sound similar to the actual audio. The transcription accuracy can be improved further in the following ways:

Train on more data samples
Add a natural language processing model that fuzzy-matches the output to correct the predictions
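
As a very rough illustration of the fuzzy-matching idea, the sketch below uses Python's standard `difflib` module to replace each predicted word with its closest match in a word list. A real system would more likely use a language model. The word list, the cutoff, and the function name here are assumptions for demonstration only.

```python
import difflib

def correct_words(text, vocabulary, cutoff=0.6):
    """Replace each word with its closest vocabulary entry, if one is close enough."""
    corrected = []
    for word in text.split():
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return ' '.join(corrected)

vocabulary = ['YOUNG', 'MAN', 'SOME', 'OTHER']   # hypothetical word list
print(correct_words('YOUN MAN', vocabulary))     # YOUNG MAN
```

Words with no sufficiently close match are left unchanged, so the cutoff controls how aggressively the output is rewritten.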

Summary

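Automatic speech recognition (ASR) is an important research direction in artificial intelligence and an important mode of human-computer interaction. How to convert a speech sequence into a text sequence has long been a focus of research. In recent years, neural networks have been applied to speech recognition with rapid progress and have become the mainstream acoustic modeling technique in the field. In this section, we used Keras to implement an end-to-end deep neural network that transcribes audio files into text.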
