Open-Source Speech Recognition -- DeepSpeech (2): Training on the Chinese Dataset thchs30

The THCHS-30 dataset

THCHS-30 is a public Mandarin speech dataset of about 30 hours, released by Tsinghua University.

Download:
http://www.openslr.org/18/

Data preprocessing

Alphabet

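Build alphabet.txt following the sample alphabet file that ships in DeepSpeech's data directory.
[Image in the original post: example alphabet.txt]
alphabet.txt must contain every single character that occurs in the train, dev, and test transcripts.
A short Python script is enough to generate it.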
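One way to generate the file is sketched below in Python. It assumes the transcript lines of all three splits have already been collected into a list of strings. The names `build_alphabet` and `write_alphabet` and the `include_space` switch are illustrative and not part of DeepSpeech. Whether a space entry is needed depends on whether your transcripts contain spaces.

```python
from typing import Iterable, List


def build_alphabet(transcripts: Iterable[str], include_space: bool = True) -> List[str]:
    """Collect every distinct non-whitespace character found in the transcripts.

    The characters are sorted by code point so that repeated runs produce the
    same file. If include_space is True, a single space is placed first
    (an assumption that fits transcripts whose characters are space-separated).
    """
    chars = set()
    for line in transcripts:
        chars.update(ch for ch in line if not ch.isspace())
    ordered = sorted(chars)
    return ([" "] if include_space else []) + ordered


def write_alphabet(chars: Iterable[str], path: str) -> None:
    """Write one character per line; the last line also ends with a newline."""
    with open(path, "w", encoding="utf-8") as f:
        for ch in chars:
            f.write(ch + "\n")
```

A call such as `write_alphabet(build_alphabet(all_lines), "alphabet.txt")` would then produce the file, where `all_lines` is a hypothetical list holding the train, dev, and test transcripts together.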

Vocabulary

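Strip the punctuation from every sentence in the data and split the sentence into single characters, separated by spaces so that lmplz treats each character as one token. Write one sentence per line to form vocabulary.txt.
[Image in the original post: example vocabulary.txt]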
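The punctuation stripping and character splitting can be sketched as follows. This is a minimal version that treats every Unicode character in a punctuation category as punctuation. The function names are illustrative, and other symbols (for example currency signs) are left in place, so extend the filter if your data needs it.

```python
import unicodedata
from typing import Iterable


def to_char_line(sentence: str) -> str:
    """Drop whitespace and punctuation, then join the remaining characters with spaces."""
    kept = [
        ch
        for ch in sentence
        # Unicode general categories starting with "P" are punctuation,
        # which covers both ASCII and full-width Chinese marks.
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]
    return " ".join(kept)


def write_vocabulary(sentences: Iterable[str], path: str) -> None:
    """Write one processed sentence per line, skipping sentences that end up empty."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            line = to_char_line(sentence)
            if line:
                f.write(line + "\n")
```

The same `to_char_line` output can be reused for the transcript column of the CSV files, which keeps the language model and the training transcripts tokenized in the same way.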

CSV

Create three files, train.csv, dev.csv, and test.csv, one for each of the train, dev, and test sets.
Each CSV file has three columns:

wav_filename,wav_filesize,transcript
The resulting files look like this:
[Image in the original post: sample CSV contents]
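A possible way to write such a file with Python's standard `csv` module is shown below. It assumes you already have (wav path, transcript) pairs for one split. The helper names are hypothetical, and storing absolute paths is a choice made here, not a confirmed DeepSpeech requirement.

```python
import csv
import os
from typing import Dict, Iterable, List, Tuple

FIELDS = ["wav_filename", "wav_filesize", "transcript"]


def make_rows(pairs: Iterable[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Turn (wav_path, transcript) pairs into rows with the three expected columns."""
    rows = []
    for wav_path, transcript in pairs:
        rows.append(
            {
                "wav_filename": os.path.abspath(wav_path),
                "wav_filesize": os.path.getsize(wav_path),  # size in bytes
                "transcript": transcript,
            }
        )
    return rows


def write_csv(rows: Iterable[Dict[str, object]], csv_path: str) -> None:
    """Write the header line followed by one line per utterance."""
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_csv(make_rows(train_pairs), "train.csv")`, and likewise for dev and test, yields the three files, with `train_pairs` standing in for whatever list you build from the dataset.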

lm.bin

Use kenlm to build the binary language model:

./kenlm/build/bin/lmplz -o 3 --text vocabulary.txt --arpa word.arpa
./kenlm/build/bin/build_binary -T -s word.arpa lm.bin

This produces the file lm.bin.

trie

./deepSpeech/native_client/generate_trie alphabet.txt lm.bin trie

This produces the trie file.

Training

Layout

Once everything is prepared, the files are laid out as follows, where TRAIN, DEV, and TEST are directories:
[Image in the original post: directory layout]

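Writing the .sh run script

The script below is adapted from https://discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot/22830

Create thch30_run.sh in deepSpeech/bin. The paths are those of the referenced tutorial, so point them at your own files (for example, the language model built above is named lm.bin):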

#!/bin/sh
set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi;

python -u DeepSpeech.py \
  --train_files /home/nvidia/DeepSpeech/data/alfred/train/train.csv \
  --dev_files /home/nvidia/DeepSpeech/data/alfred/dev/dev.csv \
  --test_files /home/nvidia/DeepSpeech/data/alfred/test/test.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 33 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir /home/nvidia/DeepSpeech/data/alfred/results/model_export/ \
  --checkpoint_dir /home/nvidia/DeepSpeech/data/alfred/results/checkout/ \
  --decoder_library_path /home/nvidia/tensorflow/bazel-bin/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /home/nvidia/DeepSpeech/data/alfred/alphabet.txt \
  --lm_binary_path /home/nvidia/DeepSpeech/data/alfred/lm.binary \
  --lm_trie_path /home/nvidia/DeepSpeech/data/alfred/trie \
  "$@"

Running

./bin/thch30_run.sh

References:

1. https://blog.yuwu.me/?p=3989
2. https://discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot/22830
3. https://github.com/mozilla/DeepSpeech/issues/1756
