How to prepare your own data for Kaldi

Date: 2021-01-05 20:12:54

Introduction

After running some of Kaldi's example scripts, you will probably want to run Kaldi on your own dataset. This article explains how to prepare the data.

The early part of run.sh deals with data preparation; the scripts under local/ are usually specific to the dataset.

For example, for the RM dataset:

local/rm_data_prep.sh /export/corpora5/LDC/LDC93S3A/rm_comp || exit 1;

utils/prepare_lang.sh data/local/dict '!SIL' data/local/lang data/lang || exit 1;

local/rm_prepare_grammar.sh || exit 1;

And for the WSJ dataset:

wsj0=/export/corpora5/LDC/LDC93S6B
wsj1=/export/corpora5/LDC/LDC94S13B
local/wsj_data_prep.sh $wsj0/??-{?,??}.? $wsj1/??-{?,??}.? || exit 1;
local/wsj_prepare_dict.sh || exit 1;
utils/prepare_lang.sh data/local/dict "<SPOKEN_NOISE>" data/local/lang_tmp data/lang || exit 1;
local/wsj_format_data.sh || exit 1;

The commands WSJ has beyond RM relate to training the language model, but the most important commands are the ones the two recipes share. The output of the data preparation stage consists of two parts: one relating to "the data" (data/train/) and one relating to "the language" (data/lang/). The "data" part concerns the specific recordings; the "lang" part concerns the language itself (the lexicon, the phones, and so on). If you want to decode your prepared data with an existing system and language model, you will still need to become familiar with both parts.

Data preparation-- the "data" part.

As an example, look at the data/train directory of egs/swbd/s5:

s5# ls data/train
cmvn.scp feats.scp reco2file_and_channel segments spk2utt text utt2spk wav.scp
All of these files are important. For simpler setups there is no segmentation information (each utterance corresponds directly to one file). You need to create "utt2spk", "text" and "wav.scp" yourself, and possibly "segments" and "reco2file_and_channel"; the rest can be generated by standard scripts. If the files are named systematically, much of this can be generated automatically by scripts.
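For instance, with a hypothetical naming scheme in which each utterance is one mono file audio/&lt;speaker&gt;-&lt;uttid&gt;.wav, a few lines of shell could produce utt2spk and wav.scp. This is a sketch, not one of Kaldi's scripts, and the directory layout and ids below are invented for illustration:

```shell
# Invented layout: one placeholder file per utterance, named <speaker>-<uttid>.wav.
mkdir -p demo/audio demo/data/train
touch demo/audio/spk1-utt1.wav demo/audio/spk1-utt2.wav demo/audio/spk2-utt1.wav

# utt2spk: "<utterance-id> <speaker-id>", speaker id taken from the filename prefix.
for f in demo/audio/*.wav; do
  utt=$(basename "$f" .wav)   # e.g. spk1-utt1
  spk=${utt%%-*}              # part before the first '-'
  echo "$utt $spk"
done | LC_ALL=C sort > demo/data/train/utt2spk

# wav.scp: "<utterance-id> <path-to-wav>" (no segments file in this simple setup).
for f in demo/audio/*.wav; do
  echo "$(basename "$f" .wav) $PWD/$f"
done | LC_ALL=C sort > demo/data/train/wav.scp
```

Both files are written through LC_ALL=C sort because Kaldi expects its data files in C sorted order, as discussed below.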

Files you need to create yourself

The text file lists each utterance id together with its transcription.

s5# head -3 data/train/text
sw02001-A_000098-001156 HI UM YEAH I'D LIKE TO TALK ABOUT HOW YOU DRESS FOR WORK AND
sw02001-A_001980-002131 UM-HUM
sw02001-A_002736-002893 AND IS

The utterance id here consists of the speaker id as a prefix (2001-A, with the corpus-derived prefix giving sw02001-A) followed by the time-stamp information 000098-001156. Putting the speaker id at the front is useful because it keeps the sorting of the utterance ids consistent with the speaker-related files (utt2spk and spk2utt). When you need a separator between the speaker id and the rest of the utterance id, '-' is a safe choice because it has a low ASCII value; if speaker ids vary in length, other separators can make the sorted order of utterance ids inconsistent with the sorted order of speaker ids (everything is sorted in C order), which can break things downstream.

Another important file is wav.scp.
s5# head -3 data/train/wav.scp
sw02001-A /home/dpovey/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 /export/corpora3/LDC/LDC97S62/swb1/sw02001.sph |
sw02001-B /home/dpovey/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 2 /export/corpora3/LDC/LDC97S62/swb1/sw02001.sph |

The format is:

<recording-id> <extended-filename>

The extended filename may be an actual wav file or, as here, a command (ending in "|") that produces wav data on its standard output. If there is no segments file, the first token on each line of wav.scp is the utterance id instead. The audio referenced by wav.scp must be single-channel; if the underlying files are multi-channel, a tool such as sox can extract the required channel.
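As a sketch of the multi-channel case: sox can pull a single channel out of a stereo file on the fly if the wav.scp entry is written as a command ending in "|". The file name and recording ids here are hypothetical, and this only writes the scp lines; it does not run sox:

```shell
# One stereo conversation becomes two recordings: channel 1 -> side A, channel 2 -> side B.
# The trailing "|" tells Kaldi to run the command and read wav data from its stdout.
wav=/path/to/conversation.wav
{
  echo "conv01-A sox $wav -t wav - remix 1 |"
  echo "conv01-B sox $wav -t wav - remix 2 |"
} > wav.scp
cat wav.scp
```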
The "segments" file:
s5# head -3 data/train/segments
sw02001-A_000098-001156 sw02001-A 0.98 11.56
sw02001-A_001980-002131 sw02001-A 19.8 21.31
sw02001-A_002736-002893 sw02001-A 27.36 28.93
<utterance-id> <recording-id> <segment-begin> <segment-end>
The begin and end times are measured in seconds.
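One easy sanity check on a segments file is that every begin time is strictly less than its end time. A small awk sketch (not one of the standard scripts; the toy segments content below mirrors the example above):

```shell
# A toy segments file in the format <utterance-id> <recording-id> <begin> <end>.
cat > segments <<'EOF'
sw02001-A_000098-001156 sw02001-A 0.98 11.56
sw02001-A_001980-002131 sw02001-A 19.8 21.31
EOF

# Print any line whose begin time is not before its end time, and exit nonzero if found.
awk '$3 >= $4 { print "bad segment: " $0; bad = 1 } END { exit bad }' segments \
  && echo "segments OK"
```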

The "reco2file_and_channel" file is only needed if you score with NIST's sclite tool:
s5# head -3 data/train/reco2file_and_channel
sw02001-A sw02001 A
sw02001-B sw02001 B
sw02005-A sw02005 A
<recording-id> <filename> <recording-side (A or B)>
If you do not have an stm file, you don't need to know anything more about this, and you won't need the reco2file_and_channel file.
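If your recording ids follow the &lt;filename&gt;-&lt;side&gt; pattern shown above, reco2file_and_channel can be derived mechanically from the first column of wav.scp. A sketch assuming exactly that naming convention (the wav.scp content is a stand-in):

```shell
# Stand-in wav.scp: only the first column (the recording id) matters here.
cat > wav.scp <<'EOF'
sw02001-A extended-filename-A |
sw02001-B extended-filename-B |
EOF

# Split each recording id on its last '-' into <filename> and <side>.
awk '{ id = $1; side = id; sub(/.*-/, "", side);
       file = id; sub(/-[^-]*$/, "", file);
       print id, file, side }' wav.scp > reco2file_and_channel
cat reco2file_and_channel
```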
The "utt2spk" file:
s5# head -3 data/train/utt2spk
sw02001-A_000098-001156 2001-A
sw02001-A_001980-002131 2001-A
sw02001-A_002736-002893 2001-A

The format is

<utterance-id> <speaker-id>
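A quick consistency check, in the spirit of what utils/validate_data_dir.sh does more thoroughly: every utterance id in text should also appear in utt2spk. A sketch with toy files (the ids are invented):

```shell
# Toy text and utt2spk files sharing the same utterance ids.
cat > text <<'EOF'
spk1-utt1 HELLO THERE
spk1-utt2 GOOD BYE
EOF
cat > utt2spk <<'EOF'
spk1-utt1 spk1
spk1-utt2 spk1
EOF

# comm -23 prints ids present in text but missing from utt2spk (inputs must be sorted).
cut -d' ' -f1 text    | LC_ALL=C sort > text.ids
cut -d' ' -f1 utt2spk | LC_ALL=C sort > utt2spk.ids
comm -23 text.ids utt2spk.ids > missing.ids
[ -s missing.ids ] || echo "utt2spk covers every utterance in text"
```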
The spk2gender file maps speaker ids to genders:
s5# head -3 ../../rm/s5/data/train/spk2gender
adg0 f
ahh0 m
ajp0 m
The format is:

<speaker-id> <gender>

All of these files must be sorted in C order (export LC_ALL=C before sorting).
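C-order sorting differs from locale-aware sorting: for example, all uppercase letters sort before all lowercase letters, and '-' sorts before alphanumerics. A small demonstration with invented ids, including sort -c, which checks that a file is already in sorted order:

```shell
# In the C locale, 'S' (0x53) sorts before 's' (0x73), and '-' (0x2D) before 'a' (0x61).
printf 'spk1-utt1 a\nSPK10-utt1 b\nspk1a-utt1 c\n' | LC_ALL=C sort

# sort -c exits 0 if the file is already in sorted order, nonzero otherwise.
printf 'A 1\nB 2\n' > already_sorted
LC_ALL=C sort -c already_sorted && echo "already_sorted is in C order"
```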

Files you don't need to create yourself

The "spk2utt" file can be generated from utt2spk with:
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
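What that script produces is simply one line per speaker listing all of that speaker's utterances. The same transformation sketched in awk (for illustration only; use the Kaldi script in practice):

```shell
# Toy utt2spk: <utterance-id> <speaker-id>, already in C sorted order.
cat > utt2spk <<'EOF'
spk1-utt1 spk1
spk1-utt2 spk1
spk2-utt1 spk2
EOF

# Accumulate each speaker's utterances, then print "<speaker-id> <utt1> <utt2> ...".
awk '{ utts[$2] = utts[$2] " " $1 } END { for (s in utts) print s utts[s] }' utt2spk \
  | LC_ALL=C sort > spk2utt
cat spk2utt
```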
The feats.scp file lists each utterance id together with the location of its (e.g. MFCC) features inside an archive (.ark) file:
s5# head -3 data/train/feats.scp
sw02001-A_000098-001156 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:24
sw02001-A_001980-002131 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:54975
sw02001-A_002736-002893 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:62762
<utterance-id> <extended-filename-of-features>

The feats.scp file is created by a command like:
steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir

The cmvn.scp file lists each speaker id together with the location of that speaker's cepstral mean and variance normalization statistics in an archive file:
s5# head -3 data/train/cmvn.scp
2001-A /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:7
2001-B /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:253
2005-A /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:499
cmvn.scp is created by:
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir

You can validate the data directory, and fix common problems, with:
utils/validate_data_dir.sh data/train
utils/fix_data_dir.sh data/train

Data preparation-- the "lang" directory.

The contents of the lang directory:

s5# ls data/lang
L.fst L_disambig.fst oov.int oov.txt phones phones.txt topo words.txt
s5# ls data/lang_test
G.fst L.fst L_disambig.fst oov.int oov.txt phones phones.txt topo words.txt

lang_test/ was created by copying lang/ and adding G.fst.

s5# ls data/lang/phones
context_indep.csl disambig.txt nonsilence.txt roots.txt silence.txt
context_indep.int extra_questions.int optional_silence.csl sets.int word_boundary.int
context_indep.txt extra_questions.txt optional_silence.int sets.txt word_boundary.txt
disambig.csl nonsilence.csl optional_silence.txt silence.csl

There are quite a few files under phones/. Fortunately you, as a Kaldi user, don't have to create all of these yourself: the script utils/prepare_lang.sh creates them all for you from simpler inputs.

Speech recognition needs a decoding graph built from the acoustic model, context model, lexicon and grammar; this is the HCLG construction. The composition works outwards from G: first L is composed with G, then C with LG, then H with CLG.

1.G.fst: The Language Model FST

An FSA grammar, usually built from an n-gram language model; it combines words into word sequences and assigns them language-model scores.

2.L_disambig.fst: The Phonetic Dictionary with Disambiguation Symbols FST

Composing the lexicon FST with G produces LG, an FST with phones on the input and words on the output, i.e., it converts phone sequences into words.

L.fst is the finite-state-transducer form of the lexicon, with phone symbols on the input and word symbols on the output; L_disambig.fst is the same lexicon with disambiguation symbols added.

3.C.fst: The Context FST

C.fst converts context-dependent phones (triphones) to monophones; that is, it expands the phonetic context of the result of step 2 into triphones, and the output of this composition is CLG.

4.H.fst: The HMM FST

H.fst maps HMM states to triphones; that is, it maps the HMM pdf-ids to triphones, expanding the HMMs. After this composition the input is pdf-ids and the output is words; this is HCLG.

HCLG.fst: final graph

Composing the results of steps 1-4 into HCLG is the WFST construction at the heart of graph building. The input of the final graph is pdf-ids, and the output is the corresponding word sequence.

Contents of the "lang" directory

phones.txt and words.txt are both symbol-table files in the OpenFst format; each line contains the text form and then the integer form:
s5# head -3 data/lang/phones.txt
<eps> 0
SIL 1
SIL_B 2
s5# head -3 data/lang/words.txt
<eps> 0
!SIL 1
-'S 2

These files are mostly only accessed by the scripts utils/int2sym.pl and utils/sym2int.pl, and by the OpenFst programs fstcompile and fstprint.

L.fst is the finite-state-transducer form of the lexicon; for details, see "Speech Recognition with Weighted Finite-State Transducers" by Mohri, Pereira and Riley, in the Springer Handbook on Speech Processing and Speech Communication, 2008.

L_disambig.fst is the lexicon, as above, but including the disambiguation symbols #1, #2 and so on.

The file data/lang/oov.txt contains just a single line:

s5# cat data/lang/oov.txt
<UNK>
This is the word that all out-of-vocabulary words are mapped to during training. Its pronunciation uses a phone we designate as a "garbage phone", <SPN> (short for "spoken noise"), which gets aligned with various kinds of spoken noise:

s5# grep -w UNK data/local/dict/lexicon.txt
<UNK> SPN

The file oov.int contains the integer form of this word (its entry in data/lang/words.txt).
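The relationship between oov.txt and oov.int is just a lookup in the words.txt symbol table; a sketch with a toy symbol table (the entries are invented):

```shell
# Toy symbol table and oov.txt.
mkdir -p lang
printf '<eps> 0\n!SIL 1\n<UNK> 2\n' > lang/words.txt
printf '<UNK>\n' > lang/oov.txt

# Map the word in oov.txt to its integer id via words.txt.
awk -v w="$(cat lang/oov.txt)" '$1 == w { print $2 }' lang/words.txt > lang/oov.int
cat lang/oov.int
```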
The file data/lang/topo specifies the topology of the HMMs:
s5# cat data/lang/topo
<Topology>
<TopologyEntry>
<ForPhones>
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.75 <Transition> 1 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.75 <Transition> 2 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State>
<State> 3 </State>
</TopologyEntry>
<TopologyEntry>
<ForPhones>
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.25 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 3 <PdfClass> 3 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 4 <PdfClass> 4 <Transition> 4 0.75 <Transition> 5 0.25 </State>
<State> 5 </State>
</TopologyEntry>
</Topology>
Many of the files under phones/ exist in three forms: .txt (text), .int (integer) and .csl (a colon-separated list of the integers). For example, the first three lines of the context-independent phone list, in text form:

s5# head -3 data/lang/phones/context_indep.txt
SIL
SIL_B
SIL_E
The same in integer form:

s5# head -3 data/lang/phones/context_indep.int
1
2
3
And the whole list of these phones in colon-separated (csl) form:

s5# cat data/lang/phones/context_indep.csl
1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20
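The three forms carry the same information: the .int file is the .txt file mapped through phones.txt, and the .csl file is just the .int entries joined with colons, which can be sketched as:

```shell
# Toy .int file; the .csl form is the same integers joined by ':'.
printf '1\n2\n3\n4\n' > context_indep.int
paste -sd: context_indep.int > context_indep.csl
cat context_indep.csl
```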

context_indep.txt contains the list of phones for which we build context-independent models, i.e., for which we do not build a decision tree that can ask questions about the left and right phonetic context. In fact, we do build smaller trees for these phones, which may only ask questions about the central phone and the HMM state; building those trees depends on the "roots.txt" file. The file roots.txt contains information about how the phonetic-context decision tree is built: "shared" means the HMM states of the listed phones share a single tree root, and "split" means that root may be split by the questions:
head data/lang/phones/roots.txt
shared split SIL SIL_B SIL_E SIL_I SIL_S
shared split SPN SPN_B SPN_E SPN_I SPN_S
shared split NSN NSN_B NSN_E NSN_I NSN_S
shared split LAU LAU_B LAU_E LAU_I LAU_S
...
shared split B_B B_E B_I B_S
In addition, all three states of an HMM (or all five states, for silence) share the tree root.

# cat data/lang/phones/context_indep.txt
SIL
SIL_B
SIL_E
SIL_I
SIL_S
SPN
SPN_B
SPN_E
SPN_I
SPN_S
NSN
NSN_B
NSN_E
NSN_I
NSN_S
LAU
LAU_B
LAU_E
LAU_I
LAU_S

There are many variants of these phones because of word-position dependency (e.g. SIL has the variants SIL_B, SIL_I, SIL_E and SIL_S), and not all of the variants will necessarily ever be used. We also distinguish the phones in silence.txt from those in nonsilence.txt. The "nonsilence" phones are those on which we intend to estimate various kinds of linear transforms: global transforms such as LDA and MLLT, and speaker-adaptation transforms such as fMLLR. It generally does not pay to include silence in the estimation of such transforms.
s5# head -3 data/lang/phones/silence.txt
SIL
SIL_B
SIL_E
s5# head -3 data/lang/phones/nonsilence.txt
IY_B
IY_E
IY_I
s5# head -3 data/lang/phones/disambig.txt
#0
#1
#2

The file optional_silence.txt contains a single phone, which can optionally appear between words:
s5# cat data/lang/phones/optional_silence.txt
SIL
Each line of sets.txt groups together all the word-position-dependent versions of one phone:
s5# head -3 data/lang/phones/sets.txt
SIL SIL_B SIL_E SIL_I SIL_S
SPN SPN_B SPN_E SPN_I SPN_S
NSN NSN_B NSN_E NSN_I NSN_S
 
extra_questions.txt contains extra, hand-specified questions that are used when building the decision trees:
s5# cat data/lang/phones/extra_questions.txt
IY_B B_B D_B F_B G_B K_B SH_B L_B M_B N_B OW_B AA_B TH_B P_B OY_B R_B UH_B AE_B S_B T_B AH_B V_B W_B Y_B Z_B CH_B AO_B DH_B UW_B ZH_B EH_B AW_B AX_B EL_B AY_B EN_B HH_B ER_B IH_B JH_B EY_B NG_B
IY_E B_E D_E F_E G_E K_E SH_E L_E M_E N_E OW_E AA_E TH_E P_E OY_E R_E UH_E AE_E S_E T_E AH_E V_E W_E Y_E Z_E CH_E AO_E DH_E UW_E ZH_E EH_E AW_E AX_E EL_E AY_E EN_E HH_E ER_E IH_E JH_E EY_E NG_E
IY_I B_I D_I F_I G_I K_I SH_I L_I M_I N_I OW_I AA_I TH_I P_I OY_I R_I UH_I AE_I S_I T_I AH_I V_I W_I Y_I Z_I CH_I AO_I DH_I UW_I ZH_I EH_I AW_I AX_I EL_I AY_I EN_I HH_I ER_I IH_I JH_I EY_I NG_I
IY_S B_S D_S F_S G_S K_S SH_S L_S M_S N_S OW_S AA_S TH_S P_S OY_S R_S UH_S AE_S S_S T_S AH_S V_S W_S Y_S Z_S CH_S AO_S DH_S UW_S ZH_S EH_S AW_S AX_S EL_S AY_S EN_S HH_S ER_S IH_S JH_S EY_S NG_S
SIL SPN NSN LAU
SIL_B SPN_B NSN_B LAU_B
SIL_E SPN_E NSN_E LAU_E
SIL_I SPN_I NSN_I LAU_I
SIL_S SPN_S NSN_S LAU_S
The first four questions ask about the word position; the last five do the same for the "silence phones".

The file word_boundary.txt explains how the phones relate to word positions:

s5# head  data/lang/phones/word_boundary.txt
SIL nonword
SIL_B begin
SIL_E end
SIL_I internal
SIL_S singleton
SPN nonword
SPN_B begin
 

Creating the language model or grammar

Source: http://www.kaldi-asr.org/doc/data_prep.html