Graduation Project - Speech Keyword Spotting System Based on Deep Neural Networks - Word Frequency Statistics with a Python Script - Librispeech

After TIMIT, this time we analyze the word frequencies of Librispeech. Its files are organized as follows.

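The dev-clean folder under the librispeech folder contains several levels of nested subfolders. Each leaf folder holds one txt file with the transcripts, plus multiple audio files that are readings of those transcripts.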
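To make the layout concrete, here is a minimal sketch that walks a folder tree and reports, for each leaf folder, how many transcript files and audio files it holds. The function name `summarize_leaf_dirs`, the `.flac` audio extension, and the demo file names are assumptions made for illustration; the demo tree is made up rather than copied from the corpus.

```python
import os
import tempfile

def summarize_leaf_dirs(root_dir, audio_ext=".flac"):
    """Return {leaf_dir: (n_transcripts, n_audio)} for folders without subfolders."""
    summary = {}
    for parent, dirnames, filenames in os.walk(root_dir):
        if dirnames:
            continue  # this folder has subfolders, so it is not a leaf
        n_txt = sum(1 for name in filenames if name.endswith(".txt"))
        n_audio = sum(1 for name in filenames if name.endswith(audio_ext))
        summary[parent] = (n_txt, n_audio)
    return summary

# Build a tiny made-up tree (assumed layout: one transcript, several audio files).
demo_root = tempfile.mkdtemp()
leaf = os.path.join(demo_root, "speaker", "chapter")
os.makedirs(leaf)
for name in ("a.trans.txt", "a-0000.flac", "a-0001.flac"):
    open(os.path.join(leaf, name), "w").close()

layout = summarize_leaf_dirs(demo_root)
print(layout)
```

On the real data, calling `summarize_leaf_dirs("dev-clean")` should show one transcript file per leaf folder if the layout matches the description above.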


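The script's task is to read the words in every txt transcript and count them. Each line of a transcript starts with an utterance ID, followed by the text that was read.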
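As a small illustration of that line format, the sketch below splits one line into its ID and its words. The helper name `parse_transcript_line` and the sample line are made up for this example and are not taken from the corpus.

```python
def parse_transcript_line(line):
    """Split one transcript line into (utterance_id, list_of_words)."""
    parts = line.strip().split(None, 1)  # split once, on the first run of whitespace
    if not parts:
        return None, []  # blank line
    utt_id = parts[0]
    words = parts[1].split() if len(parts) > 1 else []
    return utt_id, words

# Made-up line in the "ID WORD WORD ..." layout.
sample = "0000-0000-0000 HELLO WORLD HELLO\n"
utt_id, words = parse_transcript_line(sample)
print(utt_id, words)
```

Splitting on whitespace rather than slicing by character position keeps the last word intact even when the final line of a file has no trailing newline.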


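The text consists entirely of uppercase words. We work in two steps:

1. Use os.walk() to traverse all files and write the path of every txt file into pathDoc.txt

2. Read the transcript paths from pathDoc.txt, open each file, record the words and their frequencies in a Python dictionary, and write the word frequency statistics to keyWords.txt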

The code is as follows:

import os
import os.path

rootDir = "dev-clean"
outDir = "dreaminghzAnalysedData"
pathDocFile = os.path.join(outDir, "pathDoc.txt")
keyWordsFile = os.path.join(outDir, "keyWords.txt")


#functions used
def keywordCounter(fileDir, keywordContainer):
    """Take a transcript file path and a dictionary, and count every word into the dictionary"""
    with open(fileDir) as f:
        for row in f:
            #the first field is the utterance ID, the rest is the text
            words = row.split()[1:]
            for word in words:
                if word in keywordContainer:
                    keywordContainer[word] += 1
                else:
                    keywordContainer[word] = 1


#make sure the output folder exists before writing into it
if not os.path.isdir(outDir):
    os.makedirs(outDir)

#1
print("Step 1: Get the path of every transcript file inside this folder")
#the os.walk() loop is adapted from a cnblog post
#it yields (folder, subfolder names, file names) for every folder under rootDir
with open(pathDocFile, "w") as transPathDoc:
    for parent, dirnames, filenames in os.walk(rootDir):
        for filename in filenames:
            if filename.endswith(".txt"):
                transPathDoc.write(os.path.join(parent, filename) + "\n")
print("Step 1 finished")

#2
print("Step 2: Read in all the transcript files and do the word counting")
keywordContainer = {}
#read in each path and call the function keywordCounter to process it
with open(pathDocFile) as pathes:
    for pathTmp in pathes:
        pathTmp = pathTmp.rstrip("\n")
        if pathTmp:
            keywordCounter(pathTmp, keywordContainer)
print("Step 2 finished")
#claim
print("There are " + str(len(keywordContainer)) + " keywords in total")

#3
print("Step 3: Save keywords into " + keyWordsFile)
with open(keyWordsFile, "w") as outfile:
    for ks in keywordContainer:
        outfile.write(ks + " " + str(keywordContainer[ks]) + "\n")
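For comparison, the same counting can be sketched in a single pass with `collections.Counter` from the standard library, skipping the intermediate path file. This is an alternative sketch, not the script used above; the function name `count_words` and the tiny demo tree are assumptions made so the example runs on its own.

```python
import os
import tempfile
from collections import Counter

def count_words(root_dir):
    """Count the words in every .txt transcript under root_dir."""
    counts = Counter()
    for parent, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            if not filename.endswith(".txt"):
                continue
            with open(os.path.join(parent, filename)) as f:
                for line in f:
                    # Drop the leading utterance ID and count the remaining words.
                    counts.update(line.split()[1:])
    return counts

# Tiny made-up tree so the function can be tried without the corpus.
demo_root = tempfile.mkdtemp()
leaf = os.path.join(demo_root, "1", "2")
os.makedirs(leaf)
with open(os.path.join(leaf, "1-2.trans.txt"), "w") as f:
    f.write("1-2-0000 THE CAT SAT\n1-2-0001 THE DOG\n")

demo_counts = count_words(demo_root)
print(demo_counts.most_common(1))
```

The path file from step 1 is still handy if you want to inspect which transcripts were found, so the two-step version keeps that advantage.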

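In addition, reading keyWords.txt back in lets us list the words whose frequency is at or above a required threshold, which helps with choosing keywords and with later experiments: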

"""This script is used for finding high frequency keywords that appears more than given time number"""
#use the file generated by LibriWordCounter.py; keyWords.txt is expected in the current working directory
num = 0
while num <= 0:
    try:
        num = int(input("Input the lower bound of frequency as a positive integer: "))
    except ValueError:
        #not an integer, ask again
        num = 0

highFFilename = "keywords-noless-" + str(num) + "-times.txt"

#read in and write down qualified keywords
ctr = 0
with open("keyWords.txt") as kw, open(highFFilename, "w") as highFkw:
    for row in kw:
        unit = row.split()
        if len(unit) != 2:
            #skip blank or malformed lines
            continue
        key, frequency = unit[0], int(unit[1])
        if frequency >= num:
            highFkw.write(key + " " + str(frequency) + "\n")
            ctr += 1

print("Finished, " + str(ctr) + " records were written into " + highFFilename)
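When picking keywords it can also help to see the qualifying words ordered by frequency. The sketch below applies the same threshold to lines in the "WORD COUNT" layout and sorts the result, most frequent first. The function name `filter_by_frequency` and the demo lines are made up for illustration.

```python
def filter_by_frequency(lines, lower_bound):
    """Return (word, count) pairs with count >= lower_bound, most frequent first."""
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts[0], int(parts[1])
        if count >= lower_bound:
            kept.append((word, count))
    # Sort by descending count, then alphabetically to break ties.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept

# Made-up lines in the "WORD COUNT" layout written by the counting script.
demo_lines = ["THE 120\n", "CAT 3\n", "AND 57\n", "\n"]
for word, count in filter_by_frequency(demo_lines, 50):
    print(word, count)
```

With the real data, an open file object can be passed in place of `demo_lines`, since the function only iterates over lines.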
