What does an audio frame contain?

Time: 2021-07-20 13:28:36

I'm doing some research on how to compare sound files (WAV). Basically I want to compare stored sound files (WAV) with sound from a microphone. So in the end I would like to pre-store some voice commands of my own, and then when I'm running my app I would like to compare the pre-stored files with the input from the microphone.

My thought was to allow some margin when comparing, because saying something twice in a row in exactly the same way would be difficult, I guess.

So after some googling I see that Python has a module named wave with a Wave_read object. That object has a method named readframes(n):

Reads and returns at most n frames of audio, as a string of bytes.

What do these bytes contain? I'm thinking of looping through the wave files one frame at a time, comparing them frame by frame.
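
Something like this minimal sketch is what I have in mind (the file names are just placeholders, and it assumes both files were recorded with the same parameters):

    import wave

    # Placeholder names for the pre-stored command and the microphone recording.
    with wave.open("stored.wav", "rb") as a, wave.open("mic.wav", "rb") as b:
        # A byte-level comparison only makes sense if channel count,
        # sample width and sample rate match.
        assert a.getparams()[:3] == b.getparams()[:3]

        mismatches = 0
        while True:
            frame_a = a.readframes(1)  # one frame = the bytes for one sample point, all channels
            frame_b = b.readframes(1)
            if not frame_a or not frame_b:
                break
            if frame_a != frame_b:
                mismatches += 1

        print("mismatching frames:", mismatches)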

4 solutions

#1


30  

An audio frame, or sample, contains amplitude (loudness) information at that particular point in time. To produce sound, tens of thousands of frames are played in sequence to produce frequencies.

In the case of CD quality audio or uncompressed wave audio, there are around 44,100 frames/samples per second. Each of those frames contains 16 bits of resolution, allowing for fairly precise representations of the sound levels. Also, because CD audio is stereo, there is actually twice as much information: 16 bits for the left channel, 16 bits for the right.

When you use the sound module in Python to get a frame, it will be returned as a series of hexadecimal characters:

  • One character for an 8-bit mono signal.
  • Two characters for 8-bit stereo.
  • Two characters for 16-bit mono.
  • Four characters for 16-bit stereo.

In order to convert and compare these values you'll have to first use the Python wave module's functions to check the bit depth and number of channels. Otherwise, you'll be comparing mismatched quality settings.
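
As a rough sketch (the file name is a placeholder), getsampwidth() and getnchannels() give you the bytes per sample and the channel count, which together determine how many bytes each frame occupies and how to unpack it:

    import struct
    import wave

    with wave.open("command.wav", "rb") as w:  # placeholder file name
        sample_width = w.getsampwidth()   # bytes per sample: 1 for 8-bit, 2 for 16-bit
        channels = w.getnchannels()       # 1 for mono, 2 for stereo
        bytes_per_frame = sample_width * channels

        raw = w.readframes(1)             # one frame, i.e. bytes_per_frame bytes
        if sample_width == 2:
            # 16-bit WAV samples are signed little-endian shorts, one per channel.
            samples = struct.unpack("<%dh" % channels, raw)
        else:
            # 8-bit WAV samples are unsigned bytes, one per channel.
            samples = struct.unpack("%dB" % channels, raw)
        print(bytes_per_frame, samples)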

#2


8  

A simple byte-by-byte comparison has almost no chance of a successful match, even with some tolerance thrown in. Voice-pattern recognition is a very complex and subtle problem that is still the subject of much research.

#3


6  

The first thing you should do is a Fourier transform to convert the data into its frequencies. It is rather complex, however. I wouldn't use voice recognition libraries here, as it sounds like you aren't recording only voices. You would then try different time shifts (in case the sounds are not exactly aligned) and use the one that gives you the best similarity, where you have to define a similarity function. Oh, and you should normalize both signals (same maximum loudness).
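
One possible sketch of that recipe with numpy (assumed to be available; a and b are the two signals already decoded to 1-D float arrays at the same sample rate):

    import numpy as np

    def similarity(a, b):
        # Normalize both signals to the same maximum loudness.
        a = a / max(np.max(np.abs(a)), 1e-12)
        b = b / max(np.max(np.abs(b)), 1e-12)

        # Try all time shifts via cross-correlation and keep the best alignment.
        # (np.correlate is O(n*m); an FFT-based correlation is faster for long clips.)
        corr = np.correlate(a, b, mode="full")
        shift = int(np.argmax(corr)) - (len(b) - 1)
        if shift > 0:
            a = a[shift:]
        else:
            b = b[-shift:]
        n = min(len(a), len(b))
        a, b = a[:n], b[:n]

        # Compare magnitude spectra (the Fourier transform step) with cosine similarity.
        fa = np.abs(np.fft.rfft(a))
        fb = np.abs(np.fft.rfft(b))
        return float(np.dot(fa, fb) / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-12))

The score lands between 0 and 1 (the magnitude spectra are non-negative), so the "margin" from the question becomes a threshold you would tune on your own recordings.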

#4


6  

I believe the accepted description to be slightly incorrect.

A frame appears to be somewhat like stride in graphics formats. For interleaved stereo @ 16 bits/sample, the frame size is 2*sizeof(short)=4 bytes. For non-interleaved stereo @ 16 bits/sample, the samples of the left channel are all one after another, so the frame size is just sizeof(short).
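
In terms of Python's wave module, which only deals with the standard interleaved layout, that frame size is simply the product of the two values mentioned in answer #1 (the file name below is a placeholder):

    import wave

    with wave.open("stereo16.wav", "rb") as w:  # placeholder file name
        # One interleaved frame holds one sample per channel, so its size in
        # bytes is nchannels * sampwidth (2 * 2 = 4 for 16-bit stereo).
        frame_size = w.getnchannels() * w.getsampwidth()
        print(frame_size, len(w.readframes(1)))  # both print 4 for a 16-bit stereo file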
