波形文件分析C (libsndfile, fftw3)

时间:2021-12-24 19:42:53

I'm trying to develop a simple C application that can give a value from 0-100 at a certain frequency range at a given timestamp in a WAV-file.

我正在尝试开发一个简单的C应用程序,该应用程序可以在一个特定的频率范围内,在一个给定的时间戳中,在一个wave文件中给出一个0-100的值。

Example: I have frequency range of 44.1kHz (typical MP3 file) and I want to split that range into n amount of ranges (starting from 0). I then need to get the amplitude of each range, being from 0 to 100.

例子:我的频率范围是44.1kHz(典型的MP3文件),我想把这个范围分成n个范围(从0开始),然后我需要得到每个范围的振幅,从0到100。

What I've managed so far:

到目前为止我所管理的:

Using libsndfile I'm now able to read the data of a WAV-file.

使用libsndfile,我现在可以读取一个波浪文件的数据。

infile = sf_open(argv [1], SFM_READ, &sfinfo);

float samples[sfinfo.frames];

sf_read_float(infile, samples, 1);

However, my understanding of FFT is rather limited. But I know it's required inorder to get the amplitudes at the ranges I need. But how do I move on from here? I found the library FFTW-3, which seems to be suited for the purpose.

然而,我对FFT的理解相当有限。但我知道要得到我所需要的范围的振幅是必须的。但我该如何离开这里呢?我找到了图书馆的fftwc -3,它似乎很适合这个用途。

I found some help here: https://*.com/a/4371627/1141483

我在这里找到了一些帮助:https://*.com/a/4371627/1141483。

and looked at the FFTW tutorial here: http://www.fftw.org/fftw2_doc/fftw_2.html

并查看了FFTW教程:http://www.fftw.org/fftw2_doc/fftw_2.html。

But as I'm unsure about the behaviour of the FFTW, I don't know to progress from here.

但由于我对FFTW的行为不确定,我不知道从这里取得进展。

And another question, assuming you use libsndfile: If you force the reading to be single channeled (with a stereo file) and then read the samples. Will you then actually only be reading half of the samples of the total file? As half of them being from channel 1, or does automaticly filter those out?

另一个问题是,假设您使用libsndfile:如果您强迫读取是单通道的(带有一个立体声文件),然后读取示例。那么,您实际上只阅读了全部文件的一半样本吗?一半是来自通道1,还是自动过滤掉?

Thanks a ton for your help.

多谢你的帮助。

EDIT: My code can be seen here:

编辑:我的代码可以在这里看到:

double blackman_harris(int n, int N){
double a0, a1, a2, a3, seg1, seg2, seg3, w_n;
a0 = 0.35875;
a1 = 0.48829;
a2 = 0.14128;
a3 = 0.01168;

seg1 = a1 * (double) cos( ((double) 2 * (double) M_PI * (double) n) / ((double) N - (double) 1) );
seg2 = a2 * (double) cos( ((double) 4 * (double) M_PI * (double) n) / ((double) N - (double) 1) );
seg3 = a3 * (double) cos( ((double) 6 * (double) M_PI * (double) n) / ((double) N - (double) 1) );

w_n = a0 - seg1 + seg2 - seg3;
return w_n;
}

int main (int argc, char * argv [])
{   char        *infilename ;
SNDFILE     *infile = NULL ;
FILE        *outfile = NULL ;
SF_INFO     sfinfo ;


infile = sf_open(argv [1], SFM_READ, &sfinfo);

int N = pow(2, 10);

fftw_complex results[N/2 +1];
double samples[N];

sf_read_double(infile, samples, 1);


double normalizer;
int k;
for(k = 0; k < N;k++){
    if(k == 0){

        normalizer = blackman_harris(k, N);

    } else {
        normalizer = blackman_harris(k, N);
    }

}

normalizer = normalizer * (double) N/2;



fftw_plan p = fftw_plan_dft_r2c_1d(N, samples, results, FFTW_ESTIMATE);

fftw_execute(p);


int i;
for(i = 0; i < N/2 +1; i++){
    double value = ((double) sqrtf(creal(results[i])*creal(results[i])+cimag(results[i])*cimag(results[i]))/normalizer);
    printf("%f\n", value);

}



sf_close (infile) ;

return 0 ;
} /* main */

1 个解决方案

#1


13  

Well it all depends on the frequency range you're after. An FFT works by taking 2^n samples and providing you with 2^(n-1) real and imaginary numbers. I have to admit I'm quite hazy on what exactly these values represent (I've got a friend who has promised to go through it all with me in lieu of a loan I made him when he had financial issues ;)) other than an angle around a circle. Effectively they provide you with an arccos of the angle parameter for a sine and cosine for each frequency bin from which the original 2^n samples can be, perfectly, reconstructed.

这取决于你所追求的频率范围。一个FFT作品通过2 ^ n样品和为你提供2 ^(n - 1)真实的和想象的数字。我必须承认,我对这些价值观究竟代表了什么(我有一个朋友,他曾承诺,在他有财务问题的时候,我就会把这些钱用在我身上),而不是一个围绕着一个圈子的角度。他们为您提供一个有效arccos角参数的每个频率的正弦和余弦本的原始2 ^ n样本可以完美,重建。

Anyway this has the huge advantage that you can calculate magnitude by taking the euclidean distance of the real and imaginary parts (sqrtf( (real * real) + (imag * imag) )). This provides you with an unnormalised distance value. This value can then be used to build a magnitude for each frequency band.

无论如何,这有一个巨大的优势,你可以通过计算实部和虚部的欧几里得距离来计算大小(sqrtf((真实的*实数)+ (imag * imag))。这为您提供了一个不正常的距离值。这个值可以用来为每个频段建立一个大小。

So lets take an order 10 FFT (2^10). You input 1024 samples. You FFT those samples and you get 512 imaginary and real values back (the particular ordering of those values depends on the FFT algorithm you use). So this means that for a 44.1Khz audio file each bin represents 44100/512 Hz or ~86Hz per bin.

那么,我们来订购10台FFT(2 10)吧。你输入1024个样本。您可以得到512个虚构的和真实的值(这些值的特定顺序取决于您使用的FFT算法)。这意味着,对于一个44.1Khz的音频文件,每个bin代表44100/512 Hz或~86Hz / bin。

One thing that should stand out from this is that if you use more samples (from whats called the time or spatial domain when dealing with multi dimensional signals such as images) you get better frequency representation (in whats called the frequency domain). However you sacrifice one for the other. This is just the way things go and you will have to live with it.

如果你使用更多的样本(在处理多维信号例如图像时,从所谓的时间或空间域)得到更好的频率表示(在所谓的频率域中)。然而,你为另一个牺牲了一个。这就是事情的发展方向,你将不得不接受它。

Basically you will need to tune the frequency bins and time/spatial resolution to get the data you require.

基本上你需要调整频率箱和时间/空间分辨率来获得你需要的数据。

First a bit of nomenclature. The 1024 time domain samples I referred to earlier is called your window. Generally when performing this sort of process you will want to slide the window on by some amount to get the next 1024 samples you FFT. The obvious thing to do would be to take samples 0->1023, then 1024->2047, and so forth. This unfortunately doesn't give the best results. Ideally you want to overlap the windows to some degree so that you get a smoother frequency change over time. Most commonly people slide the window on by half a window size. ie your first window will be 0->1023 the second 512->1535 and so on and so forth.

首先是命名法。我之前提到的1024个时间域样本称为您的窗口。一般情况下,在执行这类过程时,你会想要在窗口上滑动一些,以获得下一个1024个样品。显然要做的是取0->1023,然后是1024->2047,等等。不幸的是,这并没有给出最好的结果。理想情况下,你想要在某种程度上重叠窗口,这样你就能得到更平滑的频率随时间变化。最常见的情况是,人们会把窗户打开一半的窗户。你的第一个窗口将是0->1023第二个512->1535等等。

Now this then brings up one further problem. While this information provides for perfect inverse FFT signal reconstruction it leaves you with a problem that frequencies leak into surround bins to some extent. To solve this issue some mathematicians (far more intelligent than me) came up with the concept of a window function. The window function provides for far better frequency isolation in the frequency domain though leads to a loss of information in the time domain (ie its impossible to perfectly re-construct the signal after you have used a window function, AFAIK).

这又引出了一个问题。虽然这一信息提供了完美的反FFT信号重构,但它给你留下了一个问题,频率在一定程度上泄漏到环绕箱。为了解决这个问题,一些数学家(比我聪明得多)提出了一个窗口函数的概念。在频域中,窗口函数提供了更好的频率隔离,但会导致在时域上丢失信息(例如,在使用了窗口函数AFAIK后,无法完全重构信号)。

Now there are various types of window function ranging from the rectangular window (effectively doing nothing to the signal) to various functions that provide far better frequency isolation (though some may also kill surrounding frequencies that may be of interest to you!!). There is, alas, no one size fits all but I'm a big fan (for spectrograms) of the blackmann-harris window function. I think it gives the best looking results!

现在有各种各样的窗口函数,从矩形窗口(有效地对信号没有作用)到提供更好的频率隔离的各种函数(尽管有些可能会杀死您感兴趣的周围频率!)。可惜的是,没有一种尺寸适合所有人,但我是blackmann-harris窗口功能的超级粉丝。我认为它给出了最好的结果!

However as I mentioned earlier the FFT provides you with an unnormalised spectrum. To normalise the spectrum (after the euclidean distance calculation) you need to divide all the values by a normalisation factor (I go into more detail here).

然而,正如我前面提到的,FFT为您提供了一个不正常的频谱。为了使光谱标准化(在欧氏距离计算之后),你需要将所有的值除以一个标准化因子(我在这里详细讨论)。

this normalisation will provide you with a value between 0 and 1. So you could easily multiple this value by 100 to get your 0 to 100 scale.

这种标准化将为您提供介于0和1之间的值。你可以很容易地将这个值乘以100得到0到100的比例。

This, however, is not where it ends. The spectrum you get from this is rather unsatisfying. This is because you are looking at the magnitude using a linear scale. Unfortunately the human ear hears using a logarithmic scale. This rather causes issues with how a spectrogram/spectrum looks.

然而,这并不是它的终点。你从中得到的光谱是相当不令人满意的。这是因为你用线性尺度来观察大小。不幸的是,人类的耳朵使用的是对数刻度。这就导致了谱图/光谱的外观问题。

To get round this you need to convert these 0 to 1 values (I'll call it 'x') to the decibel scale. The standard transformation is 20.0f * log10f( x ). This will then provide you a value whereby 1 has converted to 0 and 0 has converted to -infinity. your magnitudes are now in the appropriate logarithmic scale. However its not always that helpful.

为了解决这个问题,你需要将0到1的值(我称之为x)转换为分贝刻度。标准转换为20.0f * log10f(x)。这将为你提供一个值,其中1转换为0,而0转换为-∞。你的大小现在是合适的对数尺度。然而,它并不总是那么有帮助。

At this point you need to look into the original sample bit depth. At 16-bit sampling you get a value that is between 32767 and -32768. This means your dynamic range is fabsf( 20.0f * log10f( 1.0f / 65536.0f ) ) or ~96.33dB. So now we have this value.

此时,您需要查看原始的样本位深度。在16位采样时,你得到的值在32767到-32768之间。这意味着您的动态范围是fabsf(20.0f * log10f(1.0f / 65536.0f))或~96.33dB。现在我们有了这个值。

Take the values we've got from the dB calculation above. Add this -96.33 value to it. Obviously the maximum amplitude (0) is now 96.33. Now didivde by that same value and you nowhave a value ranging from -infinity to 1.0f. Clamp the lower end to 0 and you now have a range from 0 to 1 and multiply that by 100 and you have your final 0 to 100 range.

从上面的dB计算中得到我们得到的值。加上这个-96.33。显然,最大振幅(0)现在是96.33。现在didivde的值是一样的,从-∞到1。0f。将低端点设置为0,现在你有一个范围从0到1,然后乘以100,最后0到100的范围。

And that is much more of a monster post than I had originally intended but should give you a good grounding in how to generate a good spectrum/spectrogram for an input signal.

这比我原先打算的要多得多,但应该给你一个很好的基础,如何为输入信号生成一个好的频谱/光谱图。

and breathe

和呼吸

Further reading (for people other than the original poster who has already found it):

进一步阅读(除了已经找到的原始海报以外的人):

Converting an FFT to a spectogram

将FFT转换为spectogram。

Edit: As an aside I found kiss FFT far easier to use, my code to perform a forward fft is as follows:

编辑:顺便说一下,我发现kiss FFT更容易使用,我的代码执行一个forward FFT是这样的:

CFFT::CFFT( unsigned int fftOrder ) :
    BaseFFT( fftOrder )
{
    mFFTSetupFwd    = kiss_fftr_alloc( 1 << fftOrder, 0, NULL, NULL );
}

bool CFFT::ForwardFFT( std::complex< float >* pOut, const float* pIn, unsigned int num )
{
    kiss_fftr( mFFTSetupFwd, pIn, (kiss_fft_cpx*)pOut );
    return true;
}

#1


13  

Well it all depends on the frequency range you're after. An FFT works by taking 2^n samples and providing you with 2^(n-1) real and imaginary numbers. I have to admit I'm quite hazy on what exactly these values represent (I've got a friend who has promised to go through it all with me in lieu of a loan I made him when he had financial issues ;)) other than an angle around a circle. Effectively they provide you with an arccos of the angle parameter for a sine and cosine for each frequency bin from which the original 2^n samples can be, perfectly, reconstructed.

这取决于你所追求的频率范围。一个FFT作品通过2 ^ n样品和为你提供2 ^(n - 1)真实的和想象的数字。我必须承认,我对这些价值观究竟代表了什么(我有一个朋友,他曾承诺,在他有财务问题的时候,我就会把这些钱用在我身上),而不是一个围绕着一个圈子的角度。他们为您提供一个有效arccos角参数的每个频率的正弦和余弦本的原始2 ^ n样本可以完美,重建。

Anyway this has the huge advantage that you can calculate magnitude by taking the euclidean distance of the real and imaginary parts (sqrtf( (real * real) + (imag * imag) )). This provides you with an unnormalised distance value. This value can then be used to build a magnitude for each frequency band.

无论如何,这有一个巨大的优势,你可以通过计算实部和虚部的欧几里得距离来计算大小(sqrtf((真实的*实数)+ (imag * imag))。这为您提供了一个不正常的距离值。这个值可以用来为每个频段建立一个大小。

So lets take an order 10 FFT (2^10). You input 1024 samples. You FFT those samples and you get 512 imaginary and real values back (the particular ordering of those values depends on the FFT algorithm you use). So this means that for a 44.1Khz audio file each bin represents 44100/512 Hz or ~86Hz per bin.

那么,我们来订购10台FFT(2 10)吧。你输入1024个样本。您可以得到512个虚构的和真实的值(这些值的特定顺序取决于您使用的FFT算法)。这意味着,对于一个44.1Khz的音频文件,每个bin代表44100/512 Hz或~86Hz / bin。

One thing that should stand out from this is that if you use more samples (from whats called the time or spatial domain when dealing with multi dimensional signals such as images) you get better frequency representation (in whats called the frequency domain). However you sacrifice one for the other. This is just the way things go and you will have to live with it.

如果你使用更多的样本(在处理多维信号例如图像时,从所谓的时间或空间域)得到更好的频率表示(在所谓的频率域中)。然而,你为另一个牺牲了一个。这就是事情的发展方向,你将不得不接受它。

Basically you will need to tune the frequency bins and time/spatial resolution to get the data you require.

基本上你需要调整频率箱和时间/空间分辨率来获得你需要的数据。

First a bit of nomenclature. The 1024 time domain samples I referred to earlier is called your window. Generally when performing this sort of process you will want to slide the window on by some amount to get the next 1024 samples you FFT. The obvious thing to do would be to take samples 0->1023, then 1024->2047, and so forth. This unfortunately doesn't give the best results. Ideally you want to overlap the windows to some degree so that you get a smoother frequency change over time. Most commonly people slide the window on by half a window size. ie your first window will be 0->1023 the second 512->1535 and so on and so forth.

首先是命名法。我之前提到的1024个时间域样本称为您的窗口。一般情况下,在执行这类过程时,你会想要在窗口上滑动一些,以获得下一个1024个样品。显然要做的是取0->1023,然后是1024->2047,等等。不幸的是,这并没有给出最好的结果。理想情况下,你想要在某种程度上重叠窗口,这样你就能得到更平滑的频率随时间变化。最常见的情况是,人们会把窗户打开一半的窗户。你的第一个窗口将是0->1023第二个512->1535等等。

Now this then brings up one further problem. While this information provides for perfect inverse FFT signal reconstruction it leaves you with a problem that frequencies leak into surround bins to some extent. To solve this issue some mathematicians (far more intelligent than me) came up with the concept of a window function. The window function provides for far better frequency isolation in the frequency domain though leads to a loss of information in the time domain (ie its impossible to perfectly re-construct the signal after you have used a window function, AFAIK).

这又引出了一个问题。虽然这一信息提供了完美的反FFT信号重构,但它给你留下了一个问题,频率在一定程度上泄漏到环绕箱。为了解决这个问题,一些数学家(比我聪明得多)提出了一个窗口函数的概念。在频域中,窗口函数提供了更好的频率隔离,但会导致在时域上丢失信息(例如,在使用了窗口函数AFAIK后,无法完全重构信号)。

Now there are various types of window function ranging from the rectangular window (effectively doing nothing to the signal) to various functions that provide far better frequency isolation (though some may also kill surrounding frequencies that may be of interest to you!!). There is, alas, no one size fits all but I'm a big fan (for spectrograms) of the blackmann-harris window function. I think it gives the best looking results!

现在有各种各样的窗口函数,从矩形窗口(有效地对信号没有作用)到提供更好的频率隔离的各种函数(尽管有些可能会杀死您感兴趣的周围频率!)。可惜的是,没有一种尺寸适合所有人,但我是blackmann-harris窗口功能的超级粉丝。我认为它给出了最好的结果!

However as I mentioned earlier the FFT provides you with an unnormalised spectrum. To normalise the spectrum (after the euclidean distance calculation) you need to divide all the values by a normalisation factor (I go into more detail here).

然而,正如我前面提到的,FFT为您提供了一个不正常的频谱。为了使光谱标准化(在欧氏距离计算之后),你需要将所有的值除以一个标准化因子(我在这里详细讨论)。

this normalisation will provide you with a value between 0 and 1. So you could easily multiple this value by 100 to get your 0 to 100 scale.

这种标准化将为您提供介于0和1之间的值。你可以很容易地将这个值乘以100得到0到100的比例。

This, however, is not where it ends. The spectrum you get from this is rather unsatisfying. This is because you are looking at the magnitude using a linear scale. Unfortunately the human ear hears using a logarithmic scale. This rather causes issues with how a spectrogram/spectrum looks.

然而,这并不是它的终点。你从中得到的光谱是相当不令人满意的。这是因为你用线性尺度来观察大小。不幸的是,人类的耳朵使用的是对数刻度。这就导致了谱图/光谱的外观问题。

To get round this you need to convert these 0 to 1 values (I'll call it 'x') to the decibel scale. The standard transformation is 20.0f * log10f( x ). This will then provide you a value whereby 1 has converted to 0 and 0 has converted to -infinity. your magnitudes are now in the appropriate logarithmic scale. However its not always that helpful.

为了解决这个问题,你需要将0到1的值(我称之为x)转换为分贝刻度。标准转换为20.0f * log10f(x)。这将为你提供一个值,其中1转换为0,而0转换为-∞。你的大小现在是合适的对数尺度。然而,它并不总是那么有帮助。

At this point you need to look into the original sample bit depth. At 16-bit sampling you get a value that is between 32767 and -32768. This means your dynamic range is fabsf( 20.0f * log10f( 1.0f / 65536.0f ) ) or ~96.33dB. So now we have this value.

此时,您需要查看原始的样本位深度。在16位采样时,你得到的值在32767到-32768之间。这意味着您的动态范围是fabsf(20.0f * log10f(1.0f / 65536.0f))或~96.33dB。现在我们有了这个值。

Take the values we've got from the dB calculation above. Add this -96.33 value to it. Obviously the maximum amplitude (0) is now 96.33. Now didivde by that same value and you nowhave a value ranging from -infinity to 1.0f. Clamp the lower end to 0 and you now have a range from 0 to 1 and multiply that by 100 and you have your final 0 to 100 range.

从上面的dB计算中得到我们得到的值。加上这个-96.33。显然,最大振幅(0)现在是96.33。现在didivde的值是一样的,从-∞到1。0f。将低端点设置为0,现在你有一个范围从0到1,然后乘以100,最后0到100的范围。

And that is much more of a monster post than I had originally intended but should give you a good grounding in how to generate a good spectrum/spectrogram for an input signal.

这比我原先打算的要多得多,但应该给你一个很好的基础,如何为输入信号生成一个好的频谱/光谱图。

and breathe

和呼吸

Further reading (for people other than the original poster who has already found it):

进一步阅读(除了已经找到的原始海报以外的人):

Converting an FFT to a spectogram

将FFT转换为spectogram。

Edit: As an aside I found kiss FFT far easier to use, my code to perform a forward fft is as follows:

编辑:顺便说一下,我发现kiss FFT更容易使用,我的代码执行一个forward FFT是这样的:

CFFT::CFFT( unsigned int fftOrder ) :
    BaseFFT( fftOrder )
{
    mFFTSetupFwd    = kiss_fftr_alloc( 1 << fftOrder, 0, NULL, NULL );
}

bool CFFT::ForwardFFT( std::complex< float >* pOut, const float* pIn, unsigned int num )
{
    kiss_fftr( mFFTSetupFwd, pIn, (kiss_fft_cpx*)pOut );
    return true;
}