How do I search for content inside audio files/streams?

I have always wondered how many different search techniques exist for searching text, images and even videos.

However, I have never come across a solution that searches for content within audio files.

For example, let us assume that I have about 200 podcasts downloaded to my PC as mp3, wav and ogg files. They are all named generically, say podcast1.mp3, podcast2.mp3, etc., so it is not possible to know what the content is without actually listening to them. Let's say that I am interested in finding out which of the podcasts talk about 'game programming'. I want the results to be shown as:

Podcast1.mp3 - 3 result(s) at time index(es) - 0:16:21, 0:43:45, 1:12:31

Podcast21.ogg - 1 result(s) at time index(es) - 0:12:01

So my questions:

How could one approach this problem?

Are there any suitable algorithms developed to do something like this?

One idea that cropped up in my mind was that one could use 'speech-to-text' software to get transcripts, along with time indexes, for each of the audio files, then parse the transcripts to get the output.

I was considering this as one of my hobby projects. Thanks!

1 Solution

#1

If you want to search for text (i.e. what is being said) inside an audio stream, you would have to process it with some kind of speech recognition algorithm and store the text as metadata associated with the files. For video, you could also run text recognition on any text that appears in the video. Evernote already does this for text inside image files, but has no support for audio as far as I know.
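To make the "store the text as metadata, then search it" step concrete, here is a minimal sketch in Python. It assumes a speech recognizer has already produced, for each file, a list of transcript segments with start times; the recognizer itself is not shown, and the names `Segment`, `search_transcripts` and `format_results` are made up for illustration rather than taken from any library.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognized chunk of speech (hypothetical structure)."""
    start: float  # seconds from the beginning of the file
    text: str     # what the recognizer thinks was said


def format_time(seconds):
    """Render seconds as H:MM:SS, matching the format in the question."""
    s = int(seconds)
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"


def search_transcripts(transcripts, phrase):
    """Map each file name to the time indexes of segments containing phrase.

    transcripts: dict of file name -> list of Segment.
    Matching is a case-insensitive substring test, so a phrase split
    across two segments is missed.
    """
    needle = phrase.lower()
    results = {}
    for name, segments in transcripts.items():
        hits = [format_time(seg.start)
                for seg in segments
                if needle in seg.text.lower()]
        if hits:
            results[name] = hits
    return results


def format_results(results):
    """Produce one report line per file that had at least one hit."""
    return [
        f"{name} - {len(hits)} result(s) at time index(es) - {', '.join(hits)}"
        for name, hits in results.items()
    ]
```

Recognition errors will cause missed or spurious hits, so in practice you would probably want fuzzy matching or a proper full-text index on top of this.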

Something similar is possible when using audio to search for audio. I don't know the details of these algorithms, but I'm guessing they involve some kind of frequency analysis. Shazam uses this kind of technology to identify songs from audio clips.
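As a rough illustration of what "frequency analysis" could look like, here is a toy sketch in Python. It is not how Shazam or any real product works: it just records the strongest frequency bin in each fixed-size frame and looks for the clip's sequence of bins inside the track's sequence. It assumes clean, frame-aligned input, and the names `dominant_bin`, `fingerprint`, `find_clip` and `tone` are invented for this example.

```python
import cmath
import math


def dominant_bin(frame):
    """Index of the strongest non-DC frequency bin, via a naive DFT."""
    n = len(frame)
    best_bin, best_mag = 1, -1.0
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin


def fingerprint(samples, frame_size=64):
    """One dominant bin per non-overlapping frame of the signal."""
    return [dominant_bin(samples[i:i + frame_size])
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def find_clip(track_fp, clip_fp):
    """Frame offsets at which the clip fingerprint occurs in the track."""
    m = len(clip_fp)
    return [i for i in range(len(track_fp) - m + 1)
            if track_fp[i:i + m] == clip_fp]


def tone(bin_index, frame_size=64):
    """Synthetic test frame: a sine with bin_index cycles per frame."""
    return [math.sin(2 * math.pi * bin_index * i / frame_size)
            for i in range(frame_size)]
```

Multiplying a returned frame offset by `frame_size` and dividing by the sample rate gives the time index of the match. A usable system would need overlapping windows, several peaks per frame and tolerance to noise, none of which this sketch attempts.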

Here are some Wikipedia articles that may be useful:
