How do I get started with speech-to-text?

I'm really interested in speech-to-text algorithms, but I'm not sure where to start studying up on them. A bunch of searching around led me to this, but it's from 1996 and I'm fairly certain that there have been improvements since then.

Does anyone who has any experience with this sort of stuff have any recommendations for reading / source code to examine? Or just general advice on what I should be trying to learn about if I want to get into the world of writing speech recognition programs (sometimes it's hard to know what to search for if you don't have much knowledge about the domain).

Edit: I'd like to do something cross-platform, but for the moment I'd be targeting Linux.

Edit 2: Thanks csmba for the well-thought-out reply. At this point in time, I'm mainly interested in being able to create applications that allow automation, or execution of different commands through voice. So, a limited number of recognizable commands that can be strung together. An example would be a music player that took commands like "Play the album Hello Everything by Squarepusher", or an application launcher that allowed the user to create voice shortcuts to launch specific apps.
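
As a rough illustration of the "limited set of commands" idea above, here is a minimal Python sketch. It assumes some speech engine has already turned the audio into a text transcript; the pattern names, slot names, and the `parse_command` helper are hypothetical and not part of any particular library.

```python
import re

# Hypothetical command patterns; each named group becomes a "slot" of the command.
COMMAND_PATTERNS = {
    "play_album": re.compile(
        r"^play the album (?P<album>.+) by (?P<artist>.+)$", re.IGNORECASE
    ),
    "launch_app": re.compile(r"^(?:launch|open) (?P<app>.+)$", re.IGNORECASE),
}


def parse_command(transcript):
    """Return (command_name, slots) for the first matching pattern, or None."""
    text = " ".join(transcript.split())  # collapse repeated whitespace
    for name, pattern in COMMAND_PATTERNS.items():
        match = pattern.match(text)
        if match:
            return name, match.groupdict()
    return None  # the transcript is outside the small grammar


print(parse_command("Play the album Hello Everything by Squarepusher"))
print(parse_command("open Firefox"))
```

A real engine would usually be given the grammar up front so that it only has to choose between the allowed phrases, rather than transcribing freely and matching afterwards; the sketch only shows the shape of the mapping from phrases to commands.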

I realize that it's a pretty giant problem, and that I have nowhere near the level of knowledge required right now to tackle implementing an entire recognition engine, although the techniques involved fascinate me, and it is something I'd like to work myself up to doing. In all likelihood, I'll end up picking up a book or two on the subject and studying up / playing with "simple" implementations in my free time.

6 Answers

#1

This is a HUGE question; I wouldn't know where to begin... So let me just try giving you the right "terms" so you can refine your search:

First, understand that Speech Recognition is a diverse and complicated subject, and it has many different applications. People tend to map this domain to the first thing that comes to mind (usually, that would be computers understanding what you are saying, as in IVR systems). So first let's divide the concept into its main categories:

Human-to-Machine: Applications that deal with understanding what a human is saying, where the human knows he is talking to a machine and the grammar is very limited. Examples are:

Computer automation

Specialized: pilots automating some controls, for example (noise is a huge problem)

IVR (Interactive Voice Response) systems like Google-411, or when you call the bank and the computer on the other end says "say 'service' to get customer service"

Human-to-Human (spontaneous speech): This is a bigger, more complex problem. Here we can also break it down into different applications:

Call center: conversations between agent and customer; phone-quality, compressed audio

Intelligence: radio/phone/live conversations between 2 or more individuals

Now, Speech-to-Text is not what you should say you care about. What you care about is solving a problem. Different technologies are used to solve different problems. See an overview here of some of them. To summarize, other approaches are phonetic transcription, LVCSR (large-vocabulary continuous speech recognition), and direct based.

Also, are you interested in being the PhD behind the technology? You would need the equivalent of a master's degree involving signal processing, and probably a PhD, to be cutting edge. In that case, you will work for a company that develops the actual speech engine. Companies like Nuance and IBM are the big ones, but Philips and various startups exist too.

On the other hand, if you want to be the one implementing applications, you will not be working on the engine, but on building applications that USE the engine. A good analogy, I think, is from the gaming industry: are you developing the graphics engine (like CryEngine), or working on one of several hundred games that all use the same graphics engine?

Don't get me wrong, there is plenty of work to do on the quality of the search outside the IBMs and Nuances of the world too. The engine is usually very open, and there is a lot of algorithmic tweaking to be done that can dramatically affect performance. Each business application has different constraints and a different cost/benefit function, so you can spend many years experimenting to build better voice-recognition-based applications.
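
To make the cost/benefit point concrete, here is a small Python sketch of one kind of tweak an application builder might make: choosing the confidence threshold below which a recognition result is rejected. Everything here is an assumption for illustration: the function names, the idea that the engine reports a confidence score per result, and the cost figures are all hypothetical.

```python
def total_cost(threshold, scored_results, cost_false_accept, cost_false_reject):
    """Cost of accepting every result whose confidence is >= threshold.

    scored_results: (confidence, was_correct) pairs from a labeled test set.
    """
    cost = 0.0
    for confidence, was_correct in scored_results:
        accepted = confidence >= threshold
        if accepted and not was_correct:
            cost += cost_false_accept  # acted on a misrecognized command
        elif not accepted and was_correct:
            cost += cost_false_reject  # made the user repeat a correct command
    return cost


def best_threshold(scored_results, cost_false_accept, cost_false_reject):
    """Pick the candidate threshold with the lowest total cost (lowest wins ties)."""
    # Candidates: every observed confidence, plus infinity to mean "reject everything".
    candidates = sorted({conf for conf, _ in scored_results}) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(t, scored_results, cost_false_accept, cost_false_reject),
    )
```

With the same labeled results, an application where a wrong action is expensive would end up with a stricter threshold than one where asking the user to repeat is the bigger annoyance.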

One more thing: in general, the lower in the stack you want to work, the stronger a statistics background you will need.

At this point in time, I'm mainly interested in being able to create applications that allow automation

Good, we are converging here... Then you have no interest in "Speech-to-Text". That buzzword takes you to the world of full transcription, a place you do not need to go. You should be focusing on some of the more Human-to-Machine technologies, like VoiceXML and the ones used in IVR systems (Nuance is the biggest player there).
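
For a feel of what VoiceXML looks like, here is a minimal Python sketch that generates a one-menu VoiceXML 2.0 document with the standard library. The `build_menu_vxml` helper, the form ids, and the prompt wording are made up for illustration; a real IVR platform would impose its own requirements on top of this.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_menu_vxml(prompt_text, choices):
    """Build a VoiceXML menu document as a string.

    choices maps a spoken phrase to the id of the form that handles it.
    """
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    menu = ET.SubElement(vxml, "menu", {"id": "main"})
    ET.SubElement(menu, "prompt").text = prompt_text
    for phrase, form_id in choices.items():
        # Saying the phrase jumps to the form with the matching id.
        choice = ET.SubElement(menu, "choice", {"next": "#" + form_id})
        choice.text = phrase
    for form_id in dict.fromkeys(choices.values()):  # one form per distinct id
        form = ET.SubElement(vxml, "form", {"id": form_id})
        ET.SubElement(form, "block").text = "You chose " + form_id + "."
    return ET.tostring(vxml, encoding="unicode")


print(build_menu_vxml(
    "Say service to get customer service, or balance to hear your balance.",
    {"service": "service", "balance": "balance"},
))
```

The point of the format is that the application only declares the prompts and the handful of phrases it accepts; the recognition itself is left to the engine behind the VoiceXML browser.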

#2

I would definitely recommend picking up a book or two if you are new to the field. I've got no experience in the field, so I can't make a recommendation. If you are still in college (or still have close ties), you should find out if any of your professors can make a recommendation.

The survey you linked is probably an excellent resource, too. I'm sure there have been advancements since 1996, but the basics are unlikely to have fundamentally changed. If the survey is well-written, then it would be well worth your time to read it.

#3

For OS X, check this out: OS X Speech Technologies

For Windows, check this out: Microsoft Speech API

#4

I have worked with IBM's ViaVoice product. It has a good ASR (automatic speech recognition) engine, and a nice text-to-speech engine.

The website's not very good, but this is a link for the Embedded version: http://www-01.ibm.com/software/voice/support/

It is platform-agnostic, though, and everything works through an MVC architecture using VXML (VoiceXML), a variant of XML for voice applications.

#5

What platform are you targeting? There are the Microsoft Speech APIs, which you can use if it's for Windows.

#6

There is also the Speech Recognition Service for Android.
