Splitting a sentence into separate words

Time: 2023-01-22 21:10:29

I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走).

At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will:

  1. try to find the first two characters of the sentence in the database (主楼),

  2. if 主楼 is actually a word and it's in the database, the script will try to find the first three characters (主楼怎). 主楼怎 is not a word, so it's not in the database => my application now knows that 主楼 is a separate word.

  3. try to do the same with the rest of the characters (a rough sketch of the whole approach follows below).

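A minimal sketch of that approach (word_exists() is a hypothetical helper standing in for one database query per candidate string; a tiny hardcoded word set keeps the sketch self-contained):

def word_exists(s, _toy_dictionary={u"主楼", u"怎么"}):
    # Hypothetical helper: in the real script this would be one SELECT
    # against the dictionary table per call.
    return s in _toy_dictionary

def naive_segment(sentence, max_word_length=8):
    words = []
    i = 0
    while i < len(sentence):
        # Start with one character and keep extending while the longer
        # string is still in the dictionary (one query per extension).
        j = i + 1
        while j < len(sentence) and j - i < max_word_length and word_exists(sentence[i:j + 1]):
            j += 1
        words.append(sentence[i:j])
        i = j
    return words

print(u" ".join(naive_segment(u"主楼怎么走")))   # -> 主楼 怎么 走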

I don't really like this approach, because analyzing even a small text would mean querying the database far too many times.

Are there any other solutions to this?

11 Answers

#1


6  

Thanks to everyone for your help!

After a little research (and keeping all of your suggestions in mind) I've found some working tools, which is why I'm answering my own question.

  1. A PHP class (http://www.phpclasses.org/browse/package/2431.html)

  2. A Drupal module, basically another PHP solution with 4 different segmentation algorithms (pretty easy to understand how it works) (http://drupal.org/project/csplitter)

  3. A PHP extension for Chinese word segmentation (http://code.google.com/p/phpcws/)

  4. There are some other solutions available if you try searching baidu.com for "中文分词"

Sincerely,

Equ

#2


2  

You might want to consider using a trie data structure. First construct the trie from the dictionary; then searching for valid words will be much faster. The advantage is that determining whether you are at the end of a word, or need to keep looking for a longer word, is very fast.

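A minimal sketch of such a trie, built as nested dicts over a tiny hypothetical word list:

class Trie(object):
    def __init__(self, words=()):
        self.root = {}
        for w in words:
            self.add(w)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[u"#end"] = True  # a complete word ends at this node

    def prefixes(self, text, start=0):
        # Yield every dictionary word that begins at position `start` of `text`.
        node = self.root
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                return
            if u"#end" in node:
                yield text[start:i + 1]

trie = Trie([u"主楼", u"怎么", u"怎么走", u"走"])   # toy dictionary
print(u", ".join(trie.prefixes(u"主楼怎么走", 0)))   # words starting at position 0 -> 主楼
print(u", ".join(trie.prefixes(u"主楼怎么走", 2)))   # words starting at position 2 -> 怎么, 怎么走

Walking the trie one character at a time tells you, in a single pass, both which words end at the current position and whether a longer word is still possible, which is exactly the check described above.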

#3


1  

You have the input text: sentence, paragraph, whatever. So yes, your processing of it will need to query your DB for each check.

With decent indexing on the word column though, you shouldn't have too many problems.

Having said that, how big is this dictionary? After all, you only need the words, not their definitions, to check whether something is a valid word. So if at all possible (depending on the size), having a huge memory map/hashtable/dictionary with just keys (the actual words) may be an option and would be quick as lightning.

At 15 million words, with an average of say 7 characters at 2 bytes each, that works out to around the 200 megabyte mark. Not too crazy.

Edit: At 'only' 1 million words, you're looking at just over 13 megabytes, say 15 with some overhead. That's a no-brainer, I would say.

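A minimal sketch of that in-memory approach, assuming the word column can be dumped to a plain UTF-8 text file (one word per line; the filename is hypothetical):

import codecs

def load_words(path):
    # Keep only the words themselves (no definitions) in a set,
    # so each lookup is a constant-time membership test.
    with codecs.open(path, "r", "utf-8") as f:
        return set(line.strip() for line in f if line.strip())

words = load_words("chinese_words.txt")   # hypothetical dump of the word column
print(u"主楼" in words)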

#4


1  

Another one that works well is http://www.itgrass.com/phpanalysis/index.html

It's the only one that I found that works properly with UTF-8. The rest only worked for me in GB18030, which caused tons of issues later on down the line. I thought I was going to have to start over, but this one saved me a lot of time.

#5


0  

Well, if you have a database with all the words and there is no other way to get at those words, I think you are forced to re-query the database.

#6


0  

To improve the performance of this, can't you do all these checks before you insert the sentence into the database, and add spaces yourself?

#7


0  

(using ABCDE to represent Chinese characters for simplicity)

Let's say you've got the 'sentence' ABCDE as input, and your dictionary contains these words that start with A: AB, ABC, AC, AE, and ABB. And presume that the word CDE exists, but neither DE nor E does.

When parsing the input sentence, going left to right, the script pulls the first character A. Instead of querying the database to see if A is a word, query the database to pull all words that start with A.

Loop through those results, grabbing the next few characters from the input string to get a proper comparison:

AB  ?= AB : True
ABC ?= ABC: True
AC  ?= AB : False
AE  ?= AB : False
ABB ?= ABC: False

At this point the program forks down the two 'true' branches it found. On the first, it presumes AB is the first word, and tries to find C-starting words. CDE is found, so that branch is possible. Down the other branch, ABC is the first word, but DE is not possible, so that branch is invalid, meaning the first must be the true interpretation.

I think this method minimizes the number of calls to the database (though it might return larger sets from the database, as you're fetching sets of words all starting with the same character). If your database were indexed for this sort of searching, I think this would work better than going letter by letter. Looking at this whole process now, and at the other answers, I realize this is essentially a trie structure (with the character searched for as the root of a tree), as another poster had suggested. So the method described here is really an implementation of that idea!

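A rough sketch of that fetch-and-fork idea (words_starting_with() is a hypothetical stand-in for the single "all words starting with X" database query, backed here by the toy dictionary from the example above):

def words_starting_with(ch, _toy_dictionary={u"AB", u"ABC", u"AC", u"AE", u"ABB", u"CDE"}):
    # Stand-in for one query such as: SELECT word FROM dictionary WHERE word LIKE 'A%'
    return [w for w in _toy_dictionary if w.startswith(ch)]

def segmentations(text, start=0):
    # Yield every way of splitting text[start:] into dictionary words;
    # a branch simply dies off when no word matches at the current position.
    if start == len(text):
        yield []
        return
    for word in words_starting_with(text[start]):
        if text.startswith(word, start):
            for rest in segmentations(text, start + len(word)):
                yield [word] + rest

print(list(segmentations(u"ABCDE")))   # only the valid split survives: AB | CDE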

#8


0  

I do realize that the Chinese word segmentation problem is a very complex one, but in some cases this trivial algorithm may be sufficient: search for the longest word w starting at the i-th character, then start again at the (i+length(w))-th character.

Here's a Python implementation:

#!/usr/bin/env python
# encoding: utf-8

import re
import unicodedata
import codecs

class ChineseDict:

    def __init__(self,lines,rex):
        # Keep one word per dictionary line (lines starting with "#" are comments)
        # and remember the longest word so the scan knows how far to look ahead.
        self.words = set(rex.match(line).group(1) for line in lines if not line.startswith("#"))
        self.maxWordLength = max(map(len,self.words))

    def segmentation(self,text):
        result = []
        previousIsSticky = False
        i = 0
        while i < len(text):
            # Greedy longest match: try the longest possible substring first and
            # shrink it until it is a dictionary word (or a single character).
            for j in range(i+self.maxWordLength,i,-1):
                s = text[i:j]
                if s in self.words:
                    break
            # A leftover single character outside Unicode category "Lo" (punctuation,
            # digits, ...) is "sticky": it is glued onto the neighbouring chunk.
            sticky = len(s)==1 and unicodedata.category(s)!="Lo"
            if previousIsSticky or (result and sticky):
                result[-1] += s
            else:
                result.append(s)
            previousIsSticky = sticky
            i = j
        return u" | ".join(result)

    def genWords(self,text):
        # Same greedy scan, but yield only the dictionary words and drop everything else.
        i = 0
        while i < len(text):
            for j in range(i+self.maxWordLength,i,-1):
                s = text[i:j]
                if s in self.words:
                    yield s
                    break
            i = j


if __name__=="__main__":
    # cedict_ts.u8 is the CC-CEDICT dictionary; each entry looks like
    # "漢語 汉语 [han4 yu3] /Chinese language/" and the regex captures the simplified headword.
    cedict = ChineseDict(codecs.open("cedict_ts.u8",'r','utf-8'),re.compile(r"(?u)^.+? (.+?) .+"))
    text = u"""33. 你可以叫我夏尔
    *将军和夫人在科隆贝双教堂村过周末。星期日早晨,伊冯娜无意中走进浴室,正巧将军在洗盆浴。她感到非常意外,不禁大叫一声:“我的上帝!”
    *于是转过身,看见妻子因惊魂未定而站立在门口。他继续用香皂擦身,不紧不慢地说:“伊冯娜,你知道,如果是我们之间的隐私,你可以叫我夏尔,用不着叫我上帝……”
    """
    print cedict.segmentation(text)
    print u" | ".join(cedict.genWords(text))

The last part uses a copy of the CC-CEDICT dictionary to segment a (simplified) Chinese text in two flavours (with and without the non-word characters, respectively):

33. 你 | 可以 | 叫 | 我 | 夏 | 尔
    * | 将军 | 和 | 夫人 | 在 | 科隆 | 贝 | 双 | 教堂 | 村 | 过 | 周末。星期日 | 早晨,伊 | 冯 | 娜 | 无意中 | 走进 | 浴室,正巧 | 将军 | 在 | 洗 | 盆浴。她 | 感到 | 非常 | 意外,不禁 | 大 | 叫 | 一声:“我的 | 上帝!”
    * | 于是 | 转 | 过 | 身,看见 | 妻子 | 因 | 惊魂 | 未定 | 而 | 站立 | 在 | 门口。他 | 继续 | 用 | 香皂 | 擦 | 身,不 | 紧 | 不 | 慢 | 地 | 说:“伊 | 冯 | 娜,你 | 知道,如果 | 是 | 我们 | 之间 | 的 | 隐私,你 | 可以 | 叫 | 我 | 夏 | 尔,用不着 | 叫 | 我 | 上帝……”

你 | 可以 | 叫 | 我 | 夏 | 尔 | * | 将军 | 和 | 夫人 | 在 | 科隆 | 贝 | 双 | 教堂 | 村 | 过 | 周末 | 星期日 | 早晨 | 伊 | 冯 | 娜 | 无意中 | 走进 | 浴室 | 正巧 | 将军 | 在 | 洗 | 盆浴 | 她 | 感到 | 非常 | 意外 | 不禁 | 大 | 叫 | 一声 | 我的 | 上帝 | * | 于是 | 转 | 过 | 身 | 看见 | 妻子 | 因 | 惊魂 | 未定 | 而 | 站立 | 在 | 门口 | 他 | 继续 | 用 | 香皂 | 擦 | 身 | 不 | 紧 | 不 | 慢 | 地 | 说 | 伊 | 冯 | 娜 | 你 | 知道 | 如果 | 是 | 我们 | 之间 | 的 | 隐私 | 你 | 可以 | 叫 | 我 | 夏 | 尔 | 用不着 | 叫 | 我 | 上帝 

#9


0  

A good and fast way to segment Chinese text is Maximum Matching Segmentation, which basically tests words of different lengths to see which combination of segments is most likely. It takes in a list of all possible words to do so.

Read more about it here: http://technology.chtsai.org/mmseg/

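A rough sketch of the core heuristic from that page (look at "chunks" of up to three consecutive candidate words and keep the chunk that covers the most characters; the real MMSEG adds further tie-breaking rules on top of this):

def candidates(words, text, i, max_len=8):
    # Dictionary words starting at position i; a lone character is the fallback.
    found = [text[i:j] for j in range(i + 1, min(len(text), i + max_len) + 1)
             if text[i:j] in words]
    return found or [text[i:i + 1]]

def chunks(words, text, i, depth=3):
    # All chunks of up to `depth` consecutive candidate words starting at position i.
    if depth == 0 or i >= len(text):
        return [[]]
    result = []
    for w in candidates(words, text, i):
        for rest in chunks(words, text, i + len(w), depth - 1):
            result.append([w] + rest)
    return result

def maximum_matching(words, text):
    out = []
    i = 0
    while i < len(text):
        # Keep the chunk whose words cover the most characters; commit to its first word.
        best = max(chunks(words, text, i), key=lambda c: sum(len(w) for w in c))
        out.append(best[0])
        i += len(best[0])
    return out

words = {u"主楼", u"怎么", u"怎么走", u"走"}   # toy dictionary
print(u" | ".join(maximum_matching(words, u"主楼怎么走")))   # -> 主楼 | 怎么 | 走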

That's the method I use in my 读者 (DuZhe) Text Analyzer (http://duzhe.aaginskiy.com). I don't use a database; instead I pre-load a list of words into an array, which does take up about ~2 MB of RAM but executes very quickly.

If you are looking into using lexical segmentation over statistical segmentation (though statistical methods can be as accurate as ~97% according to some research), a very good segmentation tool is ADSOtrans, which can be found here:

http://www.adsotrans.com

It uses a database but has a lot of redundant tables to speed up the segmentation. You can also provide grammatical definitions to assist the segmentation.

Hope this helps.

#10


-1  

This is a fairly standard task in computational linguistics. It goes by the name "tokenization" or "word segmentation." Try searching for "chinese word segmentation" or "chinese tokenization" and you'll find several tools that have been made to do this task, as well as papers about research systems to do it.

To do this well, you typically will need to use a statistical model built by running a machine learning system on a fairly large training corpus. Several of the systems you can find on the web come with pre-trained models.

#11


-3  

You can build a very, very long regular expression.

Edit: I meant building it automatically with a script from the DB, not writing it by hand.

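A rough sketch of what that generated pattern could look like (the word list is a hypothetical stand-in for a dump of the dictionary table; sorting longest-first makes the alternation prefer the longest match at each position):

import re

words = [u"主楼", u"怎么", u"怎么走", u"走"]   # hypothetical dump of the dictionary table

# Longest words first, so that e.g. 怎么走 is tried before 怎么.
pattern = re.compile(u"|".join(re.escape(w) for w in sorted(words, key=len, reverse=True)))

print(u" ".join(pattern.findall(u"主楼怎么走")))   # -> 主楼 怎么走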
