How to search for a person's name in a text? (heuristics)

Date: 2022-02-22 06:17:44

I have a huge list of people's full names that I must search for in a huge text.

Only part of a name may appear in the text, and it may be misspelled, mistyped, or abbreviated. The text is not tokenized, so I don't know where a person's name starts, and I don't know whether a given name will appear in the text at all.

Example:

I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:


  • ...The candidate Barack Obama was elected the president of the United States... (incomplete)

  • ...The candidate Barack Hussein was elected the president of the United States... (incomplete)

  • ...The candidate Barack H. O. was elected the president of the United States... (abbreviated)

  • ...The candidate Barack ObaNa was elected the president of the United States... (misspelled)

  • ...The candidate Barack OVama was elected the president of the United States... (mistyped, B is next to V)

  • ...The candidate John McCain lost the election... (no occurrence of the Obama name)

Certainly there isn't a deterministic solution for this, but...

What is a good heuristic for this kind of search?

If you had to, how would you do it?

8 Answers

#1


6  

You said it's about 200 pages.

Divide it into 200 one-page PDFs.

Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.

#2


5  

Split everything on spaces, removing special characters (commas, periods, etc.). Then use something like Soundex to handle misspellings. Or you could go with something like Lucene if you need to search a lot of documents.
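
A minimal sketch of this pipeline in Python (Soundex is implemented inline so no extra library is needed; the name list and sample text are illustrative):

```python
import re

def soundex(word):
    """American Soundex: first letter plus three digits."""
    codes = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2", "q": "2",
             "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3", "l": "4",
             "m": "5", "n": "5", "r": "6"}
    word = word.lower()
    out = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w do not reset the previous code
            prev = code
    return (out + "000")[:4]

def tokens(text):
    """Split on whitespace and strip special characters (commas, periods, etc.)."""
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

# Index the Soundex code of every part of every listed name...
name_codes = {soundex(p) for p in "Barack Hussein Obama".split()}

# ...then flag text tokens whose code matches a name part.
text = "The candidate Barack OVama was elected the president"
hits = [t for t in tokens(text) if soundex(t) in name_codes]
# hits -> ["Barack", "OVama"]
```

Soundex collapses "Obama", "ObaNa", and "OVama" to the same code (O150), which covers the misspelled and mistyped cases; abbreviations like "H." still need separate handling.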

#3


2  

What you want is a Natural Language Processing library. You are trying to identify a subset of proper nouns. If names are the main source of proper nouns, it will be easy; if a decent number of other proper nouns are mixed in, it will be more difficult. If you are writing in Java, look at OpenNLP; in C#, SharpNLP. After extracting all the proper nouns you could probably use WordNet to remove most non-name proper nouns. You may be able to use WordNet to identify subparts of names like "John" and then search the neighboring tokens to pick up the other parts of the name. You will have problems with something like "John Smith Industries". You will have to look at your underlying data to see if there are features you can take advantage of to narrow the problem.

Using an NLP solution is the only really robust technique I have seen for similar problems. You may still have issues, since 200 pages is actually fairly small. Ideally you would have more text and be able to use more statistical techniques to help disambiguate between names and non-names.

#4


1  

At first blush I'd go for an indexing server: Lucene, FAST, or Microsoft Indexing Server.

#5


1  

I would use C# and LINQ. I'd tokenize all the words on spaces and then use LINQ to sort the text (and possibly use the Distinct() function) to isolate all the text I'm interested in. When manipulating the text I'd keep track of the indexes (which you can do with LINQ) so that I could relocate the text in the original document, if that's a requirement.
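
The same idea can be sketched in Python standing in for the C#/LINQ version; `re.finditer` plays the role of the index tracking, and the capitalization filter is an illustrative stand-in for the sort/Distinct() step:

```python
import re

text = "The candidate Barack Obama was elected the president"

# Tokenize on spaces, recording each token's start index so a match
# can later be relocated in the original document.
tokens = [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]

# Isolate the distinct tokens of interest (here: capitalized words),
# mimicking the LINQ sort + Distinct() filtering step.
candidates = sorted({(w, i) for w, i in tokens if w[0].isupper()})
# candidates -> [("Barack", 14), ("Obama", 21), ("The", 0)]
```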

#6


0  

The best way I can think of would be to define grammars in Python's NLTK. However, it can get quite complicated for what you want.

I'd personally go for regular expressions, generating a list of permutations with some programming.
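
A hedged sketch of that approach in Python (the permutation rule, each name part may appear in full, as an initial, or be dropped, is one illustrative choice; misspellings are not covered here):

```python
import re

def name_pattern(full_name):
    """Build a regex in which each part of the name may appear in full
    or abbreviated to an initial ("H."), and every part after the
    first may be omitted entirely."""
    parts = [rf"(?:{re.escape(p)}|{p[0]}\.)" for p in full_name.split()]
    return parts[0] + "".join(rf"(?:\s+{p})?" for p in parts[1:])

pattern = re.compile(name_pattern("Barack Hussein Obama"))

m = pattern.search("The candidate Barack H. O. was elected")
# m.group() -> "Barack H. O."
```

This handles the incomplete and abbreviated cases ("Barack Obama", "Barack H. O.") but still misses misspellings like "ObaNa"; in practice you would combine it with a phonetic or edit-distance check.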

#7


0  

Both SQL Server and Oracle have built-in SOUNDEX functions.

Additionally, SQL Server has a built-in function called DIFFERENCE that can be used.

#8


-1  

Plain old regular expression scripting will do the job.

Use Ruby; it's quite fast. Read lines and match words.

cheers
