Let's say I have a few billion lines of text and a few million "keywords". The task is to go through these lines and see which line contains which keywords. In other words, given a map of (K1 -> V1) and a map of (K2 -> V2), create a map of (K2 -> K1), where K1 = lineID, V1 = text, K2 = keywordID and V2 = keyword. Note also that:
- All text/keywords are English
- Text (V1) may contain spelling mistakes.
- Most keywords (V2) are single words, but some keywords may consist of more than one English word (e.g. "clean towel")
So far, my initial idea for solving this is as follows:
1) Chop up all my keywords into single words and create a large set of single words (K3)
2) Construct a BK-tree out of these chopped-up keywords, using Levenshtein distance (a minimal sketch follows this list)
3) For each line of data (V1),
3.1) Chop up the text (V1) into words
3.2) For each such word,
3.2.1) Retrieve the words (K3) from the BK-tree that are close enough to it
3.3) Since at this point we still have false positives (e.g. we would have matched "clean" from "clean water" against the keyword "clean towel"), we check all possible combinations against a trie of the keywords (V2) to filter such false positives out. We construct this trie so that at the end of a successful match, the keywordID (K2) can be retrieved.
3.4) Return the correct set of keywordIDs (K2) for this line (V1)!
4) Profit!
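For concreteness, here is a minimal self-contained Java sketch of step 2 and the lookup in step 3.2.1 (the BkTree class and its method names are illustrative, not from any library):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal BK-tree over Levenshtein distance; illustrative only.
public class BkTree {
    private final String word;
    private final Map<Integer, BkTree> children = new HashMap<>();

    public BkTree(String word) { this.word = word; }

    public void add(String newWord) {
        int d = levenshtein(newWord, word);
        if (d == 0) return; // word already present
        BkTree child = children.get(d);
        if (child == null) children.put(d, new BkTree(newWord));
        else child.add(newWord);
    }

    // Collects every stored word within maxDist edits of the query.
    public void search(String query, int maxDist, List<String> out) {
        int d = levenshtein(query, word);
        if (d <= maxDist) out.add(word);
        // Triangle inequality: only children whose edge label lies in
        // [d - maxDist, d + maxDist] can contain matches.
        for (int i = Math.max(1, d - maxDist); i <= d + maxDist; i++) {
            BkTree child = children.get(i);
            if (child != null) child.search(query, maxDist, out);
        }
    }

    // Standard two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        BkTree tree = new BkTree("clean");
        tree.add("towel");
        tree.add("water");
        List<String> hits = new ArrayList<>();
        tree.search("cleen", 1, hits); // tolerate one spelling mistake
        System.out.println(hits);      // [clean]
    }
}
```

Build it once from the chopped-up keyword set (K3); a maxDist of 1 or 2 is the usual trade-off between catching spelling mistakes and pulling in extra false candidates for step 3.3.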
My questions
- Is this a good approach? Efficiency is very important -- are there any better ways? Anything to improve?
- Are there any libraries I could use? Preferably something that would work well with Java.
Thanks in advance!
2 Answers
#1
0
There are optimised multi-pattern / 2D search algorithms out there (Aho-Corasick, for example); don't reinvent the wheel. You should also think about distributing your computation, perhaps with Hadoop and map/reduce.
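To make the map/reduce suggestion concrete, here is a hedged sketch of a Hadoop mapper (the class name and the matchKeywords stub are hypothetical, not from any library); the reducer then receives exactly the K2 -> K1 grouping the question asks for:

```java
import java.io.IOException;
import java.util.Collections;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: emits one (keywordID, lineID) pair per match.
public class KeywordMatchMapper
        extends Mapper<LongWritable, Text, LongWritable, LongWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // With TextInputFormat the input key is the line's byte offset,
        // which can serve as a unique line identifier (K1).
        for (long keywordId : matchKeywords(line.toString())) {
            ctx.write(new LongWritable(keywordId), offset);
        }
    }

    // Stub: plug in whichever matcher you settle on (BK-tree, Lucene, ...),
    // typically built once in setup() from a distributed-cache file.
    private Iterable<Long> matchKeywords(String line) {
        return Collections.emptyList();
    }
}
```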
#2
0
I'm not sure, but what you are expecting here (K2->K1) is very similar to an inverted index (http://en.wikipedia.org/wiki/Inverted_index).
I believe Lucene/Solr uses the same algorithm while indexing data (it also does pre-index analysis/tokenization); you might need to figure out a way to read Lucene-built indexes (start with the "IndexReader" javadoc for Lucene).
While indexing your data, consider each line as one document in the Lucene index and create two fields in your index: 1) line ID and 2) data. Once you have indexed all documents (lines), you already have K2->K1 created for you; you just need to find a way to parse it.
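A minimal sketch of that two-field scheme, assuming a reasonably recent Lucene (the 5.x+ API style; the field names and index path are my own choices):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class LineIndexer {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("line-index"));
             IndexWriter writer = new IndexWriter(
                     dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // In practice you would stream the billions of lines here.
            String[] lines = { "clean towel on the shelf", "clear water" };
            for (int lineId = 0; lineId < lines.length; lineId++) {
                Document doc = new Document();
                doc.add(new StoredField("lineId", lineId));    // K1, retrievable
                doc.add(new TextField("data", lines[lineId],   // V1, analyzed
                                      Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}
```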
I am not sure what your next steps are after creating K2->K1; if it's fast lookup you need, you don't have to parse your indexes at all, you can just fire Lucene queries.
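For the lookup side, a hedged sketch against the same assumed index as above: one FuzzyQuery per keyword also tolerates the spelling mistakes the question mentions, since it matches up to two edits by default:

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class KeywordLookup {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("line-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Keyword terms must match the analyzer's output
            // (StandardAnalyzer lowercases), hence lowercase "towel".
            FuzzyQuery query = new FuzzyQuery(new Term("data", "towel"));
            for (ScoreDoc hit : searcher.search(query, 100).scoreDocs) {
                // The stored lineId of each hit is a K1 for this keyword (K2).
                System.out.println(searcher.doc(hit.doc)
                        .getField("lineId").numericValue());
            }
        }
    }
}
```

Multi-word keywords such as "clean towel" need something more, e.g. a SpanNearQuery wrapping per-word fuzzy queries (via SpanMultiTermQueryWrapper).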
In Solr you can also generate faceted search results over the indexes, if that helps.
EDIT: You can use the LUKE tool to analyze Lucene indexes (https://code.google.com/p/luke/).