How can I index a lot of txt files? (Java/C/C++)

I need to index a lot of text. The search results must give me the names of the files containing the query and all of the positions where the query matched in each file, so that I don't have to load the whole file to find the matching portion. What libraries can you recommend for doing this?

Update: Lucene has been suggested. Can you give me some info on how I should use Lucene to achieve this? (I have seen examples where the search query returned only the matching files.)

8 solutions

#1

I believe the Lucene term for what you are looking for is highlighting. Here is a very recent report on Lucene highlighting. You will probably need to store word position information in order to get the snippets you are looking for. The Token API may help.
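
To illustrate why stored position information helps, here is a minimal sketch in plain Java. It is not Lucene's API. It assumes the index has already given you a byte offset and length for a match, and it assumes ASCII text so that byte offsets and character offsets coincide. The class name SnippetReader and the method snippet are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reads only the bytes around a known match offset.
class SnippetReader {
    // Returns the match at [offset, offset + length) plus up to `context`
    // bytes on each side. Assumes ASCII text (an assumption of this sketch).
    static String snippet(Path file, long offset, int length, int context) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long start = Math.max(0, offset - context);
            long end = Math.min(raf.length(), offset + length + context);
            if (start >= end) {
                return ""; // offset lies beyond the end of the file
            }
            byte[] buf = new byte[(int) (end - start)];
            raf.seek(start);     // jump straight to the region of interest
            raf.readFully(buf);  // read only that region
            return new String(buf, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("snippet", ".txt");
        Files.write(tmp, "the quick brown fox jumps".getBytes(StandardCharsets.US_ASCII));
        // "brown" starts at offset 10 and is 5 bytes long.
        System.out.println(snippet(tmp, 10, 5, 4));
        Files.delete(tmp);
    }
}
```

With offsets stored at index time, the cost of showing a result depends on the snippet size rather than the file size. For multi-byte encodings you would need to store byte offsets explicitly or decode more carefully.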

#2

For Java, try Lucene.

#3

It all depends on how you are going to access it. And of course, how many are going to access it. Read up on MapReduce.

If you are going to roll your own, you will need to create an index file, which is essentially a map from each unique word to a list of tuples like (file, line, offset). Of course, you can also consider other in-memory data structures, such as a trie (prefix tree) or a Judy array.
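
A minimal in-memory sketch of such a map, assuming words are runs of letters or digits and matching is case-insensitive. The class and method names (PositionalIndex, Posting, addDocument, lookup) are hypothetical, and a real index would be persisted to disk rather than held in a HashMap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical positional inverted index: word -> list of (file, line, offset).
class PositionalIndex {
    // One occurrence of a word. line is 1-based; offset is the 0-based
    // character position of the word within that line.
    static final class Posting {
        final String file;
        final int line;
        final int offset;

        Posting(String file, int line, int offset) {
            this.file = file;
            this.line = line;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return file + ":" + line + ":" + offset;
        }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Splits the text into lines, then into words, and records each word's position.
    void addDocument(String file, String text) {
        String[] lines = text.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            int pos = 0;
            while (pos < line.length()) {
                // Skip separators between words.
                while (pos < line.length() && !Character.isLetterOrDigit(line.charAt(pos))) {
                    pos++;
                }
                int start = pos;
                // Consume one word.
                while (pos < line.length() && Character.isLetterOrDigit(line.charAt(pos))) {
                    pos++;
                }
                if (pos > start) {
                    String word = line.substring(start, pos).toLowerCase(Locale.ROOT);
                    postings.computeIfAbsent(word, k -> new ArrayList<>())
                            .add(new Posting(file, i + 1, start));
                }
            }
        }
    }

    // Returns every recorded occurrence of the word, or an empty list.
    List<Posting> lookup(String word) {
        List<Posting> hits = postings.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Posting>emptyList() : hits;
    }

    public static void main(String[] args) {
        PositionalIndex index = new PositionalIndex();
        index.addDocument("a.txt", "the quick fox\nthe lazy dog");
        index.addDocument("b.txt", "a fox again");
        for (Posting p : index.lookup("fox")) {
            System.out.println(p);
        }
    }
}
```

A lookup returns the file names and the positions together, so only the matching region of each file needs to be read afterwards.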

Some 3rd party solutions are listed here.

#4

Have a look at http://www.compass-project.org/. It can be seen as a wrapper on top of Lucene. Compass simplifies common usage patterns of Lucene, such as Google-style search and index updates, as well as more advanced concepts such as caching and index sharding (sub-indexes). Compass also uses built-in optimizations for concurrent commits and merges.

The Overview can give you more info: http://www.compass-project.org/overview.html

I integrated this into a Spring project in no time. It is really easy to use and gives what your users will see as Google-like results.

#5

Lucene - Java

It's open source as well, so you are free to use and deploy it in your application.

As far as I know, the Eclipse IDE help system is powered by Lucene, so it has been tested by millions of users.

#6

Also take a look at Lemur Toolkit.

#7

Why don't you try to construct a state machine by reading all the files? Transitions between states will be letters, and states will be either final (some files contain the word spelled out by the path to that state, in which case the list of files is stored there) or intermediate.

As for multiple-word lookups, you'll have to look up each word independently and then intersect the results.

I believe the Boost::Statechart library may be of some help for that matter.
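
The state-machine idea described above amounts to a trie whose final states carry file lists. Here is a rough Java sketch under that reading. It does not use Boost::Statechart, and the names WordTrie, add, filesContaining and filesContainingAll are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical trie used as a letter-by-letter state machine: each node is a
// state, each edge is a character, and final states list the files that
// contain the word spelled by the path from the root.
class WordTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final Set<String> files = new TreeSet<>(); // non-empty only in final states
    }

    private final Node root = new Node();

    // Records that `file` contains `word`, creating states as needed.
    void add(String word, String file) {
        Node state = root;
        for (char c : word.toCharArray()) {
            state = state.next.computeIfAbsent(c, k -> new Node());
        }
        state.files.add(file);
    }

    // Follows one transition per letter; returns a copy of the file set
    // at the reached state, or an empty set if a transition is missing.
    Set<String> filesContaining(String word) {
        Node state = root;
        for (char c : word.toCharArray()) {
            state = state.next.get(c);
            if (state == null) {
                return new TreeSet<>();
            }
        }
        return new TreeSet<>(state.files);
    }

    // Multi-word lookup: each word is looked up independently,
    // then the per-word results are intersected.
    Set<String> filesContainingAll(String... words) {
        Set<String> result = null;
        for (String word : words) {
            Set<String> hit = filesContaining(word);
            if (result == null) {
                result = hit;
            } else {
                result.retainAll(hit);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    public static void main(String[] args) {
        WordTrie trie = new WordTrie();
        trie.add("fox", "a.txt");
        trie.add("dog", "a.txt");
        trie.add("fox", "b.txt");
        System.out.println(trie.filesContaining("fox"));
        System.out.println(trie.filesContainingAll("fox", "dog"));
    }
}
```

This sketch stores only file names in the final states. Storing positions as well would mean replacing the file set with a list of (file, offset) entries.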

#8

I'm aware you asked for a library, just wanted to point you to the underlying concept of building an inverted index (from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze).
