Quickly implementing indexed text search of a very large file?

Time: 2021-03-17 04:15:38

I have a single text file that is about 500GB (i.e. a very large log file) and would like to build an implementation to search it quickly.

So far I have created my own inverted index with a SQLite database, but this doesn't scale well enough.

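For context, a minimal sketch of the kind of inverted-index schema I mean (simplified and illustrative, not my exact schema): a terms table plus a postings table mapping each term to the byte offsets of the log lines containing it.

sqlite3 logindex.db "
  -- one row per distinct term
  CREATE TABLE terms    (term_id INTEGER PRIMARY KEY, term TEXT UNIQUE);
  -- one row per (term, byte offset of a log line containing it)
  CREATE TABLE postings (term_id INTEGER, line_offset INTEGER);
  CREATE INDEX postings_by_term ON postings(term_id);
"
sqlite3 logindex.db "
  SELECT p.line_offset
  FROM terms t JOIN postings p ON p.term_id = t.term_id
  WHERE t.term = 'timeout';
"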

Can anyone suggest a fairly simple implementation that would allow quick searching of this massive document?

I have looked at Solr and Lucene, but these look too complicated for a quick solution. I'm thinking a database with built-in full-text indexing (MySQL, Raven, Mongo, etc.) may be the simplest solution, but I have no experience with this.

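For reference, MySQL's built-in full-text indexing boils down to something like this (an untested sketch with made-up database, table and column names; FULLTEXT indexes need MyISAM or InnoDB on MySQL 5.6+):

mysql -u root -p -e "
  CREATE DATABASE IF NOT EXISTS logs;
  -- one row per log line, with a full-text index on the text column
  CREATE TABLE logs.log_lines (
    id   BIGINT AUTO_INCREMENT PRIMARY KEY,
    line TEXT,
    FULLTEXT KEY ft_line (line)
  );
  -- natural-language full-text search over the indexed column
  SELECT id, line FROM logs.log_lines
  WHERE MATCH(line) AGAINST('connection timeout');
"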

2 solutions

#1

Since you are looking at text processing for log files, I'd take a close look at the Elasticsearch / Logstash / Kibana (ELK) stack. Elasticsearch provides the Lucene-based text search. Logstash parses and loads the log file into Elasticsearch. And Kibana provides a visualization and query tool for searching and analyzing the data.

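Once Logstash has loaded the file, searching is just an HTTP call to Elasticsearch. A rough sketch (assuming Elasticsearch on localhost:9200, the default logstash-* index pattern and the default message field; adjust for your setup):

# full-text match on the message field, first 10 hits
curl -XGET 'http://localhost:9200/logstash-*/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "message": "connection timeout" } },
  "size": 10
}'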

This is a good webinar on the ELK stack by one of their trainers: http://www.elasticsearch.org/webinars/elk-stack-devops-environment/

As an experienced MongoDB, Solr and Elasticsearch user, I was impressed by how easy it was to get all three components up and running and analyzing log data. It also has a robust user community, both here on * and elsewhere.

You can download it here: http://www.elasticsearch.org/overview/elkdownloads/

#2

Convert the log file to CSV, then import the CSV into MySQL, MongoDB, etc.

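What the conversion looks like depends entirely on your log format, which you haven't shown. As a hedged sketch, assuming space-delimited lines such as "2014-02-12 10:15:01 ERROR something happened", awk can emit a header row (for --headerline below) plus date, time, level and message columns:

( echo 'date,time,level,message'
  awk 'BEGIN { OFS="," } {
    msg = $0
    sub(/^[^ ]+ +[^ ]+ +[^ ]+ +/, "", msg)   # drop the first three fields, keep the rest as the message
    gsub(/"/, "\"\"", msg)                   # escape embedded double quotes for CSV
    print $1, $2, $3, "\"" msg "\""
  }' big.log
) > collection.csv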

MongoDB:

For help:

mongoimport --help

JSON file:

mongoimport --db db --collection collection --file collection.json

CSV file:

mongoimport --db db --collection collection --type csv --headerline --file collection.csv

Use the “--ignoreBlanks” option to ignore blank fields. For CSV and TSV imports, this option provides the desired functionality in most cases: it avoids inserting blank fields in MongoDB documents.

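For example, the same CSV import with the flag added:

mongoimport --db db --collection collection --type csv --headerline --ignoreBlanks --file collection.csv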

Linked guides: mongoimport, mongoimport v2.2

Then define an index on the collection and enjoy :-)

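For example, a text index plus a $text query (a sketch assuming MongoDB 2.6+ and the hypothetical message column from the CSV sketch above; on older shells use ensureIndex instead of createIndex):

# create a text index on the message field
mongo db --eval 'db.collection.createIndex({ message: "text" })'
# run a full-text query and print the first 10 matches
mongo db --eval 'printjson(db.collection.find({ $text: { $search: "connection timeout" } }).limit(10).toArray())'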
