Fast text search across over 600,000 files

Time: 2023-01-15 07:25:47

I have a PHP application on a Linux server. It has a folder called notes_docs which contains over 600,000 txt files. The folder structure of notes_docs is as follows:

 - notes_docs
   - files_txt
     - 20170831
       - 1_837837472_abc_file.txt
       - 1_579374743_abc2_file.txt
       - 1_291838733_uridjdh.txt
       - 1_482737439_a8weele.txt
       - 1_733839474_dejsde.txt
     - 20170830
     - 20170829

I have to provide a fast text search utility which can show results in the browser. So if my user searches for "new york", all the files which contain "new york" should be returned in an array. If the user searches for "foo", all files containing "foo" should be returned.

I already tried code using scandir and DirectoryIterator, which is too slow: a search takes more than a minute, and even then it does not complete. I also tried Ubuntu's find, which was again slow, taking over a minute, because there are too many folder iterations and notes_docs is currently over 20 GB.

Any solution to make this faster is welcome. I can make design changes, or have my PHP code curl out to code in another language. In extreme cases I can make infrastructure changes too (such as using something in-memory).

I want to know how people in industry do this. Indeed, ZipRecruiter, and others all provide file search.

Please note I only have 2-4 GB of RAM, so keeping all the files loaded in RAM at all times is not acceptable.

EDIT: All the inputs below are great. For those who come later: we ended up using Lucene for indexing and text search, and it performed really well.

5 Answers

#1


22  

To keep it simple: there is no fast way to open, search, and close 600k documents every time you want to do a search. Your "over a minute" benchmarks are probably with a single test account. If you plan to expose this search on a multi-user website, you can quickly forget about it, because your disk IO will be off the charts and block your entire server.

So your only option is to index all the files, just as every other fast search utility does. Whether you use Solr or Elasticsearch as mentioned in the comments, or build something of your own, the files will have to be indexed.

Considering the txt files are text versions of pdf files you receive, I'm betting the easiest solution is to write the text to a database instead of a file. It won't take up much more disk space anyway.

Then you can enable full text search on your database (MySQL, MSSQL, and others support it), and I'm sure the response times will be a lot better. Keep in mind that creating these indexes does require storage space, but the same goes for other solutions.
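For example, with MySQL this could look roughly like the sketch below. It is illustrative only: it assumes the text has already been imported into a hypothetical notes table with path and content columns, and that the connection credentials are yours.

<?php
// Illustrative only: assumes a `notes` table with `path` and `content` columns.
$pdo = new PDO('mysql:host=localhost;dbname=notes;charset=utf8mb4', 'user', 'pass');

// One-time setup: a FULLTEXT index on the content column
// (supported by InnoDB since MySQL 5.6).
$pdo->exec("CREATE FULLTEXT INDEX idx_content ON notes (content)");

// Searching uses the index instead of scanning all 600k rows.
$stmt = $pdo->prepare(
    "SELECT path FROM notes WHERE MATCH(content) AGAINST (:q IN NATURAL LANGUAGE MODE)"
);
$stmt->execute([':q' => 'new york']);
$paths = $stmt->fetchAll(PDO::FETCH_COLUMN);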

Now if you really want to speed things up, you could try to parse the resumes on a more detailed level. Try to extract locations, education, spoken languages, and other information you regularly search for, and put them in separate tables/columns (a sketch follows the examples below). This is a very difficult task and almost a project on its own, but if you want a valuable search result, this is the way to go. Because searching in text without context gives very different results, just think of your example "new york":

  1. I live in New York
  2. I studied at New York University
  3. I love the song "new york" from Alicia Keys in a personal bio
  4. I worked for New York Pizza
  5. I was born in new yorkshire, UK
  6. I spent a summer breeding new yorkshire terriers.
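A minimal sketch of the structured approach, assuming hypothetical table and column names parsed out of each resume:

<?php
// Hypothetical schema: fields parsed out of each resume get their own
// columns, so a location search cannot match song titles or dog breeds.
$pdo = new PDO('mysql:host=localhost;dbname=notes;charset=utf8mb4', 'user', 'pass');
$pdo->exec("CREATE TABLE IF NOT EXISTS resumes (
    id INT AUTO_INCREMENT PRIMARY KEY,
    path VARCHAR(255) NOT NULL,
    location VARCHAR(128),
    education VARCHAR(255),
    languages VARCHAR(255)
)");

// 'new york' as a location now only matches people who actually live there.
$stmt = $pdo->prepare("SELECT path FROM resumes WHERE location = ?");
$stmt->execute(['new york']);
$paths = $stmt->fetchAll(PDO::FETCH_COLUMN);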

#2


12  

I won't go too deep, but I will attempt to provide guidelines for creating a proof-of-concept.

1

First, download and extract Elasticsearch from https://www.elastic.co/downloads/elasticsearch and then run it:

bin/elasticsearch

2

Download fscrawler from https://github.com/dadoonet/fscrawler#download-fscrawler, extract it, and run it:

bin/fscrawler myCustomJob

Then stop it (Ctrl-C) and edit the corresponding myCustomJob/_settings.json (it was created automatically and its path was printed on the console).
You can edit the properties: "url" (the path to be scanned), "update_rate" (you can make it 1m), "includes" (e.g. ["*.pdf","*.doc","*.txt"]), "index_content" (set it to false to index only the filename).
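For reference, the edited file might look roughly like this (the exact layout varies between fscrawler versions; in recent releases these keys live under the "fs" object, and the path shown is a placeholder):

{
  "name": "myCustomJob",
  "fs": {
    "url": "/path/to/notes_docs/files_txt",
    "update_rate": "1m",
    "includes": ["*.txt"],
    "index_content": false
  }
}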

Run again:

bin/fscrawler myCustomJob

Note: Indexing is something you might later want to perform using code, but for now it will be done automatically by fscrawler, which talks directly to Elasticsearch.

3

Now start adding files to the directory that you specified in the "url" property.

4

Download Advanced REST Client for Chrome and make the following POST:

URL: http://localhost:9200/_search

Raw payload:

{
  "query": { "wildcard": {"file.filename":"aFileNameToSearchFor*"} }
}

You will get back the list of matched files. Note: fscrawler indexes the filenames under the key file.filename.

5

Now, instead of using Advanced REST Client, you can use PHP to perform this query, either with a REST call to the URL above or by using the php-client API: https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/_search_operations.html
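As a sketch, the same search via PHP's curl extension could look like this (assuming Elasticsearch is running on the default local port, as in the steps above):

<?php
// Build the same wildcard query as above.
$payload = json_encode([
    'query' => ['wildcard' => ['file.filename' => 'aFileNameToSearchFor*']],
]);

$ch = curl_init('http://localhost:9200/_search');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = json_decode(curl_exec($ch), true);
curl_close($ch);

// Matched documents come back under hits.hits.
foreach ($response['hits']['hits'] as $hit) {
    echo $hit['_source']['file']['filename'], PHP_EOL;
}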

The same goes for indexing: https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/_indexing_documents.html
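A hedged sketch of indexing one document through the official elasticsearch-php client; the index name and document fields are illustrative, not fscrawler's exact mapping:

<?php
require 'vendor/autoload.php';

// Connects to localhost:9200 by default.
$client = Elasticsearch\ClientBuilder::create()->build();

$path = 'notes_docs/files_txt/20170831/1_837837472_abc_file.txt';
$client->index([
    'index' => 'notes',
    'body'  => [
        'file'    => ['filename' => basename($path)],
        'content' => file_get_contents($path),
    ],
]);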

#3


5  

If you want to save all the files' info into a database:

<?php
// Recursively scan $dir, collecting file info into $files and the
// generated INSERT statements into $query.
function deep_scandir($dir, &$query, &$files) {

    if(is_dir($dir)) {
        if ($dh = opendir($dir)) {
            while (($item = readdir($dh)) !== false) {
                if($item != '.' && $item != '..') {
                    if(is_dir($dir.'/'.$item)){
                        deep_scandir($dir.'/'.$item, $query, $files);
                    }else{
                        // File names look like 1_837837472_abc_file.txt:
                        // capture the numeric prefix and the name part.
                        preg_match("/(\d_\d+)_(.*)\.txt/i", $item, $matches);
                        if(!empty($matches)){
                            $no = $matches[1];
                            $str = $matches[2];
                            $files[$dir][$no] = $str;
                            // Crude escaping for the demo; use prepared statements
                            // in production. `key` is a reserved word in MySQL, so
                            // the column names are backtick-quoted.
                            $content = addcslashes( htmlspecialchars( file_get_contents($dir.'/'.$item) ), "\\\'\"" );
                            $query[] =  "INSERT INTO `mytable` (`id`, `key`, `value`, `path`, `content`)
                            VALUES\n(NULL, '$no', '$str', '$dir/$item', '$content');";
                        }
                    }
                }
            }
            closedir($dh);
        }
    }
}

echo '<pre>';
$dir = 'notes_docs/files_txt';
$query = [];
$files = [];
deep_scandir($dir, $query, $files);
print_r($files);
echo '<br>';
print_r($query);

Now you can execute every statement in the array. Note that the old mysql_* API was removed in PHP 7, so use mysqli (or PDO) instead:

$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb'); // credentials are illustrative
foreach($query as $no => $line){
    $mysqli->query($line) or trigger_error("Couldn't execute query no: '$no' [$line]");
}

Output:

Array
(
    [notes_docs/files_txt/20170831] => Array
        (
            [1_291838733] => uridjdh
            [1_482737439] => a8weele
            [1_579374743] => abc2_file
            [1_733839474] => dejsde
            [1_837837472] => abc_file
        )

)

Array
(
    [0] => INSERT INTO `mytable` (`id`, `key`, `value`, `path`, `content`)
                            VALUES
(NULL, '1_291838733', 'uridjdh', 'notes_docs/files_txt/20170831/1_291838733_uridjdh.txt', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus in nisl quis lectus sagittis ullamcorper at faucibus urna. Suspendisse tristique arcu sit amet ligula cursus pretium vitae eu elit. Nullam sed dolor ornare ex lobortis posuere. Quisque venenatis laoreet diam, in imperdiet arcu fermentum eu. Aenean molestie ligula id sem ultricies aliquet non a velit. Proin suscipit interdum vulputate. Nullam finibus gravida est, et fermentum est cursus eu. Integer sed metus ac urna molestie finibus. Aenean hendrerit ante quis diam ultrices pellentesque. Duis luctus turpis id ipsum dictum accumsan. Curabitur ornare nisi ligula, non pretium nulla venenatis sed. Aenean pharetra odio nec mi aliquam molestie. Fusce a condimentum nisl. Quisque mattis, nulla suscipit condimentum finibus, leo ex eleifend felis, vel efficitur eros turpis nec sem. ');
    [1] => INSERT INTO `mytable` (`id`, `key`, `value`, `path`, `content`)
                            VALUES
(NULL, '1_482737439', 'a8weele', 'notes_docs/files_txt/20170831/1_482737439_a8weele.txt', 'Nunc et odio sed odio rhoncus venenatis congue non nulla. Aliquam dictum, felis ac aliquam luctus, purus mi dignissim magna, vitae pharetra risus elit ac mi. Sed sodales dui semper commodo iaculis. Nunc vitae neque ut arcu gravida commodo. Fusce feugiat velit et felis pharetra posuere sit amet sit amet neque. Phasellus iaculis turpis odio, non consequat nunc consectetur a. Praesent ornare nisi non accumsan bibendum. Nunc vel ultricies enim, consectetur fermentum nisl. Sed eu augue ac massa efficitur ullamcorper. Ut hendrerit nisi arcu, a sagittis velit viverra ac. Quisque cursus nunc ac tincidunt sollicitudin. Cras eu rhoncus ante, ac varius velit. Mauris nibh lorem, viverra in porttitor at, interdum vel elit. Aliquam imperdiet lacus eu mi tincidunt volutpat. Vestibulum ut dolor risus. ');
    [2] => INSERT INTO `mytable` (`id`, `key`, `value`, `path`, `content`)
                            VALUES
(NULL, '1_579374743', 'abc2_file', 'notes_docs/files_txt/20170831/1_579374743_abc2_file.txt', 'Vivamus aliquet id elit vitae blandit. Proin laoreet ipsum sed tincidunt commodo. Fusce faucibus quam quam, in ornare ex fermentum et. Suspendisse dignissim, tortor at fringilla tempus, nibh lacus pretium metus, vel tempus dolor tellus ac orci. Vestibulum in congue dolor, nec porta elit. Donec pellentesque, neque sed commodo blandit, augue sapien dapibus arcu, sit amet hendrerit felis libero id ante. Praesent vitae elit at eros faucibus volutpat. Integer rutrum augue laoreet ex porta, ut faucibus elit accumsan. Donec in neque sagittis, auctor diam ac, viverra diam. Phasellus vel quam dolor. Nullam nisi tellus, faucibus a finibus et, blandit ac nisl. Vestibulum interdum malesuada sem, nec semper mi placerat quis. Nullam non bibendum sem, vitae elementum metus. Donec non ipsum quis turpis semper lobortis.');
    [3] => INSERT INTO `mytable` (`id`, `key`, `value`, `path`, `content`)
                            VALUES
(NULL, '1_733839474', 'dejsde', 'notes_docs/files_txt/20170831/1_733839474_dejsde.txt', 'Nunc faucibus, enim non luctus rutrum, lorem urna finibus turpis, sit amet dictum turpis ipsum pharetra ex. Donec at leo vitae massa consectetur viverra eget vel diam. Sed in neque tempor, vulputate quam sed, ullamcorper nisl. Fusce mollis libero in metus tincidunt interdum. Cras tempus porttitor nunc nec dapibus. Vestibulum condimentum, nisl eget venenatis tincidunt, nunc sem placerat dui, quis luctus nisl erat sed orci. Maecenas maximus finibus magna in facilisis. Maecenas maximus turpis eget dignissim fermentum. ');
    [4] => INSERT INTO `mytable` (`id`, `key`, `value`, `path`, `content`)
                            VALUES
(NULL, '1_837837472', 'abc_file', 'notes_docs/files_txt/20170831/1_837837472_abc_file.txt', 'Integer non ex condimentum, aliquet lectus id, accumsan nibh. Quisque aliquet, ante vitae convallis ullamcorper, velit diam tempus diam, et accumsan metus eros at tellus. Sed lacinia mauris sem, scelerisque efficitur mauris aliquam a. Nullam non auctor leo. In mattis mauris eu blandit varius. Phasellus interdum mi nec enim imperdiet tristique. In nec porttitor erat, tempor malesuada ante. Etiam scelerisque ligula et ex maximus, placerat consequat nunc consectetur. Phasellus suscipit ligula sed elit hendrerit laoreet. Suspendisse ex sem, placerat pharetra ligula eu, accumsan rhoncus ex. Sed luctus nisi vitae metus maximus scelerisque. Suspendisse porta nibh eget placerat tempus. Nunc efficitur gravida sagittis. ');
)
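Once the rows are in, a search can be a simple LIKE query over the content column. A sketch, reusing the $mysqli connection from above (note that LIKE '%...%' cannot use a normal index, so on 20 GB a FULLTEXT index as in answer #1 is strongly preferable):

<?php
// Find the paths of all files whose content mentions the search term.
$term = 'new york';
$stmt = $mysqli->prepare("SELECT `path` FROM `mytable` WHERE `content` LIKE CONCAT('%', ?, '%')");
$stmt->bind_param('s', $term);
$stmt->execute();
$result = $stmt->get_result();
while ($row = $result->fetch_assoc()) {
    echo $row['path'], PHP_EOL;
}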

#4


3  

I'd first try creating a simple database. The main table would have the file name and the text contents. This way you can leverage the querying features of the DB engine, and it will probably be a substantial leap in performance over the current solution.

This would be the simplest solution, and it could grow by adding complementary tables relating words with file names (this might be done dynamically, for example for the most frequent searches), as sketched below.
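A minimal sketch of such a complementary table, i.e. a hand-rolled inverted index. All names are illustrative, and multi-word queries would need to intersect the per-word results:

<?php
// One row per (word, file) pair; the primary key doubles as the lookup index.
$pdo = new PDO('mysql:host=localhost;dbname=notes;charset=utf8mb4', 'user', 'pass');
$pdo->exec("CREATE TABLE IF NOT EXISTS word_index (
    word VARCHAR(64) NOT NULL,
    path VARCHAR(255) NOT NULL,
    PRIMARY KEY (word, path)
)");

// Index one file: record each distinct word it contains.
$path = 'notes_docs/files_txt/20170831/1_837837472_abc_file.txt';
$insert = $pdo->prepare("INSERT IGNORE INTO word_index (word, path) VALUES (?, ?)");
foreach (array_unique(str_word_count(strtolower(file_get_contents($path)), 1)) as $word) {
    $insert->execute([$word, $path]);
}

// Lookup is now an indexed equality match instead of a full scan.
$stmt = $pdo->prepare("SELECT path FROM word_index WHERE word = ?");
$stmt->execute(['foo']);
$paths = $stmt->fetchAll(PDO::FETCH_COLUMN);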

#5


-2  

File system or database?
In order to work with a large amount of data you need to use some sort of DB. As Google does!

So should I use a relational database like MySQL?
In many cases (like yours) relational databases and SQL are not the right choice. Today most search engines and big-data analyzers use NoSQL databases.
NoSQL DBs can be much faster than SQL DBs, but they usually consume a lot of hardware resources (RAM, ...).

Your solution

  1. Read about NoSQL databases and compare them. You will also need full text search in your DB, so look into the full text search implementation in each candidate. Choose a NoSQL DB: Aerospike (the fastest one), Cassandra, MongoDB, ... (a sketch follows this list)
  2. Import all of your files' data into your DB.
  3. Choose a suitable, fast programming language (I recommend not using PHP).
  4. Design and build your search engine as a service.
  5. Connect your main program to your search engine.
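As a concrete (hedged) example of the full text search from step 1, here is roughly what it might look like with MongoDB and the official mongodb/mongodb PHP library; the database, collection, and field names are illustrative:

<?php
require 'vendor/autoload.php';

$collection = (new MongoDB\Client('mongodb://localhost:27017'))->notes->files;

// One-time setup: a text index over the content field.
$collection->createIndex(['content' => 'text']);

// One document per file: its path plus its text content.
$path = 'notes_docs/files_txt/20170831/1_837837472_abc_file.txt';
$collection->insertOne([
    'path'    => $path,
    'content' => file_get_contents($path),
]);

// $text queries use the index; quoting "new york" forces a phrase match.
foreach ($collection->find(['$text' => ['$search' => '"new york"']]) as $doc) {
    echo $doc['path'], PHP_EOL;
}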
