I don't want to use Lucene because I think it is too heavy.
Is there an easier way to implement this (for millions of records)?
3 solutions
#1
0
If you don't want to have to worry about performance, I recommend you take a look at Amazon Web Services' new CloudSearch service. It's fast and scales as your needs grow. It can also handle millions of documents without a problem and supports wildcard searches (e.g. quo* would retrieve Quora).
Check it out here.
#2
0
Obviously this isn't necessarily how it works at either Quora or Google, as I haven't had the pleasure of working at either... this is just how I'd go about doing it.
The first thing to obtain is a list of search terms. I'm assuming you don't want to know how this is done, as it will really depend on all sorts of things, but basically you're either going to do a SELECT DISTINCT title FROM pages (in the case of the autocomplete on Wikipedia) or something much more advanced in the case of Google's.
The next step is also pretty simple at a high level: you need to perform the query SELECT title FROM titles WHERE title LIKE 'Qu%' in the case of the user typing Qu into the search box. The list of titles is then returned to the browser as the response to some kind of Ajax request, perhaps in the form of JSON or similar. And you need to do it as fast as possible. That's where it becomes difficult.
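To make the logical request concrete, here is a minimal Python sketch using the standard library's sqlite3 module. The table name titles comes from the query above; the in-memory database, the helper names build_index and suggest, and the default limit of 10 are assumptions for illustration only, not anything Quora or Google are known to use.

```python
import json
import sqlite3


def build_index(titles):
    """Load titles into an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE titles (title TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO titles (title) VALUES (?)",
        [(t,) for t in titles],
    )
    conn.commit()
    return conn


def suggest(conn, typed, limit=10):
    """Return a JSON array of titles that start with what the user typed."""
    # Escape LIKE wildcards so a literal "%" or "_" in the input is matched as-is.
    escaped = (
        typed.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT title FROM titles WHERE title LIKE ? ESCAPE '\\' "
        "ORDER BY title LIMIT ?",
        (escaped + "%", limit),
    ).fetchall()
    # This string is what an Ajax endpoint would send back to the browser.
    return json.dumps([row[0] for row in rows])


if __name__ == "__main__":
    conn = build_index(["Quora", "Quantum", "Google", "Query"])
    print(suggest(conn, "Qu"))
```

Note that SQLite's LIKE is case-insensitive for ASCII letters by default, so typing qu finds the same titles as Qu. Other databases may behave differently.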
How do they do it so quickly? There are probably four things to bear in mind.
- They have LOTS of machines handling the requests. Bear in mind that Google's autocomplete is turned on by default and works in (almost?) all languages. That's a lot of searches against the autocomplete index. A lot more than there will be against the web index itself: for each web search request, Google will probably have processed 3 or 4 autocomplete requests.
- They're probably doing it in memory. Google is already known to store its web indexes in memory, so I would expect them to be doing the same with this.
- Specialised software (this is where it gets really interesting). While a traditional database or a NoSQL database could do this and do it quickly, I would expect the big players to actually be doing this with specialised code whose sole purpose is to provide autocomplete suggestions. The SQL statement I provided above was purely to demonstrate the logical request that would be needed. You're probably looking at some kind of specialised tree, such as a suffix tree, radix tree, or similar.
- Sharding. To cope with the quantity of data and the number of machines handling the requests, you're going to need to shard. That is, ensure that a certain subset of all the machines involved only processes requests that begin with one or more particular letters, e.g. a group of X machines processing searches that begin with a certain letter or even 2 letters. That means that you've got more machines, but they don't each have to have the whole index to hand. How does a particular group of machines get chosen? You're either routing once the request is in your data centre, or you could route on the client side (e.g. in your JavaScript, decide which IP to query based upon the first X letters of the search term).
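As a rough illustration of the last two points, here is a small Python sketch of an in-memory prefix tree (a plain trie, used here as the simplest member of the family of trees mentioned above) together with a first-letter routing rule. The names TrieNode, PrefixTrie and shard_for, and the modulo-based routing, are hypothetical choices for this example; nothing here is known to match what Google or Quora actually run.

```python
class TrieNode:
    """One node per character; `term` holds the original spelling if a term ends here."""

    __slots__ = ("children", "term")

    def __init__(self):
        self.children = {}
        self.term = None


class PrefixTrie:
    """Case-insensitive prefix index kept entirely in memory."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, term):
        node = self.root
        for ch in term.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.term = term

    def suggest(self, prefix, limit=10):
        # Walk down to the node that represents the typed prefix.
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        # Depth-first walk of the subtree, collecting complete terms.
        results = []
        stack = [node]
        while stack and len(results) < limit:
            current = stack.pop()
            if current.term is not None:
                results.append(current.term)
            # Push in reverse order so children are popped alphabetically.
            for ch in sorted(current.children, reverse=True):
                stack.append(current.children[ch])
        return results


def shard_for(prefix, shard_count):
    """Hypothetical routing rule: pick a shard from the first letter typed."""
    if not prefix:
        return 0
    return ord(prefix[0].lower()) % shard_count


if __name__ == "__main__":
    trie = PrefixTrie()
    for title in ["Quora", "Quote", "Query", "Google"]:
        trie.insert(title)
    print(trie.suggest("qu"), shard_for("qu", 4))
```

Routing on the first letter alone will give uneven shards, since some letters are far more common than others, so a real deployment would probably need a smarter mapping from prefixes to machine groups.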
So, that's how I would do it. Not having had the experience of the enormous datasets Google/Quora are dealing with, I'm sure there are things that I've not considered. But, it's a start.
And, here's how I have done it, purely in an experimental environment at home:
I had a simple list of a good few hundred thousand titles to search. These were loaded into a dedicated MongoDB collection, which had a single index defined on it. I then had a Play Framework controller in front of it and used jQuery's autocomplete plugin to do the search.
Obviously this is tiny compared with what you are looking for, but MongoDB should provide the same kind of performance for your dataset provided you follow the recommendations (i.e. good hardware, lots of RAM, keep the indexes in memory). In addition, Mongo supports sharding, and the Play Framework is shared-nothing, so adding new machines to cope with the load as your userbase grows would be straightforward.
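For illustration, a prefix lookup against such a collection could be expressed as an anchored regular expression, which MongoDB can generally serve from an ordinary index on the field when the match is case-sensitive. This is a sketch under assumptions: the field name title, the helper name prefix_filter, and the pymongo calls shown in the comments are illustrative, and the experiment described above may well have queried the collection differently.

```python
import re


def prefix_filter(prefix):
    """Build a MongoDB filter document that matches titles starting with `prefix`.

    The leading "^" anchors the match at the start of the string.
    re.escape keeps characters such as "+" or "." from being read as regex syntax.
    """
    return {"title": {"$regex": "^" + re.escape(prefix)}}


# With pymongo and a running server (not executed here), usage might look like:
#   collection.create_index("title")
#   cursor = (
#       collection.find(prefix_filter("Qu"), {"title": 1, "_id": 0})
#       .sort("title", 1)
#       .limit(10)
#   )

if __name__ == "__main__":
    print(prefix_filter("Qu"))
```

A case-insensitive search would need extra work, for example storing a lowercased copy of each title and querying that field instead.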
By the way, Mongo is by no means the only solution. Traditional SQL databases will be up to the job too, of course. I was just using Mongo for other reasons.
#3
0
First, for autocomplete you should aim to get the response back to the user in <= 100ms if you want something that appears fast. That should be your first concern. Any setup that can't do that probably won't be good enough for users. In my own tests in Firefox using Firebug, Google's autocomplete returned results in about 50ms and Quora's in about 65ms.
See, e.g.
http://*.com/questions/536300/what-is-the-shortest-perceivable-application-response-delay
Apparently, Quora uses prefix matching rather than full-text search, which makes it faster. To roll your own fast prefix-based autocomplete, which should be sufficient for many cases but won't handle things like misspellings (that would need fuzzy matching), try an in-memory data store like Redis. The details can be seen here:
http://charlesleifer.com/blog/powerful-autocomplete-with-redis-in-under-200-lines-of-python/
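The linked post's code is not reproduced here. As a rough sketch of the general idea only, the following Python keeps lowercased terms in a sorted list and finds a prefix range by binary search, which is similar in spirit to what a lexicographic range query on a Redis sorted set provides. The class name SortedTerms and its methods are hypothetical, and the plain list is a stand-in for the Redis data structure.

```python
import bisect


class SortedTerms:
    """In-memory stand-in for a lexicographically ordered set of terms."""

    def __init__(self, terms):
        # Lowercase so that matching is case-insensitive; the set removes duplicates.
        self._terms = sorted({t.lower() for t in terms})

    def by_prefix(self, prefix, limit=10):
        prefix = prefix.lower()
        # All terms sharing a prefix sit next to each other in sorted order,
        # starting at the first position that is >= the prefix itself.
        i = bisect.bisect_left(self._terms, prefix)
        results = []
        while (
            i < len(self._terms)
            and len(results) < limit
            and self._terms[i].startswith(prefix)
        ):
            results.append(self._terms[i])
            i += 1
        return results


# With redis-py and a running server (not executed here), a comparable lookup
# on a sorted set whose members all share the same score might be:
#   r.zrangebylex("titles", "[qu", "[qu\xff", start=0, num=10)

if __name__ == "__main__":
    index = SortedTerms(["Quora", "quote", "Query", "Google"])
    print(index.by_prefix("Qu"))
```

Like any pure prefix approach, this finds nothing for a misspelled prefix, which is the trade-off mentioned above.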
I haven't been able to get CloudSearch down to the low latencies of Google and Quora that I cited, regardless of how simple the search query is. I measured 95-125ms when fetching from the endpoint directly in the browser (as measured by Firebug), and 20-30ms more when accessing the endpoint via cURL in PHP. An Elasticsearch cluster is a bit faster. These observations obviously depend on the use case and probably don't generalize well, but they are something to think about.