Search for a term on amazon.com, for example "stack overflow", and the search results come back very quickly.
On the left-hand side of the window there is a faceted search that shows, for certain categories, the number of products that match the term.
You can then drill into those categories. For example, 1094 books match the term, broken down into Computers & Internet (1003), Science, etc.
Given that the search for books covers the contents of some of those books, it strikes me that this is a very impressive feat.
How does Amazon do this? Massive parallelization, e.g. each node knowing about only a few products?
Incidentally, I saw that "stack overflow" appears in the text of "Soul of a New Machine", a book I remember from 1981.
2 answers
#1
The short answer is, a lot of indexing. The longer answer is, a lot of indexing, a lot of redundancy, a lot of caching, and smart partitioning.
The real answer is -- read this book: http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
(It's free, and it's very good).
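To make "indexing" and "partitioning" a bit more concrete, here is a minimal sketch in Python. It is not how Amazon actually does it; the `Shard` class, the modulo partitioning, the four sample products and the plain AND-of-terms matching are all assumptions made for illustration. Each shard keeps an inverted index (term -> product ids) for its own few products, and the facet counts are obtained by asking every shard and summing the per-category counts.

```python
from collections import Counter, defaultdict


class Shard:
    """One partition of the catalog with its own inverted index (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of products containing it
        self.category = {}                # product id -> category (the facet)

    def add(self, product_id, category, text):
        self.category[product_id] = category
        for term in text.lower().split():
            self.postings[term].add(product_id)

    def search(self, query):
        """Return ids of products that contain every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        # Intersect starting from the shortest posting list to keep the work small.
        lists = sorted((self.postings.get(t, set()) for t in terms), key=len)
        hits = set(lists[0])
        for plist in lists[1:]:
            hits &= plist
        return hits

    def facet_counts(self, query):
        """Count this shard's matching products per category."""
        return Counter(self.category[pid] for pid in self.search(query))


def build_shards(products, num_shards):
    """Assign each product to a shard by id (an assumed, very naive scheme)."""
    shards = [Shard() for _ in range(num_shards)]
    for product_id, category, text in products:
        shards[product_id % num_shards].add(product_id, category, text)
    return shards


def faceted_search(shards, query):
    """Ask every shard, then merge the per-category counts."""
    total = Counter()
    for shard in shards:  # a real system would issue these calls in parallel
        total += shard.facet_counts(query)
    return total


# Made-up sample data, only to exercise the sketch.
products = [
    (1, "Computers & Internet", "debugging a stack overflow in C"),
    (2, "Computers & Internet", "stack and heap overflow exploits"),
    (3, "Science", "fluid overflow in a stack of plates"),
    (4, "Cooking", "a stack of pancakes"),
]
shards = build_shards(products, num_shards=2)
counts = faceted_search(shards, "stack overflow")
print(dict(counts))
```

No shard ever scans product text at query time; it only intersects precomputed posting lists, which is why adding more shards (and replicas of them, for redundancy) keeps the lookup fast.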
#2
Well, there is parallelization, but one thing everyone does on the backend of systems like this is run the slow processes (like semantic parsing of book contents) ahead of time and put a fast lookup on top of the output. They are literally caching the search results in some large databases, so that all they have to do at query time is a db lookup for your search. Perhaps I misunderstood the question, but it's similar to what Google does. You don't think their spiders scour the web for your sites when you enter a search term, right?
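A rough sketch of that split, in Python. Everything here is hypothetical: `slow_analysis` stands in for whatever expensive processing is really done, the two books are made up, and a plain dict stands in for the "large databases". The slow step runs once in an offline job; the query path only does lookups and remembers results it has already computed.

```python
import time
from collections import defaultdict


def slow_analysis(text):
    """Stand-in for an expensive step such as parsing a book's full text."""
    time.sleep(0.01)  # pretend this is slow
    return set(text.lower().split())


def build_index(books):
    """Offline job: run the slow step once per book and store term -> titles."""
    index = defaultdict(list)
    for title, text in books.items():
        for term in slow_analysis(text):
            index[term].append(title)
    return index


class SearchFrontend:
    """Query path: index lookups plus a cache of finished result lists."""

    def __init__(self, index):
        self.index = index
        self.result_cache = {}  # normalized query -> sorted list of titles

    def search(self, query):
        # Normalize so that "Stack overflow" and "overflow stack" share a cache entry.
        key = " ".join(sorted(set(query.lower().split())))
        if key not in self.result_cache:
            title_sets = [set(self.index.get(t, ())) for t in key.split()]
            hits = set.intersection(*title_sets) if title_sets else set()
            self.result_cache[key] = sorted(hits)
        return self.result_cache[key]


# Made-up sample data, only to exercise the sketch.
books = {
    "Book A": "the stack overflow was traced to a recursive call",
    "Book B": "a stack of paper on the desk",
}
frontend = SearchFrontend(build_index(books))
first = frontend.search("stack overflow")   # computed from the index, then cached
second = frontend.search("Overflow stack")  # served straight from the cache
print(first, second)
```

The first call pays for a couple of dictionary lookups and a set intersection; the second is a single dictionary hit. Neither one touches the book text, which was only read during `build_index`.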