How do search engines find relevant content?

How does Google find relevant content when it's parsing the web?

Let's say, for instance, Google uses the PHP native DOM library to parse content. What methods would it use to find the most relevant content on a web page?

My thought is that it would search for all paragraphs, order them by length, and then, from possible search strings and query params, work out how relevant each paragraph is as a percentage.

Let's say we had this URL:

http://domain.tld/posts/*-dominates-the-world-wide-web.html

From that URL I would work out that the HTML file name is highly relevant, so I would then see how closely that string matches each of the paragraphs in the page.

A really good example of this is Facebook share. When you share a page, Facebook's bot quickly fetches the link and brings back images, content, and so on.

I was thinking that some sort of calculated method would be best, working out the percentage of relevancy from surrounding elements and metadata.

Are there any books or information on best practices for content parsing that cover how to get the best content from a site? Any algorithms that are discussed, or any in-depth reply, would help.

Some ideas that I have in mind are:

Find all paragraphs and order by plain text length
Somehow find the width and height of div containers and order by (W+H) - @Benoit
Check meta keywords, title, and description, and check their relevancy within the paragraphs
Find all image tags and order by size and by the number of nodes between each image and the main paragraph
Check for object data, such as videos, and count the nodes between it and the largest paragraph / content div
Work out resemblances to previously parsed pages
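
The first idea in the list can be sketched in a few lines. This is only an illustration, written in Python with the standard library's `html.parser` rather than the PHP DOM library mentioned in the question; the names `ParagraphCollector` and `paragraphs_by_length` are made up for the example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the plain text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._in_p = False
        self._buffer = []

    def _flush(self):
        # Collapse whitespace and keep the paragraph if it is not empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # tolerate a missing </p> before a new <p>
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def paragraphs_by_length(html):
    """Return the text of all paragraphs, longest first."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()         # pick up a trailing unclosed paragraph
    return sorted(collector.paragraphs, key=len, reverse=True)
```

The other ideas (meta tags, images, node distance) would need a real DOM tree, so they are left out of this sketch.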

The reason why I need this information:

I'm building a website where webmasters send us links and we list their pages. I want the webmaster to submit a link, and then I crawl that page to find the following information:

An image (if applicable)
A < 255 paragraph from the best slice of text
Keywords that would be used for our search engine (Stack Overflow style)
Metadata: keywords, description, all images, change-log (for moderation and administration purposes)

I hope you can understand that this is not for a search engine, but the way search engines tackle content discovery applies to what I need.

I'm not asking for trade secrets; I'm asking what your personal approach to this would be.

13 answers

#1

This is a very general question but a very nice topic! Definitely upvoted :) However, I am not satisfied with the answers provided so far, so I decided to write a rather lengthy answer.

The reason I am not satisfied is that the answers are basically all true (I especially like kovshenin's answer (+1), which is very graph-theory related...), but they are all either too specific about certain factors or too general.

It's like asking how to bake a cake and getting the following answers:

You make a cake and you put it in the oven.
You definitely need sugar in it!
What is a cake?
The cake is a lie!

You won't be satisfied because you want to know what makes a good cake. And of course there are a lot of recipes.

Of course Google is the most important player, but, depending on the use case, a search engine might include very different factors or weight them differently.

For example, a search engine for discovering new independent music artists may penalize artists' websites that have a lot of external links pointing in.

A mainstream search engine will probably do the exact opposite to provide you with "relevant results".

There are (as already said) over 200 factors that Google has acknowledged, so webmasters know how to optimize their websites. In Google's case there are very likely many more that the public is not aware of.

But in the very broad and abstract field of SEO, you can generally break the important factors into two groups:

How well does the answer match the question? Or: how well does the page's content match the search terms?
How popular/good is the answer? Or: what's the PageRank?

In both cases the important thing is that I am not talking about whole websites or domains; I am talking about single pages with a unique URL.

It's also important that PageRank doesn't represent all factors, only the ones that Google categorizes as popularity. And by "good" I mean other factors that have nothing to do with popularity.

In Google's case, the official statement is that they want to give relevant results to the user, meaning that all algorithms are optimized towards what the user wants.

So after this long introduction (glad you are still with me...) I will give you a list of factors that I consider very important (at the moment):

Category 1 (how well does the answer match the question?)

You will notice that a lot comes down to the structure of the document!

The page primarily deals with the exact question.

Meaning: the question words appear in the page's title text or in headings and paragraphs. The position of these keywords matters too: the earlier in the page, the better. Repeating them often helps as well (unless it is overdone, which is known as keyword stuffing).

The whole website deals with the topic (keywords appear in the domain/subdomain).
The words are an important topic in this page (internal links' anchor texts jump to positions of the keyword, or anchor texts / link texts contain the keyword).
The same goes if external links use the keywords in their link text to link to this page.

Category 2 (how important/popular is the page?)

You will notice that not all factors point towards this exact goal. Some are included (especially by Google) just to give a boost to pages that... well... just deserved/earned it.

Content is king!

The existence of unique content that can't be found, or is found only rarely, in the rest of the web gives a boost. This is mostly measured by unordered combinations of words on a website that are generally used very little (important words). But there are much more sophisticated methods as well.

Recency - newer is better
Historical change (how often the page has been updated in the past; changing is good)
External link popularity (how many links in?)

If a page links to another page, the link is worth more if the linking page itself has a high PageRank.

External link diversity

Basically links from different root domains, but other factors play a role too, even factors like how far apart the web servers of the linking sites are geographically (according to their IP addresses).

Trust Rank

For example, if big, trusted, established sites with editorial content link to you, you get a trust rank. That's why a link from The New York Times is worth much more than one from some strange new website, even if the latter's PageRank is higher!

Domain trust

Your whole website gives a boost to your content if your domain is trusted. Different factors count here: links from trusted sites to your domain, of course, but it even helps if you are in the same datacenter as important websites.

Topic-specific links in

If websites that can be resolved to a topic link to you, and the query can be resolved to this topic as well, that's good.

Distribution of links in over time

If you earned a lot of links in over a short period of time, this will do you good at that time and in the near future afterwards, but not so much later on. If you earn links slowly and steadily, it will do you good for content that is "timeless".

Links from restricted domains

A link from a .gov domain is worth a lot.

User click behaviour

What's the click rate of your search result?

Time spent on site

Google Analytics tracking, etc. It's also tracked whether the user clicks back or clicks another result after opening yours.

Collected user data

Votes, ratings, etc., references in Gmail, etc.

Now I will introduce a third category. One or two points from above would go into this category, but I hadn't thought of that... The category is:

Category 3 (how important/good is your website in general?)

All your pages will be ranked up a bit depending on the quality of your website.

Factors include:

Good site architecture (easy to navigate, structured, sitemaps, etc.)
How established it is (long-existing domains are worth more)
Hoster information (what other websites are hosted near you?)
Search frequency of your exact name

Last, but not least, I want to say that a lot of these factors can be enriched by semantic technology, and new ones can be introduced.

For example, someone may search for Titanic and you have a website about icebergs... the two can be correlated, and that may be reflected in the results.

Then there are newly introduced semantic identifiers. For example, OWL tags may have a huge impact in the future.

For example, a blog about the movie Titanic could put a marker on its page saying that it has the same content as the Wikipedia article about the same movie.

This kind of linking is currently under heavy development and nobody knows how it will be used.

Maybe duplicate content is filtered, and only the most important page with the same content is displayed? Or maybe the other way round: you get presented with a lot of pages that match your query, even if they don't contain your keywords?

Google even weights factors differently depending on the topic of your search query!

#2

Tricky, but I'll take a stab:

An image (if applicable)

The first image on the page
The image with a name that includes the letters "logo"
The image that renders closest to the top-left (or top-right)
The image that appears most often on other pages of the site
An image smaller than some maximum dimensions

A < 255 paragraph from the best slice of text

Contents of the title tag
Contents of the meta description tag
Contents of the first h1 tag
Contents of the first p tag

Keywords that would be used for our search engine (Stack Overflow style)

Substring of the domain name
Substring of the URL
Substring of the title tag
Proximity between the term and the most common word on the page, and between the term and the top of the page

Metadata: keywords, description, all images, change-log (for moderation and administration purposes)

Ack! Gag! Syntax Error.

#3

I don't work at Google, but around a year ago I read that they had over 200 factors for ranking their search results. Of course the top factor would be relevance, so your question is quite interesting in that sense.

What is relevance and how do you calculate it? There are several algorithms, and I bet Google has its own, but the ones I'm aware of are Pearson correlation and Euclidean distance.
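
For reference, here is roughly what those two measures look like when applied to two equal-length lists of numbers (for example, word counts for the same set of terms). It is a plain-Python sketch; the function names are mine, and turning the distance into a similarity with `1 / (1 + distance)` is just one common convention.

```python
from math import sqrt


def euclidean_similarity(a, b):
    """Map Euclidean distance to (0, 1]; 1.0 means the vectors are identical."""
    distance = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1 / (1 + distance)


def pearson_correlation(a, b):
    """Pearson correlation in [-1, 1]; returns 0.0 if either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return covariance / sqrt(var_a * var_b)
```

Pearson ignores differences in scale (a document that is simply twice as long still correlates perfectly), while the Euclidean version does not.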

A good book I'd suggest on this topic (not necessarily about search engines) is Programming Collective Intelligence by Toby Segaran (O'Reilly). A few samples from the book show how to fetch data from third-party websites via APIs or screen scraping and how to find similar entries, which is quite nice.

Anyway, back to Google. Other relevance techniques include, of course, full-text searching, and you may want to get a good book on MySQL or Sphinx for that matter. @Chaoley suggested TSEP, which is also quite interesting.

But really, I know people from a Russian search engine called Yandex here, and everything they do is under NDA, so I guess you can get close, but you cannot get perfect unless you work at Google ;)

Cheers.

#4

Actually answering your question (and not just talking generally about search engines):

I believe doing something like what Instapaper does would be the best option.

The logic behind Instapaper (I didn't create it, so I certainly don't know its inner workings, but it's pretty easy to predict how it works):

1. Find the biggest bunch of text in text-like elements (relying on paragraph tags, while very elegant, won't work with those crappy sites that use divs instead of ps). Basically, you need to find a good balance between block elements (divs, ps, etc.) and the amount of text. Come up with some threshold: if X number of words stays undivided by markup, that text belongs to the main body text. Then expand to siblings, keeping a text / markup threshold of some sort.
2. Once you do the most difficult part (finding what text belongs to the actual article) the rest becomes pretty easy. You can find the first image around that text and use it as your thumbnail. This way you will avoid ads, because they will not be that close to the body text markup-wise.
3. Finally, coming up with keywords is the fun part. You can do tons of things: order words by frequency, remove noise (ands, ors and so on) and you have something nice. Mix that with a "prominent short text element above the detected body text area" (i.e. your article's heading), the page title, and meta tags, and you have something pretty tasty.
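
The keyword step could look something like the sketch below. It is Python rather than PHP, and the stopword list, the `heading_boost` weight, and the function name are all placeholders I picked for illustration; a real list of noise words would be much longer.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption: a real one would be far larger).
STOPWORDS = {"a", "an", "and", "or", "the", "of", "to", "in", "is", "it",
             "for", "on", "with"}


def words(text):
    """Lowercase the text and keep alphabetic words that are not noise."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS and len(w) > 2]


def top_keywords(body_text, heading="", limit=5, heading_boost=3):
    """Rank words by frequency, giving extra weight to words in the heading."""
    counts = Counter(words(body_text))
    for word in words(heading):
        counts[word] += heading_boost
    return [word for word, _ in counts.most_common(limit)]
```

For example, `top_keywords("The cat and the dog. The cat sleeps.", limit=1)` picks `cat`, and passing `heading="Dog"` with a short body is enough to push `dog` to the top.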

All these ideas, if implemented properly, will be very bulletproof, because they do not rely on semantic markup: by making your code complex you ensure that even very sloppily coded websites will be detected properly.

Of course, this comes with the downside of poorer performance, but I guess it shouldn't be that poor.

Tip: for large-scale websites, to which people link very often, you can manually set the HTML element that contains the body text (the one I described in point #1). This will ensure correctness and speed things up.

Hope this helps a bit.

#5

Most search engines look for the title and meta description in the head of the document, then the h1 heading and text content in the body. Image alt attributes and link titles are also considered. Last I read, Yahoo was using the meta keywords tag, but most don't.

You might want to download the open source files from The Search Engine Project (TSEP) on SourceForge, https://sourceforge.net/projects/tsep/, and have a look at how they do it.

#6

I'd just grab the first 'paragraph' of text. The way most people write stories/problems/whatever is that they first state the most important thing, and then elaborate. If you look at any random text, you can see this holds most of the time.

For example, you do it yourself in your original question. If you take the first three sentences of your original question, you have a pretty good summary of what you are trying to do.

And I just did it myself too: the gist of my comment is summarized in the first paragraph. The rest is just examples and elaborations. If you're not convinced, take a look at a few recent articles I semi-randomly picked from Google News. OK, that last one was not semi-random, I admit ;)

Anyway, I think that this is a really simple approach that works most of the time. You can always look at meta descriptions, titles and keywords, but if they aren't there, this might be an option.

Hope this helps.

#7

There are lots of highly sophisticated algorithms for extracting the relevant content from tag soup. If you're looking to build something usable yourself, you could take a look at the source code for Readability and port it over to PHP. I did something similar recently (can't share the code, unfortunately).

The basic logic of Readability is to find all block-level tags and count the length of text in them, not counting children. Then each parent node is awarded a fraction (half) of the weight of each of its children. This is used to find the block-level tag that has the largest amount of plain text. From here, the content is further cleaned up.
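
To make that description concrete, here is a rough Python sketch of the scoring idea as I described it, not a port of the actual Readability source. The set of block tags, the `Node` class, and the choice to pass half of a child's total score up to its parent recursively are my own assumptions.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"body", "div", "p", "article", "section", "td", "blockquote", "pre"}
SKIP_TAGS = {"script", "style"}


class Node:
    """A block-level element in a simplified tree (hypothetical helper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_text_length = 0   # text directly inside, excluding block children

    def score(self):
        # Own text plus half of the score of each block-level child.
        return self.own_text_length + 0.5 * sum(c.score() for c in self.children)


class BlockTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.nodes = []
        self._skip = 0             # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            node = Node(tag, attrs, self.current)
            self.current.children.append(node)
            self.nodes.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            # Walk up to the nearest open element with this tag, if any.
            node = self.current
            while node is not self.root and node.tag != tag:
                node = node.parent
            if node is not self.root:
                self.current = node.parent

    def handle_data(self, data):
        if not self._skip:
            self.current.own_text_length += len(data.strip())


def best_block(html):
    """Return the highest-scoring block node, or None if there are no blocks."""
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    return max(builder.nodes, key=lambda n: n.score(), default=None)
```

Note that with the half weighting, a container only beats its own paragraphs once it holds enough of them: a div with three 100-character paragraphs scores 150, but with only two it scores 100 and ties with each paragraph.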

It's not bulletproof by any means, but it works well in the majority of cases.

#8

I would consider these when building the code:

Check for synonyms and acronyms
Apply OCR on images to search them as text (ABBYY FineReader and Recostar are nice; Tesseract is free and fine, though not as fine as FineReader :) )
Weight fonts as well (size, boldness, underline, color)
Weight content depending on its place on the page (e.g. content towards the top of the page is more relevant)

Also:

Optional text requested from the webmaster to define the page

You can also check whether you can find anything useful in the Google Search API: http://code.google.com/intl/tr/apis/ajaxsearch/

#9

I'm facing the same problem right now, and after some tries I found something that works for creating a webpage snippet (it must be fine-tuned):

Take all the HTML
Remove script and style tags inside the body WITH THEIR CONTENT (important)
Remove unnecessary spaces, tabs, and newlines
Now navigate through the DOM to catch div, p, article, td (others?) and, for each one:
    take the HTML of the current element
    take a "text only" version of the element content
    assign the element the score: text length * text length / HTML length
Now sort all the scores and take the greatest

This is a quick (and dirty) way to identify the longest texts with a relatively low proportion of markup, which is what normal content looks like. In my tests this seems to work really well. Just add water ;)
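
Here is a small Python sketch of that scoring, using only the standard library. The class and function names are invented for the example, and I am assuming "the html of the current element" means its inner HTML; script and style content is ignored entirely, as in the steps above.

```python
from html.parser import HTMLParser

CANDIDATES = {"div", "p", "article", "td"}
SKIP = {"script", "style"}


class SnippetScorer(HTMLParser):
    """Scores candidate elements by text_length ** 2 / html_length."""

    def __init__(self):
        super().__init__()
        self.open = []       # stack of [tag, text_parts, inner_html_length]
        self.results = []    # (score, text) for every finished candidate
        self._skip = 0

    def _add_markup(self, length):
        for entry in self.open:
            entry[2] += length

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip += 1
            return
        self._add_markup(len(self.get_starttag_text()))
        if tag in CANDIDATES:
            self.open.append([tag, [], 0])

    def handle_endtag(self, tag):
        if tag in SKIP:
            self._skip = max(0, self._skip - 1)
            return
        if tag in CANDIDATES and any(e[0] == tag for e in self.open):
            # Close everything up to and including the matching element.
            while self.open:
                entry = self.open.pop()
                self._finish(entry)
                if entry[0] == tag:
                    break
        self._add_markup(len(tag) + 3)   # length of "</tag>"

    def handle_data(self, data):
        if self._skip:
            return
        for entry in self.open:
            entry[1].append(data)
            entry[2] += len(data)

    def _finish(self, entry):
        _, parts, html_length = entry
        text = " ".join("".join(parts).split())
        if html_length:
            self.results.append((len(text) ** 2 / html_length, text))


def best_snippet(html):
    """Return the text of the highest-scoring element, or '' if none qualifies."""
    scorer = SnippetScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.results:
        return ""
    return max(scorer.results, key=lambda r: r[0])[1]
```

A navigation div full of links gets a low score because its markup is long compared to its text, while a plain paragraph of prose scores close to its own length.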

In addition to this, you can look at "og:" meta tags, title and description, h1, and a lot of other minor techniques.

#10

Google for 'web crawlers, robots, spiders, and intelligent agents'; you might try the terms separately as well to get individual results.

Web Crawler
User-Agents
Bots
Data/Screen Scraping

What I think you're looking for is screen scraping (with DOM), which Stack Overflow has a ton of Q&A on.

#11

Google also uses a system called PageRank, where it examines how many links to a page there are. Let's say that you're looking for a C++ tutorial, and you search Google for one. You find one as the top result, and it's a great tutorial. Google knows this because it searched through its cache of the web and saw that everyone was linking to this tutorial while raving about how good it was. Google decides that it's a good tutorial and puts it as the top result.

It actually does that as it caches everything, giving each page a PageRank, as said before, based on the links to it.
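
If you want to play with the idea, the textbook version of PageRank fits in a few lines of Python. This is the simplified, published form of the algorithm (power iteration with a damping factor), not whatever Google runs today; the `links` dictionary and page names are made up.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small base rank, then shares arrive via links.
        new_rank = {page: (1 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Three pages all link to the tutorial, so it ends up with the highest rank.
links = {"a": ["tutorial"], "b": ["tutorial"], "c": ["tutorial"], "tutorial": ["a"]}
ranks = pagerank(links)
```

Page "a" also ranks well here, because the one link it receives comes from the highly ranked tutorial.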

Hope this helps!

#12

To answer one of your questions, I am reading the following book right now, and I recommend it: Google's PageRank and Beyond, by Amy Langville and Carl Meyer.

Mildly mathematical. Uses some linear algebra in a graph-theoretic context, eigenanalysis, Markov models, etc. I enjoyed the parts that talk about iterative methods for solving linear equations. I had no idea Google employed these iterative methods.

Short book, just 200 pages. Contains "asides" that diverge from the main flow of the text, plus historical perspective. Also points to other recent ranking systems.

#13

There are some good answers on here, but it sounds like they don't answer your question. Perhaps this one will.

What you're looking for is called Information Retrieval.

It usually uses the bag-of-words model.

Say you have two documents:

DOCUMENT A
Seize the time, Meribor. Live now; make now always the most precious time. Now will never come again

and this one

DOCUMENT B
Worf, it was what it was glorious and wonderful and all that, but it doesn't mean anything

and you have a query, or something you want to find other relevant documents for

QUERY aka DOCUMENT C
precious wonderful life

Anyway, how do you calculate which of the two documents is the most "relevant"? Here's how:

Tokenize each document (break it into words, removing all non-letters)
Lowercase everything
Remove stopwords (and, the, etc.)
Consider stemming (removing the suffix; see the Porter or Snowball stemming algorithms)
Consider using n-grams

You can count the word frequency to get the "keywords".

Then you make one column for each word and calculate the word's importance to the document relative to its importance across all the documents. This is called the TF-IDF metric.
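
A bare-bones version of the tokenizing and TF-IDF steps might look like this in Python. It skips stemming and n-grams, uses a toy stopword list, and uses one common TF-IDF variant (term frequency times `log(N / document frequency)`); other weightings exist, and the function names are just for this sketch.

```python
import math
import re
from collections import Counter

# Toy stopword list for the example (assumption: real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "it", "was", "what", "but", "that", "all"}


def tokenize(text):
    """Lowercase, keep runs of letters only, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tf_idf(documents):
    """documents maps a name to its text; returns name -> {term: weight}."""
    tokens = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(documents)
    # Number of documents each term appears in.
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    weights = {}
    for name, words in tokens.items():
        counts = Counter(words)
        weights[name] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights
```

A word that appears in every document gets weight 0 under this variant, which is the point: it tells you nothing about which document is relevant.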

Now you have this:

Doc  precious  worf  life  ...
A    0.5       0.0   0.2
B    0.0       0.9   0.0
C    0.7       0.0   0.9

Then you calculate the similarity between the documents using the cosine similarity measure. The document with the highest similarity to DOCUMENT C is the most relevant.
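
Using the three columns from the table above as vectors, the cosine similarity step can be sketched like this (plain Python; the variable names are mine):

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Columns: precious, worf, life (values taken from the table above).
vectors = {"A": [0.5, 0.0, 0.2], "B": [0.0, 0.9, 0.0]}
query = [0.7, 0.0, 0.9]   # DOCUMENT C

# Rank the documents by similarity to the query, best match first.
ranked = sorted(vectors,
                key=lambda name: cosine_similarity(vectors[name], query),
                reverse=True)
```

With these numbers, document A comes out at roughly 0.86 and document B at 0, so A is the more relevant one for the query.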

Now, you seem to want to find the most similar paragraphs, so just call each paragraph a document, or consider using sliding windows over the document instead.

You can see my video here. It uses a graphical Java tool, but explains the concepts:

http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-part-4.html

Here is a decent IR book:

http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

#1