Extracting small relevant bits of text from full-text search results (as Google does)

Date: 2022-08-22 11:11:16

I have implemented full-text search in a discussion forum database, and I want to display the search results the way Google does. Even for a very long HTML page, only two or three lines of text are displayed in the search result list. Usually these are the lines that contain the search terms.

What would be a good algorithm for extracting a few lines of text, given the text itself and the search terms? I could think of something as simple as taking one line of text before the search term occurrence and one line after, but that seems too simple to work.

Would like to get a few directions, ideas and insights.

Thank you.

3 answers

#1


If you are looking for something fancier than the 'line before/after' approach, a summarizer might do the trick.

Here's a Naive Bayes based system: http://classifier4j.sourceforge.net/

Bayes is the statistical system used by many spam filters - I researched Bayes summarizers a few years back, and found that they do a pretty good job of summarizing text, as long as there is a decent amount of text to process. I haven't actually tried the above library, though, so your mileage may vary.

#2


Have you tried the "line before/after search term occurrence" approach in code, to see whether the results are good enough for that small coding investment? It might already be enough.
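
To make that concrete, here is a minimal sketch of the line before/after idea in Python. The function name `context_lines` and its parameters are made up for illustration, and it assumes the post has already been reduced to plain text with one sentence or paragraph per line.

```python
def context_lines(text, term, before=1, after=1):
    """Return the lines around the first line that contains `term`.

    Matching is a case-insensitive substring test; blank lines are ignored.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    needle = term.lower()
    for i, line in enumerate(lines):
        if needle in line.lower():
            start = max(0, i - before)          # do not run off the top
            return lines[start:i + after + 1]   # slicing clamps at the end
    return []  # term not found


post = "First line.\nSecond line mentions snippets.\nThird line.\nFourth line."
print(context_lines(post, "snippets"))
```

If the lines it returns look reasonable for your forum posts, you may not need anything more elaborate.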

Otherwise, you could go for pieces of sentences: don't split on lines, but on newlines, full stops, commas, spaced-out hyphens, etc. Then show the pieces that contain the search terms. You could separate the matching sentence pieces with "..." or something similar.
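
A rough sketch of that splitting step in Python, using only the standard `re` module. The boundary pattern and the names `PIECE_BOUNDARY`, `matching_pieces` and `snippet` are my own assumptions, so adjust the pattern to whatever punctuation your posts actually contain.

```python
import re

# Boundaries: newlines, sentence-ending punctuation followed by whitespace
# (or end of text), commas followed by whitespace, and spaced-out hyphens.
PIECE_BOUNDARY = re.compile(r"\n+|[.!?]+(?:\s+|$)|,\s+|\s+-\s+")


def matching_pieces(text, terms):
    """Split `text` into pieces and keep those containing any search term."""
    pieces = [p.strip() for p in PIECE_BOUNDARY.split(text) if p.strip()]
    wanted = [t.lower() for t in terms]
    return [p for p in pieces if any(t in p.lower() for t in wanted)]


def snippet(text, terms, separator=" ... "):
    """Join the matching pieces with an ellipsis-style separator."""
    return separator.join(matching_pieces(text, terms))


post = (
    "Full text search is built in, but results are long. "
    "Snippets help - they show the search term in context.\n"
    "Unrelated closing line."
)
print(snippet(post, ["search"]))
```

Requiring whitespace after a full stop or comma keeps things like "3.5" or "1,000" in one piece.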

If you get a lot of these pieces, you could try to prioritize them, sort by descending priority and only show the first n. And/or cut the pieces down to just the search term and a couple of words around it.
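
Here is one possible way to do the prioritizing and trimming, again only as a sketch. The scoring rule (count how many distinct search terms a piece contains) is an assumption on my part, as are the names `score`, `top_pieces` and `trim_around`; any other priority function would slot in the same way.

```python
def score(piece, terms):
    """Priority of a piece: how many distinct search terms it contains."""
    lowered = piece.lower()
    return sum(1 for t in terms if t.lower() in lowered)


def top_pieces(pieces, terms, n=3):
    """Return the `n` highest-priority pieces that match at least one term."""
    scored = [(score(p, terms), i, p) for i, p in enumerate(pieces)]
    matching = [s for s in scored if s[0] > 0]
    # Highest score first; ties keep their original document order.
    matching.sort(key=lambda s: (-s[0], s[1]))
    return [p for _, _, p in matching[:n]]


def trim_around(piece, term, radius=3):
    """Keep `radius` words on each side of the first word containing `term`."""
    words = piece.split()
    needle = term.lower()
    for i, word in enumerate(words):
        if needle in word.lower():
            start = max(0, i - radius)
            end = min(len(words), i + radius + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return piece  # term not present, leave the piece untouched


pieces = [
    "the forum has a search box",
    "nothing relevant here",
    "full text search returns long forum posts",
]
print(top_pieces(pieces, ["search", "forum"], n=1))
print(trim_around("one two three four search six seven eight nine", "search", radius=2))
```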

Just a couple of informal ideas that might get you started?

#3


Concentrate on the beginning of the content. Think of where you would look when you visit a blog. The opening paragraph tells you whether the article is heading in the right direction, so it makes sense for your algorithm to reflect this.

Check for occurrences of the search term in headings (H1, H2, etc.) and give them more priority.
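
As a sketch of how the two hints (early content and headings) could feed into a priority score: the `Candidate` class, the `priority` function and the boost values below are all hypothetical, and the code assumes you have already extracted the pieces and know which ones came from heading elements.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    index: int        # position of the piece in the document, 0 = first
    in_heading: bool  # True if the piece came from an H1/H2/... element


def priority(c, terms, total, heading_boost=2.0, position_boost=1.0):
    """Score a piece: term hits, plus boosts for headings and early position.

    The boost values are arbitrary starting points, not tuned numbers.
    """
    lowered = c.text.lower()
    hits = sum(1 for t in terms if t.lower() in lowered)
    if hits == 0:
        return 0.0  # pieces without any search term are never shown
    score = float(hits)
    if c.in_heading:
        score += heading_boost
    # The first piece gets the full position boost; later pieces get less.
    score += position_boost * (1 - c.index / max(total, 1))
    return score


candidates = [
    Candidate("Search tips", 0, True),
    Candidate("Intro about search", 1, False),
    Candidate("Late mention of search", 9, False),
    Candidate("Nothing relevant", 2, False),
]
ranked = sorted(candidates, key=lambda c: priority(c, ["search"], total=10), reverse=True)
print([c.text for c in ranked[:2]])
```

You would need to experiment with the boosts on real posts to see what reads best.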

This should get you started.
