How can I extract only the main textual content from an HTML page?

Time: 2023-02-15 21:15:14

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't contain an article, only links with short descriptions pointing to the full texts (this is common on news portals), and I don't want to discard these short texts.

So if an API can do this - return the distinct textual parts/blocks, keeping each one separated in some manner rather than merged into a single text (everything in only one text is not useful) - please report it.


The Question

I downloaded some pages from random sites, and now I want to analyze the textual content of those pages.

The problem is that a web page has a lot of extra content, such as menus, advertising, banners, etc.

I want to try to exclude everything that is not related to the main content of the page.

Taking this page as an example: I want neither the menus at the top nor the links in the footer.

Important: all pages are HTML and come from many different sites. I need suggestions on how to exclude this content.

At the moment, I am considering excluding content inside elements with "menu" and "banner" classes in the HTML, as well as runs of consecutive words that look like proper names (starting with a capital letter).

The solutions can be based either on the text content (without HTML tags) or on the HTML content (with the HTML tags).

Edit: I want to do this inside my Java code, not with an external application (if possible).

I tried parsing the HTML content in the way described in this question: https://*.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering

9 Answers

#1


22  

Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

There are a few ways to feed HTML into Boilerpipe and extract the text.

You can use a URL:

ArticleExtractor.INSTANCE.getText(url);

You can use a String:

ArticleExtractor.INSTANCE.getText(myHtml);

There are also options to use a Reader, which opens up a large number of possibilities.
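For completeness, here is a minimal, self-contained sketch combining these variants. It assumes the standard de.l3s.boilerpipe 1.2.x API; the URL and file name are placeholders of my own, not from the original answer:

import java.io.FileReader;
import java.io.Reader;
import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeDemo {
    public static void main(String[] args) throws Exception {
        // From a URL (Boilerpipe fetches and parses the page itself;
        // the address is a placeholder)
        String fromUrl = ArticleExtractor.INSTANCE.getText(
                new URL("https://example.com/some-article"));
        System.out.println(fromUrl);

        // From a Reader, e.g. a page you downloaded earlier
        // ("page.html" is a placeholder)
        try (Reader reader = new FileReader("page.html")) {
            System.out.println(ArticleExtractor.INSTANCE.getText(reader));
        }
    }
}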

#2


7  

You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).

Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:

import java.io.Reader;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ...; // e.g. a FileReader for a downloaded page
InputSource is = new InputSource(reader);

// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();

// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);

// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}

TextBlock has some more exciting methods; feel free to play around!

#3


5  

There appears to be a possible problem with Boilerpipe. Why? Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.

So one can crudely classify web pages into three kinds with respect to Boilerpipe:

  1. a web page with a single article in it (Boilerpipe worthy!)
  2. a web page with multiple articles in it, such as the front page of The New York Times
  3. a web page that really doesn't have any article in it, but has some link-related content, and may also have some degree of clutter.

Boilerpipe works on case #1. But if one is doing a lot of automated text processing, how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied to case #1. Case #2 is a problem, and case #3 is a problem as well - it might require an aggregate of related web pages to determine what is clutter and what isn't.
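A crude, hedged sketch of such a check, building on the TextBlock segmentation shown in answer #2 (the 50-word and 0.1 link-density thresholds are my own guesses, not anything Boilerpipe prescribes):

import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;

public class PageKindGuesser {
    // "doc" is assumed to have already been processed by ArticleExtractor.
    static boolean looksLikeArticle(TextDocument doc) {
        for (TextBlock block : doc.getTextBlocks()) {
            // A substantial content block with few links suggests article
            // prose; pages made only of link lists/teasers rarely have one.
            if (block.isContent()
                    && block.getNumWords() > 50
                    && block.getLinkDensity() < 0.1) {
                return true;
            }
        }
        return false;
    }
}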

#4


1  

You can use libraries like Goose. It works best on articles/news pages. You can also look at the JavaScript code of the Readability bookmarklet, which does extraction similar to Goose.

#5


1  

My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (e.g. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.

On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the value of MIN_WORDS_SEQUENCE or be more selective with your selectors (i.e. not retrieve div elements)?
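As a rough sketch of that approach (the URL and the class-name patterns are illustrative assumptions of mine, not from the original posts):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectDemo {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com").get();

        // Drop containers that are usually boilerplate
        doc.select("nav, footer, aside").remove();
        // Drop elements whose class name hints at menus/banners (regex match)
        doc.select("[class~=(?i)menu|banner|advert]").remove();

        // Keep only paragraph text
        for (Element p : doc.select("p")) {
            if (!p.text().isEmpty()) {
                System.out.println(p.text());
            }
        }
    }
}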

#6


1  

http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php

It is proprietary software, but it makes it very easy to extract content from web pages, and it integrates well with Java.

You use a provided application to design XML files that the RoboServer API reads to parse web pages. The XML files are built by analyzing, inside the provided application, the pages you wish to parse (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and database integration using the provided Java API.

If you're against using commercial software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate the tags, and then build rules per site.

#7


0  

You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:

Tag Soup

HTML Unit
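For example, HtmlUnit can render a page and hand you its visible text. A minimal sketch, assuming the classic HtmlUnit API where HtmlPage has an asText() method (renamed asNormalizedText() in newer releases); the URL is a placeholder:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitTextDemo {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Plain HTML is enough here; skip JavaScript execution
            webClient.getOptions().setJavaScriptEnabled(false);
            HtmlPage page = webClient.getPage("https://example.com");
            // asText() returns the rendered visible text of the page
            System.out.println(page.asText());
        }
    }
}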

#8


0  

You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site. Refer to the link below for filtering the HTML; I hope it helps. http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/

#9


0  

You could use the textracto API; it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could split the navigation texts, preview texts, etc. from the main textual content.
