Getting text from a URL in ASP.NET

Time: 2022-09-24 23:45:53

I am looking for a reliable way of extracting text from a given web address, in ASP.NET/C#. Can anyone point me in the right direction?

Also, the web address could be, say, a news site that might have a lot of ads, menus, etc. I need some intelligent way of extracting only the relevant content. I'm not sure how this could be done, since how would I even define what relevance is?

Should I maybe read from an RSS feed? Any thoughts on this?

EDIT: I have added a bounty. I am looking to extract "relevant" text from a URL. By "relevant" I mean it should exclude text from ads (and other irrelevant info). The input will be similar to a news site; I need to extract only the news content and get rid of the extraneous text.

6 Answers

#1


4  

Once you have downloaded the page and started using a library like HTML Agility Pack to parse the HTML, your work starts :)

Screen scraping is divided into two parts.

First, the web crawler (there is lots of information on this on the web, and simple code using WebClient is provided in some of the other answers). The crawler has to traverse links and download pages. If you are downloading a lot of pages and have the start URL, you could roll your own or use an existing one. Check Wikipedia for a list of open-source web crawlers/spiders.
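
As a minimal sketch of such a crawler in C# (a toy only: no robots.txt handling, politeness delays, or relative-URL resolution; the start URL and the page cap are placeholders, and HtmlAgilityPack is assumed for link extraction):

using System;
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack;

class ToyCrawler
{
    static void Main()
    {
        var pending = new Queue<string>();
        var seen = new HashSet<string>();
        pending.Enqueue("http://example.com/"); // placeholder start URL

        using (var client = new WebClient())
        {
            while (pending.Count > 0 && seen.Count < 10) // arbitrary page cap
            {
                string url = pending.Dequeue();
                if (!seen.Add(url))
                    continue; // already visited

                string html;
                try { html = client.DownloadString(url); }
                catch (WebException) { continue; } // skip unreachable pages

                var doc = new HtmlDocument();
                doc.LoadHtml(html);

                // Queue the absolute links found on this page.
                var links = doc.DocumentNode.SelectNodes("//a[@href]");
                if (links == null)
                    continue;
                foreach (var link in links)
                {
                    string href = link.GetAttributeValue("href", "");
                    if (href.StartsWith("http"))
                        pending.Enqueue(href);
                }
            }
        }
    }
}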

The second part is parsing the HTML and pulling out only the text you want, omitting any noise (headers, banners, footers, etc.). Traversing the DOM is easy with existing libraries; figuring out what to do with what you parse is the hard part.
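
A minimal sketch of that second part with HtmlAgilityPack (the list of tags stripped here is just an illustrative starting point, and the URL is a placeholder):

using System;
using HtmlAgilityPack;

class TextExtractor
{
    static string ExtractVisibleText(string url)
    {
        // Download and parse the page in one step.
        var doc = new HtmlWeb().Load(url);

        // Strip elements that never contain readable content.
        var noise = doc.DocumentNode.SelectNodes("//script|//style|//head");
        if (noise != null)
        {
            foreach (var node in noise)
                node.Remove();
        }

        // InnerText concatenates whatever text nodes are left.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    static void Main()
    {
        Console.WriteLine(ExtractVisibleText("http://example.com/"));
    }
}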

I've written a bit about it before in another SO question, and it might give you some ideas for how to manually grab the content you want. From my experience there is no 100% reliable way to find the main content of a page, and more often than not you need to manually give it some pointers. The difficult part is that if the HTML layout of the page changes, your screen scraper will start to fail.

You could apply statistics and compare the HTML of several pages in order to deduce where the ads, menus, etc. are, and eliminate those.
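
A crude illustration of that comparison idea: text that appears on both of two different article pages from the same site is probably template (menus, footers, ads), while text unique to one page is candidate content. The URLs below are placeholders, and a line-level diff is only a rough proxy for a real statistical comparison:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

class TemplateDiff
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Two different articles from the same site (placeholders).
            var a = client.DownloadString("http://example.com/article1").Split('\n');
            var b = client.DownloadString("http://example.com/article2").Split('\n');

            // Lines shared by both pages are likely template boilerplate;
            // lines unique to page A are candidate content.
            var boilerplate = new HashSet<string>(b);
            foreach (string line in a.Where(l => !boilerplate.Contains(l)))
                Console.WriteLine(line);
        }
    }
}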

Since you mention news sites, there are two other approaches that should be easier to apply to these sites than parsing the text out of the original HTML.

  1. Check if the page has a print URL. E.g., an article on CNN has an equivalent print URL, which is much easier to parse.
  2. Check if the page has an RSS representation, and pick the article text from the RSS feed instead (see the sketch after this list). If the feed doesn't have all the content, it should still give you enough text to locate the article in the full HTML page.
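
A minimal sketch of option 2 using the built-in System.ServiceModel.Syndication classes (the feed URL is a placeholder; which assembly you must reference varies by .NET framework version):

using System;
using System.ServiceModel.Syndication;
using System.Xml;

class RssReader
{
    static void Main()
    {
        string feedUrl = "http://example.com/news/rss.xml"; // placeholder feed URL

        using (XmlReader reader = XmlReader.Create(feedUrl))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                Console.WriteLine(item.Title.Text);

                // Some feeds carry the full article; many carry only a teaser,
                // which is still enough to locate the text in the full page.
                if (item.Summary != null)
                    Console.WriteLine(item.Summary.Text);
            }
        }
    }
}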

Also check out The Easy Way to Extract Useful Text from Arbitrary HTML for ideas on how to create a more general parser. The code is in Python, but you should be able to convert it without too much trouble.

#2


3  

I think you need an HTML parser like HtmlAgilityPack, or you can use the newborn.. YQL, a new tool developed by Yahoo. Its syntax is like SQL, and you need a little knowledge of XPath...

http://developer.yahoo.com/yql/

Thanks

#3


2  

Use a WebClient instance to get your markup...

' Requires: Imports System.Net
Dim Markup As String

Using Client As New WebClient()
    Markup = Client.DownloadString("http://www.google.com")
End Using

And then use the HtmlAgilityPack to parse the response with XPath...

' Requires: Imports HtmlAgilityPack
Dim Doc As New HtmlDocument()
Doc.LoadHtml(Markup) ' HtmlAgilityPack parses HTML via LoadHtml; it has no LoadXML method

If Doc.ParseErrors.Count = 0 Then
    Dim Node As HtmlNode = Doc.DocumentNode.SelectSingleNode("//body")

    If Node IsNot Nothing Then
        'Do something with Node
    End If
End If

#4


0  

To get the actual HTML markup, try the WebClient object. Something like this will get you the markup:

// Requires: using System.IO; using System.Net;
System.Net.WebClient client = new System.Net.WebClient();

// Add a user-agent header in case the requested URI contains a query.
client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

using (Stream data = client.OpenRead("http://www.google.com"))
using (StreamReader reader = new StreamReader(data))
{
    string s = reader.ReadToEnd();
    // "s" now contains the entire HTML page source.
}

Then, as isc-fausto said, you can use regular expressions to parse the output as needed.
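
For what it's worth, a quick tag-stripping pass with regular expressions might look like the sketch below. Note that regexes are fragile against real-world HTML (scripts, comments, attribute values containing '>'), so a real parser is usually the safer choice:

using System;
using System.Text.RegularExpressions;

class TagStripper
{
    static void Main()
    {
        string html = "<p>Hello <b>world</b>.</p>";

        // Remove anything that looks like a tag (fragile: breaks on
        // scripts, comments, and attribute values containing '>').
        string text = Regex.Replace(html, "<[^>]+>", " ");

        // Collapse the leftover whitespace.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        Console.WriteLine(text); // prints: Hello world .
    }
}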

#5


0  

Text summarization techniques are probably what you're after. But as a rough heuristic, you can do this with some relatively simple steps, as long as you aren't counting on 100% perfect results all of the time.

As long as you don't need to support writing systems that don't put spaces between words (Chinese, Japanese), you can get pretty good results by looking for the first couple of runs of consecutive word sequences, using an arbitrary threshold that you'll spend a few days tuning. (Chinese and Japanese would require a reasonable word-break identification algorithm in addition to this heuristic.)

I would start with an HTML parser (HTML Agility Pack in .NET, or something like Ruby's Nokogiri or Python's BeautifulSoup if you'd like to experiment with the algorithms in a more interactive environment before committing to your C# solution).

To reduce the search space, remove sequences of links with little or no surrounding text, using the features of your HTML parser. That should eliminate most navigation panels and certain types of ads. You could extend this further to look for links that have words after them but no punctuation; this would eliminate descriptive links.
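
A sketch of that filtering step with HtmlAgilityPack; the element list and the 0.8 link-density threshold are arbitrary assumptions to tune, and the URL is a placeholder:

using System;
using System.Linq;
using HtmlAgilityPack;

class LinkDensityFilter
{
    // Remove blocks whose text consists mostly of link text.
    static void RemoveLinkHeavyBlocks(HtmlDocument doc, double threshold = 0.8)
    {
        var blocks = doc.DocumentNode.SelectNodes("//div|//ul|//td");
        if (blocks == null)
            return;

        foreach (var block in blocks.ToList())
        {
            int totalLen = block.InnerText.Trim().Length;
            int linkLen = block.SelectNodes(".//a")
                ?.Sum(a => a.InnerText.Trim().Length) ?? 0;

            // Mostly-link blocks are probably navigation or ads.
            if (totalLen > 0 && (double)linkLen / totalLen > threshold)
                block.Remove();
        }
    }

    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/");
        RemoveLinkHeavyBlocks(doc);
        Console.WriteLine(doc.DocumentNode.InnerText);
    }
}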

If you start to see runs of text followed by "." or "," with, say, 5 or more words (which you can try tuning later), you'd start scoring those as potential sentences or sentence fragments. When you find several runs in a row, that has pretty good odds of being the most important part of the page. You could score text with <p> tags around it a bit higher. Once you have a fair number of these types of sequences, the odds are pretty good that you've got "content" rather than layout chrome.
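
A rough sketch of that scoring idea; the 5-word minimum and the <p> bonus are the arbitrary knobs described above, and the URL is a placeholder:

using System;
using System.Linq;
using HtmlAgilityPack;

class ContentScorer
{
    // Count word runs that end at '.' or ',' as sentence-like evidence.
    static int Score(HtmlNode node)
    {
        int score = 0;
        foreach (string fragment in node.InnerText.Split('.', ','))
        {
            int words = fragment.Split(new[] { ' ', '\t', '\n', '\r' },
                StringSplitOptions.RemoveEmptyEntries).Length;
            if (words >= 5) // arbitrary minimum run length to tune
                score += words;
        }
        if (node.Name == "p") // text inside <p> scores a bit higher
            score += score / 2;
        return score;
    }

    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/");
        var best = doc.DocumentNode.SelectNodes("//p|//div")
            ?.OrderByDescending(Score).FirstOrDefault();
        if (best != null)
            Console.WriteLine(best.InnerText.Trim());
    }
}

In practice you'd want to score leaf-level blocks rather than every div, since a container div accumulates all of its children's text and will always win.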

This won't be perfect, and you may need to add a mechanism to tweak the heuristic based on problematic page structures that you regularly scan. But if you build something based on this approach, it should provide pretty reasonable results for 80% or so of your content.

If you find this kind of method inadequate, you may want to look at Bayesian probability or Hidden Markov Models as a way of improving the results.

#6


-4  

Once you have the web page's HTML code, you could use regular expressions.
