Scraping data dynamically generated by JavaScript in an HTML document using C#

Time: 2022-02-09 01:23:43

How can I use C# to scrape data that is generated dynamically by JavaScript in an HTML document?

Using WebRequest and HttpWebResponse in C#, I can get the whole HTML source as a string. The difficulty is that the data I want isn't in that source; it is generated dynamically by JavaScript.

If the data I want is already in the source, on the other hand, I can extract it easily with regular expressions.

I have downloaded HtmlAgilityPack, but I don't know whether it handles the case where items are generated dynamically by JavaScript.

Thank you very much!


2 Answers

#1


10  

When you make the WebRequest, you're asking the server to give you the page file. That file's content hasn't yet been parsed or executed by a web browser, so the JavaScript in it hasn't done anything yet.
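To make that concrete, here is a minimal sketch using a hypothetical page source (the `RawHtml` string, the `price` id and the value `42` are all invented for illustration). It shows what a regular expression sees in the raw markup: the element exists, but it is still empty because the script that fills it never ran.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StaticSourceDemo
{
    // Hypothetical page source, as the server would send it before any script runs.
    public const string RawHtml =
        "<html><body>" +
        "<div id=\"price\"></div>" +
        "<script>document.getElementById('price').textContent = '42';</script>" +
        "</body></html>";

    // Returns the text between <div id="price"> and </div>, or null if the div is absent.
    public static string ExtractPrice(string html)
    {
        Match m = Regex.Match(html, "<div id=\"price\">(.*?)</div>");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string price = ExtractPrice(RawHtml);
        // The div is present but empty: only a browser would execute the script that fills it.
        Console.WriteLine(string.IsNullOrEmpty(price)
            ? "div is empty in the raw source"
            : "found: " + price);
    }
}
```

The same holds for any parser that only reads the downloaded string: it can find the element, but not the value the script would have written into it.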

If you want to see what the page looks like after a browser has processed it, you need a tool that executes the JavaScript on the page. One option is the built-in .NET WebBrowser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx

The WebBrowser control can navigate to the page and load it, and you can then query its DOM, which will have been altered by the JavaScript on the page.

EDIT (example):

// Assumes a WebBrowser control named webBrowserControl on a Windows Forms form
// (using System.Windows.Forms).
Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");

webBrowserControl.AllowNavigation = true;
// Optional, but I use this because it stops JavaScript errors from breaking your scraper.
webBrowserControl.ScriptErrorsSuppressed = true;
// Start scraping only after the document has finished loading, so do the work in the handler registered here.
webBrowserControl.DocumentCompleted += webBrowserControl_DocumentCompleted;
webBrowserControl.Navigate(uri);

private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // By now the document has loaded, so this is the DOM as modified by the page's scripts.
    HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");

    foreach (HtmlElement div in divs)
    {
        // do something with each div, for example read div.InnerText
    }
}
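The snippet above assumes the control already sits on a form. If the scraper is a console program instead, one possible approach is to host the control on its own STA thread with a message loop. The following is only a sketch under that assumption: the class and method names (`RenderedDomDump`, `GetRenderedHtml`) are hypothetical, and it does not wait for scripts that keep changing the page after loading completes.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

public static class RenderedDomDump
{
    // Loads the page in a WebBrowser control and returns the body markup after loading.
    public static string GetRenderedHtml(Uri uri)
    {
        string html = null;

        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser())
            {
                browser.ScriptErrorsSuppressed = true;
                browser.DocumentCompleted += (sender, e) =>
                {
                    // DocumentCompleted can fire once per frame; act only on the top-level document.
                    if (e.Url != browser.Url) return;

                    html = browser.Document.Body.OuterHtml;
                    Application.ExitThread(); // stop the message loop started below
                };
                browser.Navigate(uri);
                Application.Run(); // pump messages so the control can load the page
            }
        });

        // The WebBrowser control requires a single-threaded apartment.
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        return html;
    }
}
```

The returned string could then be handed to a parser such as HtmlAgilityPack, which works on markup it is given but does not execute scripts itself.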

#2


4  

You could take a look at a tool like Selenium for scraping pages that rely on JavaScript.

http://www.andykelk.net/tech/headless-browser-testing-with-phantomjs-selenium-webdriver-c-nunit-and-mono
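As a rough idea of what that looks like in C#, here is a minimal sketch using the Selenium WebDriver bindings. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages plus a local Chrome installation, and it uses headless Chrome rather than the PhantomJS setup described in the linked article. The URL is the placeholder from the first answer, and the `div.result` selector is hypothetical.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class SeleniumScrapeSketch
{
    public static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run the browser without a visible window

        using (IWebDriver driver = new ChromeDriver(options))
        {
            // Placeholder URL; replace with the page you want to scrape.
            driver.Navigate().GoToUrl("http://www.somewebsite.com/somepage.htm");

            // Wait up to 10 seconds for the script-generated elements to appear.
            // "div.result" is a hypothetical selector for the data of interest.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => d.FindElements(By.CssSelector("div.result")).Count > 0);

            foreach (IWebElement element in driver.FindElements(By.CssSelector("div.result")))
            {
                Console.WriteLine(element.Text);
            }
        }
    }
}
```

The explicit wait matters because script-generated content may appear some time after navigation returns; the selector passed to the wait should match whatever the page's JavaScript inserts.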
