Scraping a JavaScript-generated web page with C#

Date: 2021-11-06 01:23:45

I have a WebBrowser control and a label in a Visual Studio project, and I'm trying to grab a section of another web page.

I tried WebClient.DownloadString and WebClient.DownloadFile, and both return the source of the page as it is before the JavaScript loads the content. My next idea was to use a WebBrowser control and read webBrowser.DocumentText after the page loaded, but that did not work either: it still returns the original source of the page.

Is there a way to grab the page after the JavaScript has run?

Here is the page I'm trying to scrape.

http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083

I need to get the comment from that page; the comment is generated dynamically.

2 Answers

#1 (score: 32)

The problem is that a browser executes the JavaScript, which results in an updated DOM. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code as a browser would. I ran into the same issue in the past and used Selenium and PhantomJS to render the page. After the page renders, I use the WebDriver client to navigate the DOM and retrieve the content I need, post-AJAX.
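
One detail specific to the URL in the question: everything after the # is a fragment, which an HTTP client does not send to the server, so WebClient.DownloadString can only ever receive the shell page served at /. The sketch below uses System.Uri to show which part of the address goes over the wire. The class and method names (FragmentDemo, RequestedPath, ClientSidePart) are made up for illustration, and the assumption that the page's script reads the fragment to decide what to load is an inference from the URL shape, not something confirmed against the site.

```csharp
using System;

public static class FragmentDemo
{
    // The path and query are what the HTTP request actually asks the server for.
    public static string RequestedPath(string url) => new Uri(url).PathAndQuery;

    // The fragment stays on the client; presumably the page's script reads it
    // and then loads the document details with a separate AJAX request.
    public static string ClientSidePart(string url) => new Uri(url).Fragment;

    public static void Main()
    {
        const string url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";
        Console.WriteLine("Sent to server:  " + RequestedPath(url));
        Console.WriteLine("Kept in browser: " + ClientSidePart(url));
    }
}
```

The document ID never reaches the server in the initial request, which is why something has to run the script before the comment appears in the DOM.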

At a high level, these are the steps:

  1. Install Selenium: http://docs.seleniumhq.org/
  2. Start the Selenium hub as a service
  3. Download PhantomJS (a headless browser that can execute the JavaScript): http://phantomjs.org/
  4. Start PhantomJS in WebDriver mode, pointing to the Selenium hub
  5. In the scraping application, install the WebDriver client NuGet package: Install-Package Selenium.WebDriver

Here is an example usage of the PhantomJS WebDriver:

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Remote;

var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled", true);

// Configuration.SeleniumServerHub holds the URL of the Selenium hub
var driver = new RemoteWebDriver(new Uri(Configuration.SeleniumServerHub),
                                 options.ToCapabilities(),
                                 TimeSpan.FromSeconds(3));
// navigating loads the page and lets the browser execute its scripts
driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");
// the driver can now provide you with what you need
// get the source of the page
var source = driver.PageSource;
// fully navigate the DOM
var pathElement = driver.FindElement(By.Id("some-id"));

More info on Selenium, PhantomJS and WebDriver can be found at the following links:

http://docs.seleniumhq.org/

http://docs.seleniumhq.org/projects/webdriver/

http://phantomjs.org/

EDIT: Easier Method

It appears there is a NuGet package for PhantomJS, so you don't need the hub (I used a cluster to do massive scraping in this manner):

Install the WebDriver client:

Install-Package Selenium.WebDriver

Install the embedded exe:

Install-Package phantomjs.exe

Updated code:

using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

var driver = new PhantomJSDriver();
// navigating loads the page and lets the browser execute its scripts
driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");
// the driver can now provide you with what you need
// get the source of the page
var source = driver.PageSource;
// fully navigate the DOM
var pathElement = driver.FindElement(By.Id("some-id"));
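
One caveat with the snippets above: PageSource and the element lookup run as soon as navigation returns, which may be before the AJAX call has filled in the comment. Selenium's support package (Selenium.Support) provides a WebDriverWait class for this situation. The sketch below shows the same idea as a small dependency-free helper so it can be run on its own. The names Poller and WaitFor are illustrative and not part of Selenium, and the timeout and interval values are arbitrary assumptions.

```csharp
using System;
using System.Threading;

public static class Poller
{
    // Calls probe repeatedly until it returns a non-null value or the timeout elapses.
    // Exceptions thrown by probe (for example "element not found") count as "not ready yet".
    public static T WaitFor<T>(Func<T> probe, TimeSpan timeout, TimeSpan interval) where T : class
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            try
            {
                T result = probe();
                if (result != null) return result;
            }
            catch (Exception)
            {
                // not ready yet; retry until the deadline
            }

            if (DateTime.UtcNow >= deadline)
                throw new TimeoutException("Condition not met within " + timeout);

            Thread.Sleep(interval);
        }
    }

    public static void Main()
    {
        // Stand-in for a page whose content appears only on the third check.
        int checks = 0;
        string text = WaitFor<string>(
            () => ++checks < 3 ? null : "comment loaded",
            TimeSpan.FromSeconds(2),
            TimeSpan.FromMilliseconds(50));
        Console.WriteLine(text + " after " + checks + " checks");

        // With a real driver the probe would be the element lookup, e.g.:
        // var element = WaitFor(() => driver.FindElement(By.Id("some-id")),
        //                       TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(250));
    }
}
```

Polling the DOM keeps the scraper independent of how the page loads its data, at the cost of a small delay per lookup.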

#2 (score: 1)

OK, I will show you how to enable JavaScript using PhantomJS and Selenium with C#.

  1. Create a new console project and name it as you like.
  2. Go to Solution Explorer on the right-hand side.
  3. Right-click References and click Manage NuGet Packages.
  4. In the window that appears, click Browse, then install Selenium.WebDriver.
  5. Download PhantomJS from the PhantomJS website.
  6. In your Main function, type this code:

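        var options = new PhantomJSOptions();
        options.AddAdditionalCapability("IsJavaScriptEnabled", true);
        // replace with the folder that contains phantomjs.exe
        IWebDriver driver = new PhantomJSDriver("phantomjs Folder Path", options);
        driver.Navigate().GoToUrl("https://www.yourwebsite.com/");

        try
        {
            // page source after the scripts have run
            string pagesource = driver.PageSource;
            // throws NoSuchElementException if the element is missing
            driver.FindElement(By.Id("yourelement"));
            Console.WriteLine("yourelement found");
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);
        }
        finally
        {
            // shut down the PhantomJS process
            driver.Quit();
        }

        Console.Read();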

Don't forget to put your website, the element you are looking for, and the path to phantomjs.exe on your machine in the code above.

Have a great time coding, and thanks wbennett.
