Page content is loaded with JavaScript and Jsoup doesn't see it

Time: 2022-10-06 23:02:31

One block on the page is filled with content by JavaScript, and after loading the page with Jsoup none of that information is there. Is there a way to also get JavaScript-generated content when parsing a page with Jsoup?

Special UPD for Marcin:
Can't paste page code here, since it is too long: http://pastebin.com/qw4Rfqgw

Here's the element whose content I need: <div id='tags_list'></div>

I need to get this information in Java, preferably using Jsoup. The element is filled with the help of JavaScript:

<div id="tags_list">
    <a href="/tagsc0t20099.html" style="font-size:14;">разведчик</a>
    <a href="/tagsc0t1879.html" style="font-size:14;">Sr</a>
    <a href="/tagsc0t3140.html" style="font-size:14;">стратегический</a>
</div>

Java code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Test
{
    public static void main( String[] args )
    {
        try
        {
            Document Doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
            Elements Tags = Doc.select( "#tags_list a" );

            for ( Element Tag : Tags )
            {
                System.out.println( Tag.text() );
            }
        }
        catch ( IOException e )
        {
            e.printStackTrace();
        }
    }
}

6 solutions

#1


16  

JSoup is an HTML parser, not some kind of embedded browser engine. This means that it's completely unaware of any content that is added to the DOM by Javascript after the initial page load.

To get access to that type of content you will need an embedded browser component. There are a number of discussions on SO regarding that kind of component, e.g. "Is there a way to embed a browser in Java?"

#2


13  

Solved in my case with com.codeborne.phantomjsdriver. NOTE: it is Groovy code.

pom.xml

        <dependency>
          <groupId>com.codeborne</groupId>
          <artifactId>phantomjsdriver</artifactId>
          <version> <here goes last version> </version>
        </dependency>

PhantomJsUtils.groovy

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

class PhantomJsUtils {
    private static String filePath = 'data/temp/';

    public static Document renderPage(String filePath) {
        System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
        WebDriver ghostDriver = new PhantomJSDriver();
        try {
            ghostDriver.get(filePath);
            return Jsoup.parse(ghostDriver.getPageSource());
        } finally {
            ghostDriver.quit();
        }
    }

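    public static Document renderPage(Document doc) {
        // Write the parsed HTML to a temp file (the data/temp/ directory must already exist)
        // so that PhantomJS can load it and run its scripts.
        File tmpFile = new File("$filePath${Calendar.getInstance().timeInMillis}.html");
        tmpFile.text = doc.toString();
        // Pass a file: URL rather than a bare relative path
        return renderPage(tmpFile.toURI().toString());
    }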
}

ClassInProject.groovy

Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))

#3


4  

You need to understand what is happening:

  • When you query a page from a website, whether using Jsoup or your browser, what gets sent back to you is some HTML. Jsoup is able to parse that.
  • However, most websites include JavaScript in that HTML, or linked from that HTML, which will populate the page with content. Your browser is able to execute the JavaScript, and thus populate the page. Jsoup is not.

The way to understand this is the following: parsing HTML code is easy. Executing JavaScript code and updating the corresponding HTML is a lot more complex, and is the work of a browser.

Here are some solutions for this kind of problem:

  1. If you can find the Ajax calls that the JavaScript code makes to load the content, you might be able to request the URLs of those calls directly with Jsoup. Use your browser's Developer Tools to find them. This is not guaranteed to work, though:
    • the URL might be dynamic and depend on what is on the page at that time
    • if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
  2. In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It is scripted in JavaScript, so you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.
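
As a rough sketch of the first option, the request that the page's JavaScript makes can be rebuilt by hand once Developer Tools has shown its URL. The example below uses only the JDK's java.net.http classes (Java 11+). The endpoint URL, the cookie name, and the class and method names are all hypothetical, and whether a given site expects the X-Requested-With header is an assumption; nothing here is taken from the site in the question.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.Map;
import java.util.stream.Collectors;

public class AjaxRequestSketch {

    // Builds (but does not send) a GET request for an Ajax endpoint,
    // replaying cookies captured from the initial page load.
    public static HttpRequest buildAjaxRequest(String endpointUrl, Map<String, String> cookies) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(endpointUrl))
                .timeout(Duration.ofSeconds(30))
                // Assumption: some Ajax endpoints check for this header; many do not.
                .header("X-Requested-With", "XMLHttpRequest")
                .GET();
        if (!cookies.isEmpty()) {
            // Join the cookies into a single "name=value; name=value" header value.
            String cookieHeader = cookies.entrySet().stream()
                    .map(e -> e.getKey() + "=" + e.getValue())
                    .collect(Collectors.joining("; "));
            builder.header("Cookie", cookieHeader);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and session cookie, for illustration only.
        HttpRequest request = buildAjaxRequest(
                "http://www.example.com/ajax/tags?id=32558",
                Map.of("sessionid", "abc123"));
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

Sending the request with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) returns the response body as a string. If the endpoint returns an HTML fragment, that string can be handed to Jsoup.parse; if it returns JSON, a JSON parser is needed instead.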

#4


1  

In fact there is a "way"! Maybe it is more of a "workaround" than a "way"... The code below checks for both the meta attribute "REFRESH" and JavaScript redirects... If either of them exists, the RedirectedUrl variable is set. So you know your target... Then you can retrieve the target page and go on...

    String RedirectedUrl=null;
    Elements meta = page.select("html head meta");
    if (meta.attr("http-equiv").contains("REFRESH")) {
        RedirectedUrl = meta.attr("content").split("=")[1];
    } else {
        if (page.toString().contains("window.location.href")) {
            meta = page.select("script");
            for (Element script:meta) {
                String s = script.data();
                if (!s.isEmpty() && s.startsWith("window.location.href")) {
                    int start = s.indexOf("=");
                    int end = s.indexOf(";");
                    if (start>0 && end >start) {
                        s = s.substring(start+1,end);
                        s =s.replace("'", "").replace("\"", "");        
                        RedirectedUrl = s.trim();
                        break;
                    }
                }
            }
        }
    }

... now retrieve the redirected page again...
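
One caveat with the snippet above: split("=")[1] cuts the target short when the URL itself contains "=", as in a query string, and the script check only matches when the script text starts with window.location.href. A slightly more tolerant variant, sketched here with plain regular expressions, could look like this. The class and method names are hypothetical, and it still only finds simple literal redirects; it does not execute any JavaScript.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RedirectTargetSketch {

    // Matches a meta refresh value such as "0; url=http://host/path?a=b",
    // capturing everything after "url=" (optionally quoted).
    private static final Pattern META_REFRESH = Pattern.compile(
            "(?i)^\\s*\\d+\\s*[;,]\\s*url\\s*=\\s*['\"]?([^'\"]+?)['\"]?\\s*$");

    // Matches a literal assignment such as window.location.href = "http://host/path";
    // anywhere in the script text.
    private static final Pattern JS_LOCATION = Pattern.compile(
            "window\\.location(?:\\.href)?\\s*=\\s*(['\"])(.*?)\\1");

    // content is the value of the meta tag's "content" attribute.
    public static Optional<String> fromMetaRefresh(String content) {
        Matcher m = META_REFRESH.matcher(content);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // scriptSource is the text of one script element (script.data() in Jsoup).
    public static Optional<String> fromScript(String scriptSource) {
        Matcher m = JS_LOCATION.matcher(scriptSource);
        return m.find() ? Optional.of(m.group(2).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromMetaRefresh("0; url=http://example.com/page?id=5"));
        System.out.println(fromScript("  window.location.href = 'http://example.com/next?a=1';"));
    }
}
```

The result of either method can replace the RedirectedUrl assignments in the code above; an empty Optional means no redirect of that kind was found.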

#5


0  

Is there a way to also get JavaScript-generated content when parsing a page with Jsoup?

I am going to guess no: it is hard to see how this could work without building an entire JavaScript interpreter in Java.

#6


-1  

Try:

Document Doc = Jsoup.connect(url)
    .header("Accept-Encoding", "gzip, deflate")
    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
    .maxBodySize(0)
    .timeout(600000)
    .get();
