I have a serious concern here. I have searched all through Stack Overflow and many other sites; everywhere they give the same solution, and I have tried all of them, but I am not able to resolve this issue.
I have the following code:
Document doc = Jsoup.connect(url).timeout(30000).get();
Here I am using the Jsoup library, and the result I get does not match the actual page source you can see via right-click on the page -> View Page Source. Many parts are missing from the result of the line above. After searching some sites on Google, I found this approach:
URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
// NOTE: this decodes with the platform default charset, not the page's declared one
InputStreamReader isr = new InputStreamReader(is);
int numCharsRead;
char[] charArray = new char[1024];
StringBuilder sb = new StringBuilder();
while ((numCharsRead = isr.read(charArray)) > 0) {
    sb.append(charArray, 0, numCharsRead);
}
isr.close();
String result = sb.toString();
System.out.println(result);
But no luck. While searching the internet for this problem, I saw many sites saying I had to set the proper charset and encoding of the webpage while downloading its page source. But how can I determine these dynamically from my code? Are there any classes in Java for that? I also went through crawler4j a bit, but it did not do much for me. Please help; I have been stuck on this problem for over a month now and have tried everything I can. My final hope rests on the gods of Stack Overflow, who have always helped!
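On the charset question: the encoding is usually announced in the `Content-Type` response header (e.g. `text/html; charset=ISO-8859-1`), which `URLConnection.getContentType()` exposes. Below is a minimal sketch of sniffing it from that header; the class and method names are my own, not from any library:

```java
public class CharsetSniffer {
    /**
     * Extracts the charset name from a Content-Type header value,
     * e.g. "text/html; charset=ISO-8859-1" -> "ISO-8859-1".
     * Returns the fallback when no charset parameter is present.
     */
    public static String detectCharset(String contentType, String fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                part = part.trim();
                // case-insensitive match on the "charset=" prefix
                if (part.regionMatches(true, 0, "charset=", 0, 8)) {
                    return part.substring(8).replace("\"", "").trim();
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(detectCharset("text/html; charset=UTF-8", "UTF-8"));
    }
}
```

You could then decode with `new InputStreamReader(is, CharsetSniffer.detectCharset(urlConnection.getContentType(), "UTF-8"))`. Note that Jsoup already performs this kind of detection for you (from the header and from `<meta>` tags), so missing content is more likely a JavaScript or request-header issue than an encoding one.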
3 Answers
#1 (4 votes)
I ran into this recently; it turned out to be some sort of robot protection. Change your original line to:
Document doc = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(30000)
.get();
#2 (3 votes)
The problem might be that your web page is rendered by JavaScript running in a browser. Jsoup alone can't help you with this, so you may try HtmlUnit, a headless browser that Selenium can also drive, to emulate the browser: using Jsoup to sign in and crawl data.
UPDATE
There are several reasons why the HTML can differ. The most probable is that the page contains `<script>` elements holding dynamic page logic. This could be an application inside your web page that sends requests to the server and adds or removes content depending on the responses.
Jsoup will never render such pages, because that is a job for a browser like Chrome, Firefox or IE. Jsoup is a lightweight parser for the plain HTML you get from the server.
What you could do is use a web driver that emulates a web browser and renders the page in memory, so it has the same content as shown to the user. You can even perform mouse clicks with such a driver.
The web-driver implementation proposed in the linked answer is HtmlUnit. It is the most lightweight solution; however, it might give you unexpected results: Selenium vs HtmlUnit?.
If you want the most realistic page rendering, you might want to consider Selenium WebDriver.
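As a sketch of the web-driver approach (this requires the `selenium-java` and `htmlunit-driver` dependencies, which are not part of the JDK; the URL is illustrative):

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class RenderedSourceFetch {
    public static void main(String[] args) {
        // HtmlUnitDriver is a headless, JavaScript-capable driver;
        // swap in ChromeDriver or FirefoxDriver for real-browser rendering.
        WebDriver driver = new HtmlUnitDriver(true); // true = enable JavaScript
        try {
            driver.get("https://example.com/");
            // getPageSource() returns the DOM *after* scripts have run,
            // unlike Jsoup/URLConnection, which only see the raw response.
            String renderedHtml = driver.getPageSource();
            System.out.println(renderedHtml);
        } finally {
            driver.quit();
        }
    }
}
```

The rendered HTML can then be handed to `Jsoup.parse(renderedHtml)` if you still want Jsoup's selector API for extraction.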
#3 (1 vote)
Why do you want to parse a web page this way? If the website exposes a consumable service, it might have a REST API.
To answer your question: a webpage viewed in a web browser may not be the same as the same webpage downloaded via a URLConnection.
The following are a few of the reasons for these differences:

1. Request headers: when the client (Java application/browser) makes a request for a URL, it sets various headers as part of the request, and the web server may change the content of the response accordingly.
2. JavaScript: once the response is received, any script elements present in it are executed by the browser's JavaScript engine, which may change the contents of the DOM.
3. Browser plugins, such as IE Browser Helper Objects, Firefox extensions or Chrome extensions, may change the contents of the DOM.

In simple terms, when you request a URL using a URLConnection you receive raw data, whereas when you request the same URL from a browser's address bar you get the processed (by JavaScript/browser plugins) webpage.
URLConnection/Jsoup will let you set request headers as required, but you may still get a different response due to points 2 and 3. Selenium lets you remote-control a browser and has an API to access the rendered page; it is commonly used for automated testing of web applications.
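To make point 1 concrete, request headers can be set on a plain URLConnection before the connection is actually opened (the header values below are illustrative, not required):

```java
import java.net.URL;
import java.net.URLConnection;

public class HeaderDemo {
    public static void main(String[] args) throws Exception {
        // openConnection() does not contact the server yet
        URLConnection conn = new URL("https://example.com/").openConnection();
        // Headers must be set before connect()/getInputStream() is called
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setRequestProperty("Accept-Language", "en-US,en;q=0.9");
        conn.setRequestProperty("Referer", "https://www.google.com/");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

Jsoup offers the same via `.userAgent(...)`, `.referrer(...)` and `.header(name, value)` on `Jsoup.connect(url)`.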