I am trying to get the source code for the following page: http://www.amazon.com/gp/offer-listing/082470732X/ref=dp_olp_0?ie=UTF8&redirect=true&condition=all (Please note that Amazon redirects you to a different page if you click the link. To reach the page I am interested in, copy the link and paste it into an empty tab in your browser. Thanks!)
Normally, using the java.net API, I can get the source code for most URLs with almost no problem; for the above link, however, I get nothing. It turned out that the input stream returned by the connection is gzip-encoded, so I tried the following:
URL url = new URL(urlString);
HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
InputStream is = urlConnection.getInputStream();
HttpURLConnection.setFollowRedirects(true);
urlConnection.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = urlConnection.getContentEncoding();
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    is = new GZIPInputStream(is);
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    is = new InflaterInputStream(is, new Inflater(true));
}
However, this time I consistently get the following error:
java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:249)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:239)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:142)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:67)
at domain.logic.ItemScraper.loadURL(ItemScraper.java:405)
at domain.logic.ItemScraper.main(ItemScraper.java:510)
Can anybody see my mistake? Is there another way to read this particular page? Can somebody explain to me why my browser (Firefox) can read it, but I cannot read the source using Java?
Thanks in advance, best regards,
2 Solutions
#1
Instead of
is = new GZIPInputStream(is);
try
is = new GZIPInputStream(urlConnection.getInputStream());
As for the EOFException, if you add
urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24");
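it goes away. The exception is thrown while GZIPInputStream is reading the gzip header, which means the response body was empty; the server apparently returns no content unless the request carries a browser-like User-Agent. Set the header before calling getInputStream().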
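Putting both suggestions together, a minimal sketch might look like the following. The class name GzipAwareFetcher and its helper methods are hypothetical, the User-Agent string is the one quoted in this answer, and decoding the body as UTF-8 is an assumption (based on the ie=UTF8 parameter in the URL). The request headers are set before getInputStream() is called, because the request is sent at that point.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

// Hypothetical helper class, for illustration only.
class GzipAwareFetcher {

    // Wraps the raw stream according to the Content-Encoding response header.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            // Throws EOFException if the stream ends before the gzip header.
            return new GZIPInputStream(raw);
        }
        if ("deflate".equalsIgnoreCase(contentEncoding)) {
            // Raw deflate without a zlib header, as in the question's code.
            return new InflaterInputStream(raw, new Inflater(true));
        }
        return raw; // no (or unknown) encoding: pass the bytes through
    }

    // Reads the whole stream into a String (UTF-8 is an assumption).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    static String fetch(String urlString) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(urlString).openConnection();
        // All request properties must be set before the connection is used.
        connection.setInstanceFollowRedirects(true);
        connection.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 "
                + "(KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24");
        connection.setRequestProperty("Accept-Encoding", "gzip, deflate");
        try (InputStream in = decode(connection.getInputStream(),
                                     connection.getContentEncoding())) {
            return readAll(in);
        } finally {
            connection.disconnect();
        }
    }
}
```

Calling GzipAwareFetcher.fetch(urlString) should then return the page source as a string. Passing an empty stream to decode with "gzip" reproduces the EOFException from the question, which is consistent with the server sending back no body.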
#2
You can use a standard BufferedReader to read the response of a webserver of a given URL.
BufferedReader URLIn = new BufferedReader(new InputStreamReader(new URL(URLOrFilename).openStream()));
Then use ...
String incomingLine;
while ((incomingLine = URLIn.readLine()) != null) {
    ...
}
... to get the response.
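As a rough, self-contained version of the loop above: the class name LineReaderExample and both method names are hypothetical, and UTF-8 is an assumed charset. This sketch reads the response as plain text and does not decompress a gzip-encoded body.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class LineReaderExample {

    // Reads every line from the source and joins them with '\n'.
    static String readLines(Reader source) throws IOException {
        StringBuilder response = new StringBuilder();
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
        }
        return response.toString();
    }

    // Opens the URL with default request headers and reads it as text.
    static String readUrl(String urlString) throws IOException {
        return readLines(new InputStreamReader(
                new URL(urlString).openStream(), StandardCharsets.UTF_8));
    }
}
```

Because openStream() sends only the default request headers, it may still receive an empty response from a server that expects a browser-like User-Agent.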