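Today I was writing a small crawler with jsoup. Jsoup connects to ordinary http sites without any trouble, but it falls flat on its face the moment it meets https. I looked through the API and, maybe it's just me, couldn't find anything in Jsoup for this. Excuse me??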

public static void iAmStudent() {
    String url = "https://www.v2ex.com/t/116724";
    Connection connect = Jsoup.connect(url);
    try {
        Response response = connect.execute();
        System.out.println(response.body());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:174)
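Note this part of the message: unable to find valid certification path to requested target. Java could not build a valid certificate chain from the server's certificate to a CA it trusts, so the target site is simply not trusted. The local net-cafe admin has no idea what this site is for and figures it might be some kind of adult site... so, to protect the healthy growth of the nation's youth, the request naturally gets refused.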
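To see what "trusted" means here, it can help to list the CA certificates the JVM accepts by default. The following is a minimal sketch using only the JDK. The class name TrustStoreInspector and the method defaultTrustedIssuers are made up for illustration and are not part of jsoup. If no chain from a site's certificate leads back to one of these CAs, you would expect the PKIX error shown above.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper for illustration only; not part of jsoup.
class TrustStoreInspector {

    /** Returns the CA certificates the JVM trusts by default. */
    public static X509Certificate[] defaultTrustedIssuers() {
        try {
            TrustManagerFactory tmf =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            // Passing null asks the factory to load the JVM's default trust store.
            tmf.init((KeyStore) null);
            for (TrustManager tm : tmf.getTrustManagers()) {
                if (tm instanceof X509TrustManager) {
                    return ((X509TrustManager) tm).getAcceptedIssuers();
                }
            }
            return new X509Certificate[0];
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not load the default trust store", e);
        }
    }

    public static void main(String[] args) {
        X509Certificate[] issuers = defaultTrustedIssuers();
        System.out.println("Trusted CA certificates: " + issuers.length);
        for (X509Certificate cert : issuers) {
            System.out.println(cert.getSubjectX500Principal().getName());
        }
    }
}
```

Running main prints the number of trusted CAs followed by their subject names; the exact list depends on the JDK installation.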

public static void iAm20() {
    try {
        HttpsURLConnection.setDefaultHostnameVerifier(new HostnameVerifier() {
            // Called when the requested host does not match the domain in the server's certificate
            // The admin asks: "Are you your own dad?" I say: "Sure am~"
            public boolean verify(String hostname, SSLSession session) {
                return true;
            }
        });

        SSLContext context = SSLContext.getInstance("SSL");
        context.init(null, new X509TrustManager[] { new X509TrustManager() {
            // Validates a client's certificate chain (only relevant when acting as a server)
            // The admin asks whether I'm over 18; I quietly mumble "mm-hm"...
            public void checkClientTrusted(X509Certificate[] chain, String authType)
                    throws CertificateException {
                // I did nothing at all...
            }

            // Validates the certificate chain presented by the server
            // The admin asks whether my dad lets me go online; I say "yep"...
            public void checkServerTrusted(X509Certificate[] chain, String authType)
                    throws CertificateException {
                // I did nothing at all...
            }

            // The admin wants to see my ID, so I hand him a fake one bought from a street stall
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        } }, new SecureRandom());
        HttpsURLConnection.setDefaultSSLSocketFactory(context.getSocketFactory());
    } catch (Exception e) {
        // e.printStackTrace();
    }
}
Connection conn = HttpConnection.connect(url);
conn.timeout(timeout);
conn.header("Accept-Encoding", "gzip,deflate,sdch");
conn.header("Connection", "close");
String yellowNews = conn.execute().body();
Note, though, that the Connection here is org.jsoup.Connection.

/**
 * Initialise Trust manager that does not validate certificate chains and
 * add it to current SSLContext.
 * <p/>
 * please not that this method will only perform action if sslSocketFactory is not yet
 * instantiated.
 *
 * @throws IOException
 */
private static synchronized void initUnSecureTSL() throws IOException {
    if (sslSocketFactory == null) {
        // Create a trust manager that does not validate certificate chains
        final TrustManager[] trustAllCerts = new TrustManager[]{new X509TrustManager() {
            public void checkClientTrusted(final X509Certificate[] chain, final String authType) {
            }

            public void checkServerTrusted(final X509Certificate[] chain, final String authType) {
            }

            public X509Certificate[] getAcceptedIssuers() {
                return null;
            }
        }};

        // Install the all-trusting trust manager
        final SSLContext sslContext;
        try {
            sslContext = SSLContext.getInstance("SSL");
            sslContext.init(null, trustAllCerts, new java.security.SecureRandom());
            // Create an ssl socket factory with our all-trusting manager
            sslSocketFactory = sslContext.getSocketFactory();
        } catch (NoSuchAlgorithmException e) {
            throw new IOException("Can't create unsecure trust manager");
        } catch (KeyManagementException e) {
            throw new IOException("Can't create unsecure trust manager");
        }
    }
}
if (conn instanceof HttpsURLConnection) {
    if (!req.validateTLSCertificates()) {
        initUnSecureTSL();
        ((HttpsURLConnection)conn).setSSLSocketFactory(sslSocketFactory);
        ((HttpsURLConnection)conn).setHostnameVerifier(getInsecureVerifier());
    }
}
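In other words, jsoup only trusts every https site when !req.validateTLSCertificates() is true, that is, when certificate validation has been switched off. Stepping into validateTLSCertificates() shows that it simply returns the validateTSLCertificates member variable of the Request class, and the matching setter is just as simple: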
public void validateTLSCertificates(boolean value) {
    validateTSLCertificates = value;
}
So all we need to do is set validateTSLCertificates to false. Then I found this method in HttpConnection:
public Connection validateTLSCertificates(boolean value) {
    req.validateTLSCertificates(value);
    return this;
}
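Putting these pieces together, the per-connection switch can be sketched with nothing but the JDK. This is a rough illustration of the idea, not jsoup's actual code: the class PerConnectionTrust and its methods open and insecureFactory are hypothetical names, and the boolean parameter plays the role of the validateTSLCertificates flag. Unlike setting the HttpsURLConnection defaults, it only relaxes checking on the one connection it is handed, and only when asked to.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical sketch for illustration only; not part of jsoup.
class PerConnectionTrust {
    private static SSLSocketFactory cachedFactory;

    /** Lazily builds one socket factory whose trust manager accepts any certificate chain. */
    public static synchronized SSLSocketFactory insecureFactory() {
        if (cachedFactory == null) {
            TrustManager[] trustAll = { new X509TrustManager() {
                public void checkClientTrusted(X509Certificate[] chain, String authType) { }
                public void checkServerTrusted(X509Certificate[] chain, String authType) { }
                public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            } };
            try {
                SSLContext context = SSLContext.getInstance("TLS");
                context.init(null, trustAll, new SecureRandom());
                cachedFactory = context.getSocketFactory();
            } catch (GeneralSecurityException e) {
                throw new IllegalStateException("Can't create insecure trust manager", e);
            }
        }
        return cachedFactory;
    }

    /** Opens (but does not yet connect) a connection, relaxing TLS checks only on request. */
    public static URLConnection open(String url, boolean validateCertificates) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            if (conn instanceof HttpsURLConnection && !validateCertificates) {
                HttpsURLConnection https = (HttpsURLConnection) conn;
                // Affects this connection only; the JVM-wide defaults stay untouched.
                https.setSSLSocketFactory(insecureFactory());
                https.setHostnameVerifier((hostname, session) -> true);
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling PerConnectionTrust.open("https://www.v2ex.com/t/116724", false) returns a connection that skips certificate and hostname checks, while passing true leaves the normal validation in place.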

public static void iAmMyDaddy() {
    String url = "https://www.v2ex.com/t/116724";
    Connection connect = HttpConnection.connect(url);
    connect.timeout(3000);
    connect.header("Accept-Encoding", "gzip,deflate,sdch");
    connect.header("Connection", "close");
    connect.validateTLSCertificates(false);
    try {
        // get() sends the request itself, so a separate execute() call is not needed
        //Document parse = connect.post();
        System.out.println(connect.get().html());
    } catch (IOException e) {
        e.printStackTrace();
    }
}

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="zh-CN">
<meta charset="UTF-8">
<meta content="True" name="HandheldFriendly">
<meta name="detectify-verification" content="d0264f228155c7a1f72c3d91c17ce8fb">
<meta name="alexaVerifyID" content="OFc8dmwZo7ttU4UCnDh1rKDtLlY">