Analysis

Open http://www.coobobo.com/free-http-proxy/. The port numbers look wrong at first glance, so, as usual, press Ctrl+Shift+C and pick one:

[Image: Coobobo real-time free HTTP proxy IP scraping (ports rendered as images + document.write)]

Unfortunately, the port numbers are all rendered as images.

That is not a big problem, though: the images are clean and free of noise, so they are easy to recognize.

Next, pick the IP address.

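The IP address is most likely written into the page on the fly by the inline JS in that cell. To be sure, check the raw HTML that was returned: view the page source and locate this IP.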

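It looks like the IP address can only be extracted from this JS snippet. That is not hard: just strip the quotes, plus signs, parentheses, document.write, and whitespace, which a single regular expression can do.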
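To make the stripping step concrete, here is a minimal sketch. The sample script string is made up for illustration (the real page may split the IP into different pieces), and the class name DocumentWriteIpSketch is hypothetical. The pattern also removes a trailing semicolon, which is an extra assumption on my part in case the script ends with one.

```java
class DocumentWriteIpSketch {

    /**
     * Removes document.write, quotes, parentheses, plus signs,
     * semicolons and whitespace, leaving only the IP characters.
     */
    static String extractIp(String script) {
        return script.replaceAll("document\\.write|['\"()+;]|\\s+", "");
    }

    public static void main(String[] args) {
        // Hypothetical script body, not copied from the site.
        String script = "document.write('12' + '3.45'+'.67.89');";
        System.out.println(extractIp(script)); // prints 123.45.67.89
    }
}
```

Note that the dot in document.write is escaped here; unescaped, it would match any character, which is harmless in practice but less precise.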

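Implementation

The port images are more troublesome. I previously wrote a small utility library for this kind of task, which saves some work when recognizing simple characters like these, so it is used here.

The recognition works by first collecting some images and labeling which character each one is; images that come in later are then compared against these labeled ones. So some images have to be collected and labeled first: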

/**
 * Collect the character images that need to be labeled.
 */
public static void grabTrainImage(String basePath) {
    for (int i = 1; i <= 10; i++) {
        System.out.println("page " + i);
        Document document = getDocument(url + i);
        Elements images = document.select("table.table-condensed tbody tr img");
        images.forEach(elt -> {
            String imgLink = host + elt.attr("src");
            byte[] imgBytes = download(imgLink);
            try {
                String outputPath = basePath + System.currentTimeMillis() + ".png";
                BufferedImage img = ImageIO.read(new ByteArrayInputStream(imgBytes));
                // ImageIO.read returns null when the bytes cannot be decoded,
                // e.g. when the download failed and returned an empty array.
                if (img == null) {
                    return;
                }
                ImageIO.write(img, "png", new File(outputPath));
                System.out.println(imgLink);
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
    }
}
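The recognition principle described above (compare a new character image with the labeled ones and take the closest) can be sketched roughly as follows. This is not the code of the utility library used in this post, whose internals are not shown here. The class TemplateMatchSketch, the pixel-count distance, and the glyph helper are all made up for illustration, and the sketch assumes every character image has already been cropped to the same size.

```java
import java.awt.image.BufferedImage;
import java.util.Map;

class TemplateMatchSketch {

    /** Counts differing pixels; images of different sizes never match. */
    static int pixelDistance(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return Integer.MAX_VALUE;
        }
        int diff = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    diff++;
                }
            }
        }
        return diff;
    }

    /** Returns the label of the closest labeled image, or '?' if none fits. */
    static char recognize(BufferedImage glyph, Map<Character, BufferedImage> labeled) {
        char best = '?';
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<Character, BufferedImage> entry : labeled.entrySet()) {
            int distance = pixelDistance(glyph, entry.getValue());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Builds a tiny image from rows of '#' (ink) and '.' (background). */
    static BufferedImage glyph(String... rows) {
        BufferedImage img = new BufferedImage(
                rows[0].length(), rows.length, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < rows.length; y++) {
            for (int x = 0; x < rows[y].length(); x++) {
                img.setRGB(x, y, rows[y].charAt(x) == '#' ? 0x000000 : 0xFFFFFF);
            }
        }
        return img;
    }
}
```

With a labeled map containing a "1" and a "7", a slightly damaged "1" still ends up closest to the "1" sample, which is why clean, noise-free images like the ones on this site are easy to handle this way.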

Download the images locally and generate the character images to be labeled:

public static void main(String[] args) throws IOException {
    String rawImageSaveDir = "E:/test/proxy/kubobo/raw/";
    String distinctCharSaveDir = "E:/test/proxy/kubobo/char/";
    grabTrainImage(rawImageSaveDir);
    ocrUtil.init(rawImageSaveDir, distinctCharSaveDir);
}

Then open E:/test/proxy/kubobo/char/. Every character used in the images downloaded earlier has been split out and placed in this directory.
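How the library splits a port image into single characters is not shown in this post. One common way to do it for clean images like these is to cut at columns that contain no ink, and the following is a rough sketch of that idea only. The class ColumnSplitSketch, the mid-gray ink threshold, and the fromRows helper are assumptions for illustration.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

class ColumnSplitSketch {

    /** A pixel counts as ink when it is darker than mid-gray (assumed threshold). */
    static boolean isInk(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3 < 128;
    }

    static boolean columnHasInk(BufferedImage img, int x) {
        for (int y = 0; y < img.getHeight(); y++) {
            if (isInk(img.getRGB(x, y))) {
                return true;
            }
        }
        return false;
    }

    /** Cuts the image at ink-free columns; returns one sub-image per character. */
    static List<BufferedImage> split(BufferedImage img) {
        List<BufferedImage> chars = new ArrayList<>();
        int start = -1; // left edge of the current character, -1 while in a gap
        for (int x = 0; x <= img.getWidth(); x++) {
            // Treat the position just past the right edge as an empty column
            // so that a character touching the edge is still closed.
            boolean ink = x < img.getWidth() && columnHasInk(img, x);
            if (ink && start < 0) {
                start = x;
            } else if (!ink && start >= 0) {
                chars.add(img.getSubimage(start, 0, x - start, img.getHeight()));
                start = -1;
            }
        }
        return chars;
    }

    /** Builds a small image from rows of '#' (ink) and '.' (background). */
    static BufferedImage fromRows(String... rows) {
        BufferedImage img = new BufferedImage(
                rows[0].length(), rows.length, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < rows.length; y++) {
            for (int x = 0; x < rows[y].length(); x++) {
                img.setRGB(x, y, rows[y].charAt(x) == '#' ? 0x000000 : 0xFFFFFF);
            }
        }
        return img;
    }
}
```

This only works when neighboring characters do not touch; images with overlapping or connected glyphs would need something more elaborate.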

Now rename each file to the character that its image represents.

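Be careful not to mislabel anything; otherwise every later result will be wrong.

Then tell ocrUtil where this directory is so that it knows where to load from: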

ocrUtil.loadDictionaryMap("E:/test/proxy/kubobo/char/");
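What loadDictionaryMap does internally is not shown in this post. As a rough idea of what loading such a labeled directory could look like, here is a sketch that assumes each file is named after its character, such as 7.png. The class LabelDirSketch and that naming convention are assumptions, not the library's documented behavior.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class LabelDirSketch {

    /** Derives the label from a file name, e.g. "7.png" gives "7". */
    static String labelOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    /** Reads every PNG in the directory into a label-to-image map. */
    static Map<String, BufferedImage> load(String dir) throws IOException {
        File[] files = new File(dir).listFiles(
                (parent, name) -> name.toLowerCase().endsWith(".png"));
        if (files == null) {
            throw new IOException("not a readable directory: " + dir);
        }
        Map<String, BufferedImage> dictionary = new HashMap<>();
        for (File file : files) {
            BufferedImage img = ImageIO.read(file);
            if (img != null) { // skip files that cannot be decoded
                dictionary.put(labelOf(file.getName()), img);
            }
        }
        return dictionary;
    }
}
```

Because the file name is the only source of truth for the label in a scheme like this, a single wrongly renamed file silently poisons every later match against it.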

After that it is ready to use: pass an image to ocrUtil.ocr(BufferedImage) and it returns the characters in that image. The complete code is as follows:

package org.cc11001100.t1;

import cc11001100.ocr.OcrUtil;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import static java.util.stream.Collectors.toList;

/**
 * @author CC11001100
 */
public class KuboboProxyGrab {

    private static String host = "http://www.coobobo.com";
    private static String url = "http://www.coobobo.com/free-http-proxy/";
    private static OcrUtil ocrUtil;

    static {
        ocrUtil = new OcrUtil();
        ocrUtil.loadDictionaryMap("E:/test/proxy/kubobo/char/");
    }

    /**
     * Collect the character images that need to be labeled.
     */
    public static void grabTrainImage(String basePath) {
        for (int i = 1; i <= 10; i++) {
            System.out.println("page " + i);
            Document document = getDocument(url + i);
            Elements images = document.select("table.table-condensed tbody tr img");
            images.forEach(elt -> {
                String imgLink = host + elt.attr("src");
                byte[] imgBytes = download(imgLink);
                try {
                    String outputPath = basePath + System.currentTimeMillis() + ".png";
                    BufferedImage img = ImageIO.read(new ByteArrayInputStream(imgBytes));
                    // ImageIO.read returns null when the bytes cannot be decoded,
                    // e.g. when the download failed and returned an empty array.
                    if (img == null) {
                        return;
                    }
                    ImageIO.write(img, "png", new File(outputPath));
                    System.out.println(imgLink);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
    }

    private static Document getDocument(String url) {
        byte[] responseBytes = download(url);
        String html = new String(responseBytes, StandardCharsets.UTF_8);
        return Jsoup.parse(html);
    }

    /**
     * Download the URL, retrying up to 3 times; returns an empty array if every attempt fails.
     */
    private static byte[] download(String url) {
        for (int i = 0; i < 3; i++) {
            try {
                return Jsoup.connect(url).execute().bodyAsBytes();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return new byte[0];
    }

    public static List<String> grabProxyIpList() {
        List<String> resultList = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            System.out.println("page " + i);
            Document document = getDocument(url + i);
            Elements ipElts = document.select("table.table-condensed tbody tr");
            List<String> pageIpList = ipElts.stream().map(elt -> {
                // The IP is assembled by an inline document.write script; strip everything but the IP itself.
                String rawText = elt.select("td:eq(0) script").first().data();
                String ip = rawText.replaceAll("document\\.write|['\"()+]|\\s+", "").trim();

                // The port is rendered as an image and has to be recognized.
                String imgLink = host + elt.select("td:eq(1) img").attr("src");
                byte[] imgBytes = download(imgLink);
                try {
                    BufferedImage img = ImageIO.read(new ByteArrayInputStream(imgBytes));
                    if (img == null) {
                        // The port image could not be downloaded or decoded; skip this row.
                        return "";
                    }
                    String port = ocrUtil.ocr(img);
                    return ip + ":" + port;
                } catch (IOException e) {
                    e.printStackTrace();
                }
                return "";
            }).filter(StringUtils::isNotEmpty).collect(toList());
            resultList.addAll(pageIpList);
        }
        return resultList;
    }

    public static void main(String[] args) throws IOException {
        // One-time preparation: collect the images and generate the character images to label.
//        String rawImageSaveDir = "E:/test/proxy/kubobo/raw/";
//        String distinctCharSaveDir = "E:/test/proxy/kubobo/char/";
//        grabTrainImage(rawImageSaveDir);
//        ocrUtil.init(rawImageSaveDir, distinctCharSaveDir);

        grabProxyIpList().forEach(System.out::println);
    }

}
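Since a mislabeled character or a changed page layout would quietly produce garbage, it may be worth sanity-checking each ip:port string before using it. The following is only a sketch of such a check, not part of the code above; the class name ProxyFormatCheck and the choice to accept only dotted IPv4 with a port from 1 to 65535 are my assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProxyFormatCheck {

    private static final Pattern IPV4_PORT = Pattern.compile(
            "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3}):(\\d{1,5})");

    /** Returns true only for strings shaped like 1.2.3.4:8080 with values in range. */
    static boolean isValid(String proxy) {
        Matcher m = IPV4_PORT.matcher(proxy);
        if (!m.matches()) {
            return false;
        }
        for (int i = 1; i <= 4; i++) {
            if (Integer.parseInt(m.group(i)) > 255) {
                return false; // each IPv4 octet must be 0..255
            }
        }
        int port = Integer.parseInt(m.group(5));
        return port >= 1 && port <= 65535;
    }

    public static void main(String[] args) {
        System.out.println(isValid("123.45.67.89:8080")); // prints true
        System.out.println(isValid("123.45.67.89:"));     // prints false
    }
}
```

A check like this could be applied as one more filter on the list returned by grabProxyIpList().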
