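I. Calling tesseract via CMD

The cmd approach runs tesseract by invoking the command line from Java. The principle is simple. As the previous post showed, tesseract takes an image path and an output base path and writes the recognized text to a .txt file at that base path, so the Java code only has to run the command and then read that file back: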

public String getCaptureText() {
    String result = "";
    String imgPath = "d:\\Image\\test.jpg";
    try {
        // Output base path: the image path without its extension (tesseract appends ".txt")
        String outPath = imgPath.substring(0, imgPath.lastIndexOf("."));
        Runtime runtime = Runtime.getRuntime();
        // ocrCommand and ocrLangData are fields not shown here
        // (presumably the tesseract executable and the language option)
        String command = ocrCommand + " " + imgPath + " " + outPath + " " + ocrLangData;
        Process ps = runtime.exec(command);
        ps.waitFor();
        // Read the result file; try-with-resources closes the reader
        File file = new File(outPath + ".txt");
        StringBuilder sb = new StringBuilder();
        try (BufferedReader bufReader = new BufferedReader(new FileReader(file))) {
            String temp;
            while ((temp = bufReader.readLine()) != null) {
                sb.append(temp);
            }
        }
        // Recognized text, with spaces removed
        result = sb.toString();
        if (StringUtils.isNotBlank(result)) {
            result = result.replaceAll(" ", "");
        }
    } catch (Exception e) {
        logger.error("Captcha recognition failed, Exception:{}", e.getMessage());
        e.printStackTrace();
    }
    return result;
}
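The command-line flow above can also be sketched with ProcessBuilder, which passes arguments as a list (so paths containing spaces are not split) and makes it easy to check the exit code. This is only a minimal sketch: the class and method names (TesseractCli, buildCommand, outputBaseFor, recognize) are hypothetical, and it assumes the tesseract executable is installed and that the language is selected with tesseract's `-l` option.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
final class TesseractCli {

    /** Builds the argument list: exe, image, output base, then "-l lang" if a language is given. */
    static List<String> buildCommand(String exe, String image, String outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        cmd.add(image);
        cmd.add(outputBase);
        if (lang != null && !lang.isEmpty()) {
            cmd.add("-l");
            cmd.add(lang);
        }
        return cmd;
    }

    /** Strips the file extension so that tesseract can append ".txt" itself. */
    static String outputBaseFor(String imagePath) {
        int dot = imagePath.lastIndexOf('.');
        int sep = Math.max(imagePath.lastIndexOf('/'), imagePath.lastIndexOf('\\'));
        return dot > sep ? imagePath.substring(0, dot) : imagePath;
    }

    /** Runs tesseract and returns the text it wrote to the result file. */
    static String recognize(String exe, String imagePath, String lang)
            throws IOException, InterruptedException {
        String outBase = outputBaseFor(imagePath);
        ProcessBuilder pb = new ProcessBuilder(buildCommand(exe, imagePath, outBase, lang));
        // Forward tesseract's console output to this JVM so the child never blocks on a full pipe
        Process ps = pb.inheritIO().start();
        int exit = ps.waitFor();
        if (exit != 0) {
            throw new IOException("tesseract exited with code " + exit);
        }
        Path txt = Paths.get(outBase + ".txt");
        return new String(Files.readAllBytes(txt), StandardCharsets.UTF_8);
    }
}
```

A call would look like `TesseractCli.recognize("tesseract", "d:\\Image\\test.jpg", "chi_sim")`, assuming `tesseract` is on the PATH.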

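II. The tess4j approach

The official description is "A Java JNA wrapper for Tesseract OCR API." In other words, tess4j is a Java API that wraps tesseract, and the Java code drives tesseract through it.

2.1 Dependencies

tess4j depends on jna, and newer versions of tess4j are not compatible with the default com.sun.jna 3.0.6. So first add the jna dependency whose groupId is net.java.dev.jna, then add the tess4j dependency with the default jna excluded, as follows: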

<dependency>
    <groupId>net.java.dev.jna</groupId>
    <artifactId>jna</artifactId>
    <version>4.1.0</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>2.0.1</version>
    <exclusions>
        <exclusion>
            <groupId>com.sun.jna</groupId>
            <artifactId>jna</artifactId>
        </exclusion>
    </exclusions>
</dependency>

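These are the only dependencies. There is no need to install tesseract-ocr, because the newer tess4j jar bundles it (the bundled native libraries are, as far as I know, the Windows ones; on other platforms the native tesseract library may still need to be installed).

2.2 Language packs

Choose a directory and create a language-pack folder in it named tessdata. Then copy the language packs you need into the tessdata directory: chi_sim.traineddata, eng.traineddata, and so on.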
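A missing language file usually only shows up as a failure at recognition time, so it can help to check the tessdata directory up front. The following is a minimal sketch, assuming the `<lang>.traineddata` naming described above and the `+` separator that tesseract accepts for combining languages (for example `chi_sim+eng`). The class and method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tess4j.
final class TessdataCheck {

    /**
     * Returns the languages whose "<lang>.traineddata" file is missing
     * from the given tessdata directory. An empty list means all are present.
     */
    static List<String> missingLanguages(Path tessdataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String lang : languages.split("\\+")) {
            if (lang.isEmpty()) {
                continue;
            }
            if (!Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"))) {
                missing.add(lang);
            }
        }
        return missing;
    }
}
```

For example, `TessdataCheck.missingLanguages(Paths.get("D:\\tessdata"), "chi_sim+eng")` (a hypothetical directory) lists whichever of the two packs has not been copied in yet.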

2.3 Calling it from code

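// Needs java.io.File plus ITesseract, Tesseract and TesseractException from net.sourceforge.tess4j
public static void main(String[] args) {
    File imageFile = new File("D:\\ImageCapture\\test.jpg");
    ITesseract instance = new Tesseract();
    // Directory that contains the *.traineddata language packs
    instance.setDatapath("D:\\Program Files(x86)\\tessdata");
    // The default is English (letters and digits); to recognize Chinese (digits + Chinese), specify the language pack
    instance.setLanguage("chi_sim");
    try {
        String result = instance.doOCR(imageFile);
        System.out.println(result);
    } catch (TesseractException e) {
        System.out.println(e.getMessage());
    }
}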

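In this demo the language packs live at an absolute path that has nothing to do with the project. Because they sit outside the project, any change of server also means changing this path, and that is easy to forget. It is better to put the language packs into the project's resource files; the following example is changed to use the project's resources.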

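// Needs java.io.File, java.net.URL plus ITesseract, Tesseract and TesseractException from net.sourceforge.tess4j
public static void main(String[] args) {
    File imageFile = new File("D:\\ImageCapture\\test.jpg");
    ITesseract instance = new Tesseract();
    // Locate the tessdata folder on the classpath
    URL url = ClassLoader.getSystemResource("tessdata");
    // substring(1) drops the leading "/" of a Windows path such as "/D:/..."; this is Windows-specific
    String path = url.getPath().substring(1);
    instance.setDatapath(path);
    // The default is English (letters and digits); to recognize Chinese (digits + Chinese), specify the language pack
    instance.setLanguage("chi_sim");
    try {
        String result = instance.doOCR(imageFile);
        System.out.println(result);
    } catch (TesseractException e) {
        System.out.println(e.getMessage());
    }
}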
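One caveat with `url.getPath().substring(1)`: it only suits Windows paths, and `URL.getPath()` leaves characters such as spaces percent-encoded (`%20`). A more portable way to turn the resource URL into a directory path is sketched below. The class and method names are hypothetical, and the sketch assumes the resources are unpacked on disk (a `file:` URL) rather than read from inside a jar.

```java
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of tess4j.
final class TessdataLocator {

    /** Converts a file: URL of the tessdata resource folder into a plain filesystem path. */
    static String toDataPath(URL url) {
        if (url == null) {
            throw new IllegalStateException("tessdata was not found on the classpath");
        }
        if (!"file".equals(url.getProtocol())) {
            // e.g. a jar: URL; the native library needs a real directory on disk
            throw new IllegalStateException("tessdata must be a plain directory, got: " + url);
        }
        try {
            // Going through a URI decodes %20 and handles drive letters on Windows
            Path dir = Paths.get(url.toURI());
            return dir.toString();
        } catch (URISyntaxException e) {
            throw new IllegalStateException("Bad resource URL: " + url, e);
        }
    }
}
```

It would be used as `instance.setDatapath(TessdataLocator.toDataPath(ClassLoader.getSystemResource("tessdata")));`.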

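III. Summary

1. The cmd approach:

It requires the tesseract software to be installed; the program operates tesseract by invoking the command line.

Drawback: tesseract has to be installed on every server where it is used.

The result files produced by each recognition all stay on the server and need to be cleaned up periodically.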
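One way to keep result files from piling up is to delete each one as soon as it has been read, instead of relying on a periodic cleanup. A minimal sketch, assuming the result file is UTF-8 text as tesseract writes it (the class and method names are hypothetical):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original post.
final class OcrResultFiles {

    /** Reads the whole result file, then deletes it even if reading fails. */
    static String readAndDelete(Path resultFile) throws IOException {
        try {
            return new String(Files.readAllBytes(resultFile), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(resultFile);
        }
    }
}
```

In the cmd example this would replace the manual read loop, for instance `OcrResultFiles.readAndDelete(Paths.get(outPath + ".txt"))`.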

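2. The tess4j approach:

There is no need to install the tesseract software; adding the JAR as a dependency is enough. tess4j also ships utility classes for image processing, covering basic operations such as scaling and rotation.

Personally I recommend the tess4j approach. During testing, some images that were not recognized through cmd were recognized correctly through tess4j.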
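To show what the scaling step mentioned above amounts to, here is a minimal sketch written with the plain JDK rather than tess4j's own utility classes, whose exact API is not shown in this post. The class and method names are hypothetical, and whether enlarging an image actually helps recognition depends on the image.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper built on java.awt only; tess4j's bundled helpers are not used here.
final class ImageScaling {

    /** Returns a copy of src scaled by the given factor (e.g. 2.0 doubles width and height). */
    static BufferedImage scale(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Bicubic interpolation keeps character edges smoother than the default
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

Since `doOCR` also accepts a `BufferedImage`, the scaled image can be passed straight in, for example `instance.doOCR(ImageScaling.scale(ImageIO.read(imageFile), 2.0))`.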
