java识别图片中文字技术

时间:2022-08-09 16:13:35
java文字识别程序的关键是寻找一个可以调用的OCR引擎。tesseract-ocr就是一个这样的OCR引擎,在1985年到1995年由HP实验室开发,现在在Google。tesseract-ocr 3.0发布,支持中文。不过tesseract-ocr 3.0不是图形化界面的客户端,别人写的FreeOCR图形化客户端还不支持导入新的 3.0 traineddata。但这标志着,现在有*的中文OCR软件了。

    java中使用tesseract-ocr3.01的步骤如下:

1.下载安装tesseract-ocr-setup-3.01-1.exe(3.0以上版本才增加了中文识别)

2.在安装向导中可以选择需要下载的语言包。

3.到网上搜索下载java图形处理所需的2个包:jai_imageio-1.1-alpha.jar,swingx-1.6.1.jar

4.java程序清单:

ImageIOHelper 类:

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import java.util.Locale;

import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;

import com.sun.media.imageio.plugins.tiff.TIFFImageWriteParam;

public class ImageIOHelper {  
      
    public static File createImage(File imageFile, String imageFormat) {  
        File tempFile = null;  
        try {  
            Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(imageFormat);  
            ImageReader reader = readers.next();  
          
            ImageInputStream iis = ImageIO.createImageInputStream(imageFile);  
            reader.setInput(iis);  
            //Read the stream metadata  
            IIOMetadata streamMetadata = reader.getStreamMetadata();  
              
            //Set up the writeParam  
            TIFFImageWriteParam tiffWriteParam = new TIFFImageWriteParam(Locale.CHINESE);  
            tiffWriteParam.setCompressionMode(ImageWriteParam.MODE_DISABLED);  
              
            //Get tif writer and set output to file  
            Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");  
            ImageWriter writer = writers.next();