Making Heritrix store files in UTF-8 encoding

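I've recently been studying search engines and wanted to build an intranet search engine with Heritrix + Solr. Heritrix crawls web pages and saves them to a local repository; Solr builds an index on top of that repository and then serves searches. While integrating the two, I found that Solr only reads files correctly when they are encoded as UTF-8, otherwise the text comes out garbled, whereas Heritrix was saving files in ANSI encoding. So Heritrix has to be modified to save files as UTF-8. My fundamentals are weak and reading the source was very hard going; it took me a whole day to figure this out.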
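To make the garbling concrete, here is a minimal sketch of what happens when bytes written in a legacy single-byte encoding are read as UTF-8. The class name `MojibakeDemo` and the choice of ISO-8859-1 as the "legacy" encoding are assumptions for illustration only; on a Chinese Windows machine "ANSI" would more likely mean GBK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Heritrix or Solr.
class MojibakeDemo {
    // Encode text in a legacy charset, then (wrongly) decode it as UTF-8.
    static String readLegacyBytesAsUtf8(String text) {
        byte[] legacyBytes = text.getBytes(StandardCharsets.ISO_8859_1);
        // Bytes that are not valid UTF-8 are replaced with U+FFFD.
        return new String(legacyBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = readLegacyBytesAsUtf8("caf\u00e9");
        // The accented character is lost because 0xE9 alone is not valid UTF-8.
        System.out.println(garbled.contains("\uFFFD"));
    }
}
```

The same kind of loss is what shows up as garbled text when an indexer that expects UTF-8 reads a file saved in another encoding.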

Modify the writeToPath method in org.archive.crawler.writer.MirrorWriterProcessor. The original source is:

private void writeToPath(RecordingInputStream recis, File dest)
        throws IOException {
        ReplayInputStream replayis = recis.getContentReplayInputStream();
        File tf = new File (dest.getPath() + "N");
        FileOutputStream fos = new FileOutputStream(tf);
        try {
            replayis.readFullyTo(fos);
        } finally {
            fos.close();
            replayis.close();
        }
        if (!tf.renameTo(dest)) {
            throw new IOException("Can not rename " + tf.getAbsolutePath()
                                  + " to " + dest.getAbsolutePath());
        }

    }

After the change:

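    // Also needs imports for java.io.InputStreamReader, java.io.OutputStreamWriter,
    // java.io.Reader and java.io.Writer.
    private void writeToPathToUtf8(RecordingInputStream recis, File dest)
        throws IOException {
        ReplayInputStream replayis = recis.getContentReplayInputStream();
        // Decode through a Reader so that a multi-byte character split across
        // two buffer reads is not corrupted. Like new String(b, 0, n), this
        // decodes with the platform default charset.
        Reader in = new InputStreamReader(replayis);
        Writer out = new OutputStreamWriter(
            new FileOutputStream(dest.getPath()), "UTF-8");
        try {
            char[] buf = new char[4096];
            for (int n; (n = in.read(buf)) != -1;) {
                out.write(buf, 0, n);
            }
            out.flush();
        } finally {
            // Close both streams even if the copy fails.
            try {
                out.close();
            } finally {
                replayis.close();
            }
        }
    }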
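The same idea can be tried outside Heritrix. The following is a minimal sketch, not Heritrix code: the class `Utf8Transcoder` and its `transcode` method are hypothetical names, and it assumes you know the source encoding of the crawled bytes and pass it in explicitly rather than relying on the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class Utf8Transcoder {
    // Decode bytes from `in` using `sourceCharset` and re-encode them as UTF-8 into `out`.
    static void transcode(InputStream in, Charset sourceCharset, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, sourceCharset);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        // Flush so all encoded bytes reach `out`; the caller owns and closes the streams.
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // ISO-8859-1 stands in for the legacy encoding here; for pages saved
        // on a Chinese Windows system it would more likely be GBK.
        byte[] legacy = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(legacy), StandardCharsets.ISO_8859_1, utf8);
        System.out.println(new String(utf8.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly avoids depending on the default encoding of the machine that runs the crawler, which may differ from the encoding of the pages being crawled.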
