Handling the BOM header of UTF-8 files in Java

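What is UTF?

UTF stands for Unicode Transformation Format: a rule for turning the numbers (code points) that Unicode defines into data a program can store. In other words, a UTF is an encoding form of Unicode. Every string inside the JVM is Unicode, without exception: a String is held as a sequence of UTF-16 code units. Because there is only one in-memory representation, we can treat a JVM String as carrying no encoding of its own. Conceptually, a String behaves like a char[].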

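A byte[] in the JVM, by contrast, does carry an encoding, such as Big5, GBK, GB2312 or UTF-8. Of these, only UTF-8 is a UTF; GBK, for example, is not.

Converting a GBK-encoded byte[] to a String is really a conversion from GBK to Unicode.

Converting a String to a Big5-encoded byte[] is really a conversion from Unicode to Big5. So when parsing bytes, we need to check whether they are actually UTF-encoded.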
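The two conversions described above can be sketched as follows. This is a minimal illustration, not code from the original article: the class and method names are made up for the example, and it assumes the running JDK ships the GBK and Big5 charsets (a full JDK does; a trimmed runtime image may not).

```java
import java.nio.charset.Charset;

class CharsetRoundTrip {
    // Decode GBK bytes into a String: GBK -> Unicode (UTF-16 in memory).
    static String decodeGbk(byte[] gbkBytes) {
        return new String(gbkBytes, Charset.forName("GBK"));
    }

    // Encode a String into bytes of the named charset: Unicode -> that charset.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0};      // U+4E2D encoded in GBK
        String text = decodeGbk(gbk);                 // now an encoding-free String
        byte[] big5 = encode(text, "Big5");           // same character, Big5 bytes
        byte[] utf8 = encode(text, "UTF-8");          // same character, UTF-8 bytes
        System.out.println(text.length() + " char, "
                + big5.length + " Big5 bytes, " + utf8.length + " UTF-8 bytes");
    }
}
```

The String in the middle is the same regardless of which byte encoding it came from or goes to; only the byte[] forms differ.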

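How many UTFs are there?

Here char, char16_t and char32_t (type names borrowed from C++) are used to denote unsigned 8-bit, 16-bit and 32-bit integers. The three common forms, UTF-8, UTF-16 and UTF-32, use char, char16_t and char32_t respectively as their code unit.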
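To make the idea of a code unit concrete in Java, the sketch below counts how many units the same text needs in each form. The class and method names are illustrative only. It relies on the fact that String.length() counts UTF-16 code units, and that one UTF-32 code unit corresponds to one code point.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitCount {
    // UTF-8 code units are bytes.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // A Java String is a sequence of UTF-16 code units.
    static int utf16Units(String s) {
        return s.length();
    }

    // One UTF-32 code unit per Unicode code point.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'A', U+4E2D (a CJK character), U+1F600 (written as a surrogate pair)
        String s = "A\u4e2d\uD83D\uDE00";
        System.out.println(utf8Units(s));   // 8  (1 + 3 + 4 bytes)
        System.out.println(utf16Units(s));  // 4  (1 + 1 + 2 units)
        System.out.println(utf32Units(s));  // 3  (one per code point)
    }
}
```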

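What is a BOM?

A BOM (byte order mark) is the code point U+FEFF placed at the very start of a file to indicate which Unicode encoding the file uses. In UTF-8 it appears as the three bytes EF BB BF.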
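Because each encoding form writes U+FEFF as a different byte sequence, the leading bytes of a file can be used to guess its encoding. The sketch below checks the standard BOM signatures. The class name and the returned labels are choices made for this example, not part of any library.

```java
class BomSniffer {
    // Returns the encoding implied by a leading BOM, or null if there is none.
    // The 4-byte UTF-32 signatures are tested before the 2-byte UTF-16 ones,
    // because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    static String detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }
}
```

A null result only means no BOM was found. It says nothing about the actual encoding, since a BOM is optional.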

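What problems does a BOM cause?

Files saved by Windows Notepad may carry a BOM. Java's UTF-8 decoder does not strip it, so when such a file is parsed, one extra garbage character (U+FEFF) shows up at the very beginning of the text.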
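The symptom is easy to reproduce without a file. In the sketch below (class and method names are illustrative), the bytes of a BOM-prefixed UTF-8 text are decoded with the JDK's UTF-8 charset, and the result has one more character than expected.

```java
import java.nio.charset.StandardCharsets;

class BomSymptom {
    // Plain UTF-8 decoding: the JDK keeps a leading BOM as the character U+FEFF.
    static String decode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Drop a single leading U+FEFF if present.
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String raw = decode(data);
        System.out.println(raw.length());            // 3, not 2
        System.out.println(stripBom(raw).length());  // 2
    }
}
```

This is why comparisons such as checking whether the first line equals an expected header fail on BOM-prefixed files even though the text looks identical when printed.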

How to fix it:

http://akini.mbnet.fi/java/unicodereader/

In code, strip the leading BOM according to the actual encoding of the file.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// UnicodeReader is the class from the page linked above; StringFilter is this project's own type.
public static String ReadFile(String path, StringFilter filter) throws IOException {
    File file = new File(path);
    if (!file.exists()) {
        throw new IOException("File does not exist");
    }
    StringBuilder laststr = new StringBuilder();
    // try-with-resources closes the reader and the stream even if reading fails.
    try (InputStream in = new FileInputStream(file);
         BufferedReader reader = new BufferedReader(new UnicodeReader(in, "utf-8"))) {
        String tempString;
        while ((tempString = reader.readLine()) != null) {
            if (filter != null) {
                tempString = filter.RemoveString(tempString);
            }
            // readLine() drops line terminators, so lines are joined without separators.
            laststr.append(tempString);
        }
    } catch (IOException e) {
        // Keep the original exception as the cause.
        throw new IOException("Error reading file", e);
    }
    return laststr.toString();
}
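If pulling in UnicodeReader is not an option, the same idea can be sketched with JDK classes only. This is a simplified alternative, not the UnicodeReader implementation: it handles only the UTF-8 BOM, and the class and method names are invented for the example. It peeks at the first three bytes through a PushbackInputStream and pushes them back when they are not a BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8BomSkipper {
    // Wraps the stream and consumes a leading EF BB BF if present.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {                       // read() may return fewer bytes than asked
            int r = pin.read(head, n, 3 - n);
            if (r < 0) break;                 // stream shorter than 3 bytes
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pin.unread(head, 0, n);           // not a BOM: give the bytes back
        }
        return pin;
    }

    // Convenience helper for in-memory data; decodes as UTF-8 after skipping the BOM.
    static String decodeWithoutBom(byte[] data) {
        try (Reader reader = new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In ReadFile above, the stream returned by skipUtf8Bom could be passed to an InputStreamReader with the UTF-8 charset in place of the UnicodeReader, at the cost of losing detection of UTF-16 and UTF-32 BOMs.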
