测试目的:
java从mysql读取数据并打印
测试环境1
a. hadoop集群某节点的环境是utf8,java代码也是utf8编码
b. 需要读取的mysql服,数据库、数据表均是latin1
运行mysql -u* -p* -A -h 进入mysql服
(一)查看mysql编码的基本方法
先在mysql下确认原始数据是什么编码,经过下面三步可以确认原始数据是latin1
进入mysql > use db;
mysql> show variables like 'character%';+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | gbk |
| character_set_connection | gbk |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | gbk |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
// set names的作用是改变三个红色部分的三个参数
Query OK, 0 rows affected (0.00 sec)
mysql> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
Set names 'utf8'等价于下面三条语句
SET character_set_client = utf8 用来设置客户端送给MySQL服务器的数据的 字符集
SET character_set_results = utf8 服务器返回查询结果时使用的字符集
SET character_set_connection = utf8 MySQL 服务器 把客户端传来的数据,从character_set_client字符集转换成character_set_connection字符集
mysql的信息流向及编码转换
接收:client --> connection --> database
反馈:database--->connection-->results
mysql> SELECT LOWER(consumption_name),consumption FROM dimen_table;
+-------------------------+-------------+
| LOWER(consumption_name) | consumption |
+-------------------------+-------------+
| gm命令 | -1019 |
| 神器解锁 | -1018 |
| 购买经验药水 | -1017 |
| 购买时装 | -1016 |
| 运营活动(领物品) | -1015 |
| 神器精炼(棍) | -1014 |
| 神器精炼(拳) | -1013 |
| 商店购买 | -1012 |
(二) java下的代码测试
测试1: java连接代码
jdbc:mysql://106.2.67.10/sdc_hdfs?useUnicode=true&characterEncoding=UTF-8
Statement statement = con.createStatement();
statement.execute("set names 'utf8'");//与参数characterEncoding=UTF-8的作用类似
sql = "SELECT LOWER(consumption_name),consumption FROM dimen_table"
ResultSet rs = statement.executeQuery(sql);
while (rs.next()) {
System.out.println(new String(rs.getBytes(i+1),"cp1252")); // latin1 对应的解码用ISO-8859-1或者Cp1252,我测试的时候发现应该用cp1252
}
// 查看mysql支持的字符集及其描述
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
【参考】
http://*.com/questions/21689665/mysql-latin1-to-utf-8-using-java-hibernate-jpa :
提到latin1应该用cp1252解码:
MySQL's version of latin1 is an extended version of CP1252: it uses the 5 bytes that CP1252 leaves undefined. Unfortunately the current Connector/J has a "bug" in that it uses the original CP1252 rather than MySQL's own version. Therefore it's impossible to recover strings whose encoding uses one of these 5 bytes. Patching the Connector/J source to fix the bug could solve the problem, but ideally you should migrate the tables to UTF-8.
测试2,改sql,将编码转换交给mysql来处理
url置为jdbc:mysql://123.*.*.108/db
sql置成SELECT CONVERT(CONVERT(CONVERT(LOWER(consumption_name) USING latin1) USING binary) USING utf8),consumption FROM dimen_table
执行sql前,先执行statement.execute("set names 'utf8'");
在ResultSet中,rs.getString()获得的就是utf-8编码
在windows下显示正常。
在linux下显示有问题。
gm命令 -1019
神器解锁 -1018
购买经验药水 -1017
购买时装 -1016
运营活动(领物-1015
神器精炼(-1014
神器精炼(-1013
商店购买 -1012
(说明:红色部分不能正常解析)
这个是linux环境的编码环境导致的:解决方法
1) 使用locale查看file.encoding这个系统变量,如果不是utf8可以运行这个命令 export LANG=zh_CN.utf8
或者这样LANG=zh_CN.utf8 java -Djava.ext.dir=/sdfls/asdlfjal/test Main执行之后
2) 或者java -Dfile.encoding=utf8 MainClass
测试3:修改测试环境
a. hadoop集群某节点的环境是utf8,java代码也是utf8编码
b. 将上面的latin1数据库的编码改为utf8(修改生效的判定:新建的表默认是utf8.修改之前默认的编码是latin1)
步骤及结果
url置为jdbc:mysql://123.*.*.108/db?CharSet=utf8&useUnicode=true&characterEncoding=utf8
sql还是普通写法:SELECT LOWER(consumption_name),consumption FROM info.dimen_table
在执行上面这个sql之前,查看字符集
System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'"));// 初始连接进去之后查看 字符集 /** 结果5 character_set_client utf8 character_set_connection utf8 character_set_database utf8 character_set_filesystem binary character_set_results <span style="color:#FF0000;">character_set_server latin1</span> character_set_system utf8 character_sets_dir /usr/share/mysql/charsets/ */ statement.execute("set names 'utf8'"); //修改连接方式 //statement.execute("set character_set_server='utf8'"); System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'"));// 修改之后,查看连接使用的字符集 /** 结果6 character_set_client utf8 character_set_connection utf8 character_set_database utf8 character_set_filesystem binary character_set_results <span style="color:#FF0000;">character_set_server utf8</span> character_set_system utf8 character_sets_dir /usr/share/mysql/charsets/ */
在ResultSet中,用utf8解释:new String(rs.getString(i).getBytes("UTF-8"));// 右边这个写法完全是乱码 new String(rs.getString(i).getBytes(),"UTF-8")
跟测试2的最终输出结果是一样的。
(对于纯粹的utf8环境:服务器编码是utf8;mysql的默认编码也是utf8,进入mysql之后查看字符集,会看到字符集展示就是结果6,即默认的character_set_server=utf8)
感觉所有问题的根源在于默认的character_set_server=latin1
http://*.com/questions/27866533/whacky-latin1-to-utf8-conversion-in-jdbc :提到jdbc对不识别的latin1的编码字符插入了特殊的替换字符
JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters
【参考】使用Java读写存储在latin1编码的MySQL中的UTF-8编码的中文
Character set bug at server with utf8 column and latin1 connection
Description: // This bug is reproduced using a MySQL Linux default installation where "character_set_server" is "latin1"‘
其他相关资料及描述
貌似最好把所有字符集设置成utf8的方法,解决起来最彻底;并且这个问题只出现在jdbc中,使用python操作的时候,读取都是正常的
conn10 = MySQLdb.connect(host=db10, user=mdUser, passwd=mdPasswd, db="sdc_hdfs" ) cursor10 = conn10.cursor() cursor10.execute("set @@autocommit=1") cursor10.execute("SHOW VARIABLES LIKE 'character_set_database'") data=cursor10.fetchone() if data[1]=='utf8': cursor10.close() cursor10 = MySQLdb.connect(db10, mdUser, mdPasswd, 'sdc_hdfs', charset='utf8', use_unicode=False).cursor() #使用utf8来连接 cursor10.execute("set @@autocommit=1")
Java中String解码、编码
Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or String(byte[]), you should rather use the overloaded method/constructor which takes the character encoding:
byte[] bytesInDefaultEncoding = someString.getBytes(); // May generate corrupt bytes.
byte[] bytesInUTF8 = someString.getBytes("UTF-8"); // Correct.
String stringUsingDefaultEncoding = new String(bytesInUTF8); // Unknown bytes becomes "?".
String stringUsingUTF8 = new String(bytesInUTF8, "UTF-8"); // Correct.
Otherwise the platform default encoding will be used, which can be the one of the underlying operating system or the IDE(!).
Unicode - How to get the characters right?
最后发现:
1. 不同jdbc对编码的支持情况不一样
2. 测试2,其实能解决问题,也就是set names 'utf8' + 在SQL里边使用convert()来转码是万能的: 能绕开不同jdbc带来的问题。