For background on character encodings, see http://my.oschina.net/chape/blog/201725
Here is my brief summary of it.
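ISO8859-1 (commonly called Latin-1)
A single-byte encoding that can represent at most 256 characters (values 0-255). It is used for English and other Western European languages and cannot represent Chinese. For example, the letter a is encoded as 0x61 = 97.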
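GB2312/GBK
The Chinese national standard encodings, designed for Chinese characters. They are variable-length: ASCII characters such as English letters take one byte with the same values as in ISO8859-1 (strictly speaking the compatibility is with ASCII, 0x00-0x7F, not with the upper half of ISO8859-1), and Chinese characters take two bytes. GBK can represent both traditional and simplified characters, whereas GB2312 covers only simplified ones, and GBK is backward compatible with GB2312. In short, GB2312 is compatible with ASCII, and GBK is compatible with both GB2312 and ASCII.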
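The variable-length behaviour is easy to check from Java. The following is a minimal sketch; the class and method names are made up for illustration, and it assumes the JDK in use ships the GBK and GB2312 charsets (standard desktop and server builds do). U+9F8D is the traditional form of "dragon", chosen here as a character I would expect GBK to cover and GB2312 not to.

```java
import java.nio.charset.Charset;

class GbkLengthDemo {
    // Number of bytes the text occupies in the named charset.
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    // Whether the named charset has a code for the character.
    static boolean canEncode(char c, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println(byteLength("a", "GBK"));        // ASCII letter: one byte
        System.out.println(byteLength("\u4e2d", "GBK"));   // U+4E2D, a Chinese character: two bytes
        System.out.println(byteLength("a\u4e2d", "GBK"));  // mixed text: 1 + 2 bytes
        // Compare coverage of a traditional character in the two charsets.
        System.out.println(canEncode('\u9f8d', "GB2312"));
        System.out.println(canEncode('\u9f8d', "GBK"));
    }
}
```

Because the two lengths differ, the number of characters in GBK text cannot be derived from its byte count alone.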
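Unicode
This is the most unified scheme and can represent the characters of every language. Strictly speaking, Unicode is a character set that assigns each character a code point; the "Unicode encoding" meant here is UCS-2/UTF-16, which uses two bytes for most characters, Chinese characters and letters alike, and four bytes for characters outside the Basic Multilingual Plane. Its bytes are not compatible with any of the encodings above, and it wastes space on mostly-English text, which makes it less suited to transmission and storage. Compared with ISO8859-1, it simply adds a zero byte in front: the letter a becomes "00 61".
A (mostly) fixed-length encoding is convenient for programs to process (note that GB2312/GBK are not fixed-length), and Unicode can represent every character, so a lot of software uses it internally. Java, for example, uses UTF-16 for char and String.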
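To see the "extra zero byte" and the four-byte case concretely, here is a small sketch using Java's big-endian UTF-16 charset. The class name and the hex helper are illustrative only; U+1F600 is just an arbitrary character outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Space-separated lowercase hex of the text's UTF-16 (big-endian) bytes.
    static String utf16Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_16BE)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex("a"));        // 00 61
        System.out.println(utf16Hex("\u4e2d"));   // 4e 2d
        // A supplementary character needs a surrogate pair: four bytes.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(utf16Hex(supplementary));   // d8 3d de 00
        System.out.println(supplementary.length());    // 2 (UTF-16 code units, not characters)
    }
}
```

The last line is why String.length() in Java counts UTF-16 code units rather than user-visible characters.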
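UTF-8
To address these shortcomings of Unicode, the UTF encodings were created. UTF-8, the one discussed here, is compatible with ASCII (the lower half of ISO8859-1) and can also represent the characters of every language.
UTF-8 is a variable-length encoding. The original design allowed 1 to 6 bytes per character; the current standard (RFC 3629) limits it to 1 to 4. Its byte patterns are self-describing, which gives it a simple built-in validity check. English letters generally take one byte and Chinese characters three.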
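The byte counts can be confirmed with a short sketch. The class and method names are illustrative; the four sample code points are my own choice, picked to hit each UTF-8 length from one to four bytes.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of UTF-8 bytes needed for a single Unicode code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // ASCII letter, accented Latin letter, Chinese character, supplementary character.
        int[] samples = {0x61, 0xE9, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
        // U+00E9 is a single byte in ISO-8859-1 but two bytes in UTF-8,
        // so UTF-8 matches ISO-8859-1 only in the ASCII range.
        System.out.println("\u00e9".getBytes(StandardCharsets.ISO_8859_1).length);
    }
}
```

The U+00E9 case shows why "compatible with ISO8859-1" really means "compatible with ASCII".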
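When choosing an encoding: if you know the text is almost entirely Chinese, GB2312/GBK is the most compact choice (two bytes per Chinese character versus three in UTF-8); if you know it contains no Chinese, only Western European text, ISO8859-1 is an option.
Web pages, however, usually mix Chinese and English. GB2312/GBK are national standards that are common in China, while UTF-8 is supported internationally. If a site uses GB2312/GBK and a visitor's browser outside China does not detect or convert the encoding, the page shows up as garbled text. So I recommend UTF-8 for web pages.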
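I won't go into how the encodings are implemented; here is an example.
Take the two characters "中文": looking them up in the tables gives the GB2312 encoding "d6d0 cec4", the Unicode code points "4e2d 6587", and the UTF-8 encoding "e4b8ad e69687".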
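Rather than looking the values up in a table, the same bytes can be printed from Java. This is a minimal sketch with made-up class and method names; the string is written with \u escapes (U+4E2D U+6587 is "中文") so the result does not depend on how the source file itself is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HexDumpDemo {
    // Lowercase hex of the text's bytes in the given charset.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        System.out.println(hex(text, Charset.forName("GB2312")));   // d6d0cec4
        System.out.println(hex(text, StandardCharsets.UTF_16BE));   // 4e2d6587
        System.out.println(hex(text, StandardCharsets.UTF_8));      // e4b8ade69687
    }
}
```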
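getBytes(charset) in Java
When no charset is given, getBytes() uses the platform default charset. What is the platform default? It is not, strictly, the encoding the .java file is saved in: it is the JVM's default charset, taken from the file.encoding system property (derived from the OS locale on older JDKs, UTF-8 by default since JDK 18). In Eclipse the two usually coincide, because a program launched from the IDE typically inherits the workspace's text file encoding.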
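A quick way to find out what the default actually is, and to stop depending on it, is sketched below. The class and helper names are illustrative; the string is "哈哈ha1" written with \u escapes (U+54C8 is 哈).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Byte count with an explicit charset: the same on every machine.
    static int explicitLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // What the no-argument getBytes() will use on this JVM.
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));

        String str = "\u54c8\u54c8ha1";
        System.out.println(str.getBytes().length);                          // varies with the default charset
        System.out.println(explicitLength(str, StandardCharsets.UTF_8));    // 9
        System.out.println(explicitLength(str, Charset.forName("GBK")));    // 7
    }
}
```

Passing a Charset explicitly also avoids the checked UnsupportedEncodingException that the String-name overloads declare.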
For example, my Eclipse saves .java files as UTF-8.
Test:
String str = "哈哈ha1";
byte[] b = str.getBytes(); // no charset argument, so the platform default is used
System.out.println(b.length);
This prints 9: each 哈 takes 3 bytes, so 哈哈 takes 6, and h, a, and 1 take one byte each, which matches UTF-8.
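Bytes written in one encoding cannot, in general, be read back correctly with another unless the two are compatible. Text encoded as GBK, for instance, cannot be displayed as UTF-8. Plain ASCII letters are the exception: they display correctly as ISO8859-1, UTF-8, or GB2312/GBK, because all of these share the ASCII range.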
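The compatibility claim comes down to whether the encodings produce identical bytes for a given text. A minimal sketch, with an illustrative class and method name:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiCompatDemo {
    // True when ISO-8859-1, UTF-8 and GBK all encode the text to the same bytes.
    static boolean sameBytesEverywhere(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] gbk = text.getBytes(Charset.forName("GBK"));
        return Arrays.equals(latin1, utf8) && Arrays.equals(utf8, gbk);
    }

    public static void main(String[] args) {
        System.out.println(sameBytesEverywhere("abc123"));   // true: pure ASCII
        System.out.println(sameBytesEverywhere("\u54c8"));   // false: U+54C8 differs per encoding
    }
}
```

Text that passes this check can be decoded with any of the three charsets; anything else needs the charset it was written with.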
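But what happens if GBK-encoded bytes are decoded as UTF-8 and then re-encoded to get back to GBK?
The following test tries each combination.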
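// Runs inside a method that declares "throws UnsupportedEncodingException"
String str = "哈哈123";
byte[] utfByte = str.getBytes("UTF-8");
byte[] gbkByte = str.getBytes("GBK");
byte[] isoByte = str.getBytes("ISO-8859-1"); // not used below; 哈 has no ISO-8859-1 form and becomes '?'
// UTF->GBK->UTF: UTF-8 bytes misread as GBK, written back as GBK, read as UTF-8
String str11 = new String(utfByte, "GBK");
String str12 = new String(str11.getBytes("GBK"), "UTF-8");
System.out.println(str12);
// UTF->ISO->UTF: UTF-8 bytes misread as ISO-8859-1, written back, read as UTF-8
String str21 = new String(utfByte, "ISO-8859-1");
String str22 = new String(str21.getBytes("ISO-8859-1"), "UTF-8");
System.out.println(str22);
// GBK->UTF->GBK: GBK bytes misread as UTF-8, written back as UTF-8, read as GBK
String str31 = new String(gbkByte, "UTF-8");
String str32 = new String(str31.getBytes("UTF-8"), "GBK");
System.out.println(str32);
// GBK->ISO->GBK: GBK bytes misread as ISO-8859-1, written back, read as GBK
String str41 = new String(gbkByte, "ISO-8859-1");
String str42 = new String(str41.getBytes("ISO-8859-1"), "GBK");
System.out.println(str42);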
Result: the first, second, and fourth round trips print 哈哈123; the third (GBK->UTF->GBK) prints garbled text.
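Conclusions:
UTF-8 bytes survived being decoded as GB2312/GBK or ISO8859-1 and re-encoded. Only the ISO8859-1 route is reliable in general, since every byte value maps to a character there; the GBK route worked here because the six UTF-8 bytes of 哈哈 happen to pair up into valid GBK codes.
GB2312/GBK bytes survive only the ISO8859-1 route. Decoded as UTF-8, the invalid bytes are replaced with U+FFFD, and the original data cannot be recovered.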
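The difference between the lossless and lossy routes can be reproduced with one helper. This is a sketch, not the test from above: the class and method names are made up, and the one-character input "哈1" (U+54C8 followed by 1) is my own addition, chosen so that the UTF-8 bytes no longer pair up into valid GBK codes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode with `actual`, misread the bytes as `assumed`, write them back
    // with `assumed`, and finally decode with `actual` again.
    static String roundTrip(String text, Charset actual, Charset assumed) {
        String misread = new String(text.getBytes(actual), assumed);
        return new String(misread.getBytes(assumed), actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        Charset utf8 = StandardCharsets.UTF_8;
        Charset latin1 = StandardCharsets.ISO_8859_1;
        String text = "\u54c8\u54c8123";

        // ISO-8859-1 maps every byte to a character, so nothing is lost.
        System.out.println(text.equals(roundTrip(text, utf8, latin1)));
        System.out.println(text.equals(roundTrip(text, gbk, latin1)));

        // GBK bytes are not valid UTF-8; the decoder substitutes U+FFFD.
        String misread = new String(text.getBytes(gbk), utf8);
        System.out.println(misread.indexOf('\uFFFD') >= 0);
        System.out.println(text.equals(roundTrip(text, gbk, utf8)));

        // UTF-8 through GBK depends on the bytes pairing up.
        System.out.println(text.equals(roundTrip(text, utf8, gbk)));
        System.out.println("\u54c81".equals(roundTrip("\u54c81", utf8, gbk)));
    }
}
```

If bytes must pass through a String unchanged, ISO-8859-1 is the safe intermediate; better still is to keep them as byte[] and decode once with the correct charset.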