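Character set
A set of characters, that is, symbols with particular semantic meaning. The letter "A" is a character. So is "%". Neither has any intrinsic numeric value, nor any direct connection to ASCII, Unicode, or even computers. Symbols existed long before computers did.
Coded character set
An assignment of numeric values to a set of characters. Assigning codes to characters lets them be represented digitally using a particular coded character set. Other coded character sets may assign a different numeric value to the same character. Character set mappings are usually defined by standards bodies; examples include US-ASCII, ISO 8859-1, Unicode (ISO 10646-1), and JIS X0201.
Character-encoding scheme
A mapping from the members of a coded character set to sequences of octets (8-bit bytes). The encoding scheme defines how a sequence of character codes is represented as a sequence of bytes. The numeric value of a character code need not equal the encoded bytes, nor is the relationship necessarily one-to-one or one-to-many. In principle, encoding and decoding a character set can be thought of as serializing and deserializing objects.
Character data is usually encoded for network transmission or file storage. An encoding scheme is not a character set; it is a mapping. Because the two are so closely related, however, most encodings are associated with a single character set. UTF-8, for example, is used only to encode the Unicode character set. Even so, one encoding scheme can handle more than one character set: EUC, for example, can encode characters for several Asian languages.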
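Figure 6-1 illustrates how the UTF-8 encoding scheme encodes a sequence of Unicode characters as a sequence of bytes. UTF-8 encodes character code values below 0x80 as a single byte (standard ASCII). All other Unicode characters are encoded as multibyte sequences: 2 to 6 bytes under the original UTF-8 definition, and at most 4 bytes under the current one (RFC 3629).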
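The variable-length behavior of UTF-8 can be checked directly. The following is a minimal sketch; the class name Utf8LengthSketch and the helper utf8Length are made up for illustration and are not part of the original examples.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthSketch {
    // Returns the number of bytes UTF-8 uses to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One sample character from each UTF-8 length class.
        String[] samples = {
            "A",             // U+0041, below 0x80: 1 byte
            "\u00f1",        // U+00F1: 2 bytes
            "\u20ac",        // U+20AC: 3 bytes
            "\ud83d\ude00"   // U+1F600 (a surrogate pair in Java): 4 bytes
        };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n",
                    s.codePointAt(0), utf8Length(s));
        }
    }
}
```

Only the first sample is plain ASCII, so only it is encoded as a single byte.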
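The term charset is defined in RFC 2278. It is the combination of a coded character set and a character-encoding scheme. The central class of the java.nio.charset package is Charset, which encapsulates the charset abstraction.
Most operating systems are still byte-oriented when it comes to I/O and file storage, so whatever encoding is used, Unicode or otherwise, byte sequences still have to be converted to and from character set encodings.
The classes in the java.nio.charset package address this need. This is not the first time the Java platform has dealt with character set encoding, but it is the most systematic, comprehensive, and flexible solution. The java.nio.charset.spi package provides a service provider interface (SPI) so that encoders and decoders can be plugged in as needed.
Charsets: the default is determined at JVM startup and depends on the underlying operating system environment, the locale, and/or the JVM configuration. If you need a specific charset, the safest approach is to name it explicitly. Do not assume that the default in deployment is the same as in your development environment. Charset names are case-insensitive; that is, uppercase and lowercase letters are treated as equal when charset names are compared. The Internet Assigned Numbers Authority (IANA) maintains the registry of all officially registered charset names.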
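As a rough illustration of naming the charset explicitly rather than relying on the platform default, consider the sketch below. ExplicitCharsetSketch and roundTrip are hypothetical names chosen for this example; StandardCharsets was added in Java 7 and so is newer than the API described in these notes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetSketch {
    // Converts text to bytes and back using an explicitly named charset,
    // so the result does not depend on the platform default.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        byte[] bytes = text.getBytes(cs);
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // The default varies between machines; print it only for information.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Charset names are case-insensitive: both lookups yield the same charset.
        System.out.println(Charset.forName("utf-8").equals(StandardCharsets.UTF_8));

        System.out.println(roundTrip("\u00bfMa\u00f1ana?", "ISO-8859-1"));
    }
}
```

Passing the Charset to getBytes() and to the String constructor keeps the conversion identical on every machine, whatever the default happens to be.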
Example 6-1 shows how characters are translated into byte sequences by different Charset implementations.
Example 6-1. Encoding with the standard charsets
- package com.ronsoft.books.nio.charset;
- import java.nio.charset.Charset;
- import java.nio.ByteBuffer;
- /**
- * Charset encoding test. Run the same input string, which contains some
- * non-ascii characters, through several Charset encoders and dump out the hex
- * values of the resulting byte sequences.
- *
- * @author Ron Hitchens
- */
- public class EncodeTest {
- public static void main(String[] argv) throws Exception {
- // This is the character sequence to encode
- String input = " \u00bfMa\u00f1ana?";
- // the list of charsets to encode with
- String[] charsetNames = { "US-ASCII", "ISO-8859-1", "UTF-8",
- "UTF-16BE", "UTF-16LE", "UTF-16" // , "X-ROT13"
- };
- for (int i = 0; i < charsetNames.length; i++) {
- doEncode(Charset.forName(charsetNames[i]), input);
- }
- }
- /**
- * For a given Charset and input string, encode the chars and print out the
- * resulting byte encoding in a readable form.
- */
- private static void doEncode(Charset cs, String input) {
- ByteBuffer bb = cs.encode(input);
- System.out.println("Charset: " + cs.name());
- System.out.println(" Input: " + input);
- System.out.println("Encoded: ");
- for (int i = 0; bb.hasRemaining(); i++) {
- int b = bb.get();
- int ival = ((int) b) & 0xff;
- char c = (char) ival;
- // Keep tabular alignment pretty
- if (i < 10)
- System.out.print(" ");
- // Print index number
- System.out.print(" " + i + ": ");
- // Better formatted output is coming someday...
- if (ival < 16)
- System.out.print("0");
- // Print the hex value of the byte
- System.out.print(Integer.toHexString(ival));
- // If the byte seems to be the value of a
- // printable character, print it. No guarantee
- // it will be.
- if (Character.isWhitespace(c) || Character.isISOControl(c)) {
- System.out.println("");
- } else {
- System.out.println(" (" + c + ")");
- }
- }
- System.out.println("");
- }
- }
- Charset: US-ASCII
- Input: ?Ma?ana?
- Encoded:
- 0: 20
- 1: 3f (?)
- 2: 4d (M)
- 3: 61 (a)
- 4: 3f (?)
- 5: 61 (a)
- 6: 6e (n)
- 7: 61 (a)
- 8: 3f (?)
- Charset: ISO-8859-1
- Input: ?Ma?ana?
- Encoded:
- 0: 20
- 1: bf (?)
- 2: 4d (M)
- 3: 61 (a)
- 4: f1 (?)
- 5: 61 (a)
- 6: 6e (n)
- 7: 61 (a)
- 8: 3f (?)
- Charset: UTF-8
- Input: ?Ma?ana?
- Encoded:
- 0: 20
- 1: c2 (?)
- 2: bf (?)
- 3: 4d (M)
- 4: 61 (a)
- 5: c3 (?)
- 6: b1 (±)
- 7: 61 (a)
- 8: 6e (n)
- 9: 61 (a)
- 10: 3f (?)
- Charset: UTF-16BE
- Input: ?Ma?ana?
- Encoded:
- 0: 00
- 1: 20
- 2: 00
- 3: bf (?)
- 4: 00
- 5: 4d (M)
- 6: 00
- 7: 61 (a)
- 8: 00
- 9: f1 (?)
- 10: 00
- 11: 61 (a)
- 12: 00
- 13: 6e (n)
- 14: 00
- 15: 61 (a)
- 16: 00
- 17: 3f (?)
- Charset: UTF-16LE
- Input: ?Ma?ana?
- Encoded:
- 0: 20
- 1: 00
- 2: bf (?)
- 3: 00
- 4: 4d (M)
- 5: 00
- 6: 61 (a)
- 7: 00
- 8: f1 (?)
- 9: 00
- 10: 61 (a)
- 11: 00
- 12: 6e (n)
- 13: 00
- 14: 61 (a)
- 15: 00
- 16: 3f (?)
- 17: 00
- Charset: UTF-16
- Input: ?Ma?ana?
- Encoded:
- 0: fe (?)
- 1: ff (?)
- 2: 00
- 3: 20
- 4: 00
- 5: bf (?)
- 6: 00
- 7: 4d (M)
- 8: 00
- 9: 61 (a)
- 10: 00
- 11: f1 (?)
- 12: 00
- 13: 61 (a)
- 14: 00
- 15: 6e (n)
- 16: 00
- 17: 61 (a)
- 18: 00
- 19: 3f (?)
- package java.nio.charset;
- public abstract class Charset implements Comparable
- {
- public static boolean isSupported (String charsetName)
- public static Charset forName (String charsetName)
- public static SortedMap availableCharsets()
- public final String name()
- public final Set aliases()
- public String displayName()
- public String displayName (Locale locale)
- public final boolean isRegistered()
- public boolean canEncode()
- public abstract CharsetEncoder newEncoder();
- public final ByteBuffer encode (CharBuffer cb)
- public final ByteBuffer encode (String str)
- public abstract CharsetDecoder newDecoder();
- public final CharBuffer decode (ByteBuffer bb)
- public abstract boolean contains (Charset cs);
- public final boolean equals (Object ob)
- public final int compareTo (Object ob)
- public final int hashCode()
- public final String toString()
- }
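A charset's canonical name should match the name registered with IANA.
If IANA has registered more than one name for the same charset, the canonical name returned by the object should match the MIME-preferred name in the IANA registry.
If a charset is not registered with IANA, its canonical name must begin with "X-" or "x-".
For the most part, only JVM vendors need to pay attention to these rules. However, if you plan to ship your own charset as part of an application, it helps to know what not to do: isRegistered() should return false, and the charset's name should begin with "X-".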
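The naming rules can be observed on the built-in charsets. This is a small sketch only; CharsetNamesSketch and describe are illustrative names, and the exact alias sets printed depend on the JVM in use.

```java
import java.nio.charset.Charset;

class CharsetNamesSketch {
    // Summarizes the canonical name, registration status and aliases
    // of the charset found under the given name.
    static String describe(String name) {
        Charset cs = Charset.forName(name);
        return cs.name() + " registered=" + cs.isRegistered()
                + " aliases=" + cs.aliases();
    }

    public static void main(String[] args) {
        // Whatever spelling is used for the lookup, name() returns
        // the canonical name.
        System.out.println(describe("utf-8"));
        System.out.println(describe("iso-8859-1"));
        System.out.println(describe("us-ascii"));
    }
}
```

A custom charset such as the X-ROT13 example later in these notes would report registered=false, because its name starts with "X-".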
- public abstract class Charset implements Comparable
- {
- // This is a partial API listing
- public abstract boolean contains (Charset cs);
- public final boolean equals (Object ob)
- public final int compareTo (Object ob)
- public final int hashCode()
- public final String toString()
- }
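Charset encoders: a charset consists of a coded character set and an associated encoding scheme. The CharsetEncoder and CharsetDecoder classes implement the conversion scheme.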
- float averageBytesPerChar()
- Returns the average number of bytes that will be produced for each character of input.
- boolean canEncode(char c)
- Tells whether or not this encoder can encode the given character.
- boolean canEncode(CharSequence cs)
- Tells whether or not this encoder can encode the given character sequence.
- Charset charset()
- Returns the charset that created this encoder.
- ByteBuffer encode(CharBuffer in)
- Convenience method that encodes the remaining content of a single input character buffer into a newly-allocated byte buffer.
- CoderResult encode(CharBuffer in, ByteBuffer out, boolean endOfInput)
- Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer.
- protected abstract CoderResult encodeLoop(CharBuffer in, ByteBuffer out)
- Encodes one or more characters into one or more bytes.
- CoderResult flush(ByteBuffer out)
- Flushes this encoder.
- protected CoderResult implFlush(ByteBuffer out)
- Flushes this encoder.
- protected void implOnMalformedInput(CodingErrorAction newAction)
- Reports a change to this encoder's malformed-input action.
- protected void implOnUnmappableCharacter(CodingErrorAction newAction)
- Reports a change to this encoder's unmappable-character action.
- protected void implReplaceWith(byte[] newReplacement)
- Reports a change to this encoder's replacement value.
- protected void implReset()
- Resets this encoder, clearing any charset-specific internal state.
- boolean isLegalReplacement(byte[] repl)
- Tells whether or not the given byte array is a legal replacement value for this encoder.
- CodingErrorAction malformedInputAction()
- Returns this encoder's current action for malformed-input errors.
- float maxBytesPerChar()
- Returns the maximum number of bytes that will be produced for each character of input.
- CharsetEncoder onMalformedInput(CodingErrorAction newAction)
- Changes this encoder's action for malformed-input errors.
- CharsetEncoder onUnmappableCharacter(CodingErrorAction newAction)
- Changes this encoder's action for unmappable-character errors.
- byte[] replacement()
- Returns this encoder's replacement value.
- CharsetEncoder replaceWith(byte[] newReplacement)
- Changes this encoder's replacement value.
- CharsetEncoder reset()
- Resets this encoder, clearing any internal state.
- CodingErrorAction unmappableCharacterAction()
- Returns this encoder's current action for unmappable-character errors.
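A CharsetEncoder object is a stateful transformation engine: characters go in and bytes come out. Several calls to the encoder may be needed to complete a conversion, and the encoder retains the conversion state between calls.
A note on the CharsetEncoder API: the simpler form of encode() is a convenience method that encodes everything in the CharBuffer you supply into a newly allocated ByteBuffer. This is the method that ultimately gets called when you invoke encode() directly on the Charset class.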
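Overflow: the output buffer has no room for more encoded bytes.
Malformed input: the input characters do not form a legal sequence.
Unmappable character: the input is legal but cannot be represented in the target charset.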
- package java.nio.charset;
- public abstract class CharsetEncoder
- {
- // This is a partial API listing
- public boolean canEncode (char c)
- public boolean canEncode (CharSequence cs)
- }
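The two canEncode() methods listed above make it possible to test input before encoding it. Here is a brief sketch; CanEncodeSketch and asciiCanEncode are made-up names for illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class CanEncodeSketch {
    // Reports whether every character of the text can be encoded as US-ASCII.
    static boolean asciiCanEncode(CharSequence text) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(asciiCanEncode("Manana"));       // plain ASCII letters
        System.out.println(asciiCanEncode("Ma\u00f1ana"));  // contains U+00F1

        // The single-char variant, here with an ISO-8859-1 encoder.
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        System.out.println(latin1.canEncode('\u00f1'));
    }
}
```

US-ASCII has no mapping for U+00F1, so the second call returns false, while ISO-8859-1 can encode it.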
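CodingErrorAction defines three public fields:
REPORT: the default action when a CharsetEncoder is created. It indicates that coding errors should be reported by returning a CoderResult object.
IGNORE: handle coding errors by dropping the erroneous input and continuing.
REPLACE: handle coding errors by dropping the erroneous input and emitting the current replacement byte sequence defined for this CharsetEncoder.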
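The effect of each error action can be compared by encoding the same unmappable character under all three. The sketch below is illustrative only; ErrorActionSketch and encodeAscii are hypothetical names. Note that the convenience form of encode() used here surfaces a REPORT error as a CharacterCodingException rather than as a CoderResult.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ErrorActionSketch {
    // Encodes text as US-ASCII under the given error action and returns the
    // encoded bytes read back as text, or an "error: ..." marker on failure.
    static String encodeAscii(String text, CodingErrorAction action) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            return StandardCharsets.US_ASCII.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            return "error: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String input = "Ma\u00f1ana"; // U+00F1 has no US-ASCII mapping
        System.out.println(encodeAscii(input, CodingErrorAction.REPORT));
        System.out.println(encodeAscii(input, CodingErrorAction.REPLACE));
        System.out.println(encodeAscii(input, CodingErrorAction.IGNORE));
    }
}
```

With REPLACE the US-ASCII encoder substitutes its default replacement byte, a question mark, and with IGNORE the character simply disappears from the output.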
The CoderResult class: CoderResult objects are returned by CharsetEncoder and CharsetDecoder objects:
- package java.nio.charset;
- public class CoderResult {
- public static final CoderResult OVERFLOW
- public static final CoderResult UNDERFLOW
- public boolean isUnderflow()
- public boolean isOverflow()
- public boolean isError()
- public boolean isMalformed()
- public boolean isUnmappable()
- public int length()
- public static CoderResult malformedForLength (int length)
- public static CoderResult unmappableForLength (int length)
- public void throwException() throws CharacterCodingException
- }
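To see how the CoderResult methods above are typically queried, the following sketch runs a UTF-8 decoder over three small inputs. CoderResultSketch, describe, and decodeOnce are names invented for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class CoderResultSketch {
    // Turns a CoderResult into a short description. length() is only
    // meaningful for error results, so it is not called for the others.
    static String describe(CoderResult r) {
        if (r.isUnderflow()) return "underflow";
        if (r.isOverflow()) return "overflow";
        if (r.isMalformed()) return "malformed, length " + r.length();
        return "unmappable, length " + r.length();
    }

    // Decodes the input as UTF-8 in a single call, with endOfInput set to
    // true, into an output buffer of the given capacity.
    static String decodeOnce(byte[] input, int outCapacity) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CoderResult r = decoder.decode(ByteBuffer.wrap(input),
                CharBuffer.allocate(outCapacity), true);
        return describe(r);
    }

    public static void main(String[] args) {
        byte[] hello = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeOnce(hello, 8)); // all input consumed
        System.out.println(decodeOnce(hello, 2)); // output buffer too small
        // 0xC3 starts a two-byte sequence that is never completed.
        System.out.println(decodeOnce(new byte[] { 0x41, (byte) 0xC3 }, 8));
    }
}
```

Because the decoder's default action is REPORT, the truncated sequence comes back as an error result instead of being thrown as an exception.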
- float averageCharsPerByte()
- Returns the average number of characters that will be produced for each byte of input.
- Charset charset()
- Returns the charset that created this decoder.
- CharBuffer decode(ByteBuffer in)
- Convenience method that decodes the remaining content of a single input byte buffer into a newly-allocated character buffer.
- CoderResult decode(ByteBuffer in, CharBuffer out, boolean endOfInput)
- Decodes as many bytes as possible from the given input buffer, writing the results to the given output buffer.
- protected abstract CoderResult decodeLoop(ByteBuffer in, CharBuffer out)
- Decodes one or more bytes into one or more characters.
- Charset detectedCharset()
- Retrieves the charset that was detected by this decoder (optional operation).
- CoderResult flush(CharBuffer out)
- Flushes this decoder.
- protected CoderResult implFlush(CharBuffer out)
- Flushes this decoder.
- protected void implOnMalformedInput(CodingErrorAction newAction)
- Reports a change to this decoder's malformed-input action.
- protected void implOnUnmappableCharacter(CodingErrorAction newAction)
- Reports a change to this decoder's unmappable-character action.
- protected void implReplaceWith(String newReplacement)
- Reports a change to this decoder's replacement value.
- protected void implReset()
- Resets this decoder, clearing any charset-specific internal state.
- boolean isAutoDetecting()
- Tells whether or not this decoder implements an auto-detecting charset.
- boolean isCharsetDetected()
- Tells whether or not this decoder has yet detected a charset (optional operation).
- CodingErrorAction malformedInputAction()
- Returns this decoder's current action for malformed-input errors.
- float maxCharsPerByte()
- Returns the maximum number of characters that will be produced for each byte of input.
- CharsetDecoder onMalformedInput(CodingErrorAction newAction)
- Changes this decoder's action for malformed-input errors.
- CharsetDecoder onUnmappableCharacter(CodingErrorAction newAction)
- Changes this decoder's action for unmappable-character errors.
- String replacement()
- Returns this decoder's replacement value.
- CharsetDecoder replaceWith(String newReplacement)
- Changes this decoder's replacement value.
- CharsetDecoder reset()
- Resets this decoder, clearing any internal state.
- CodingErrorAction unmappableCharacterAction()
- Returns this decoder's current action for unmappable-character errors.
- package java.nio.charset;
- public abstract class CharsetDecoder
- {
- // This is a partial API listing
- public final CharsetDecoder reset()
- public final CharBuffer decode (ByteBuffer in)
- throws CharacterCodingException
- public final CoderResult decode (ByteBuffer in, CharBuffer out,
- boolean endOfInput)
- public final CoderResult flush (CharBuffer out)
- }
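1. Reset the decoder by calling reset(), which puts it in a known state, ready to accept input.
2. Call decode() zero or more times with endOfInput set to false to feed bytes to the decoding engine. As decoding proceeds, characters are added to the given CharBuffer.
3. Call decode() once with endOfInput set to true to tell the decoder that all input has been supplied.
4. Call flush() to make sure that all decoded characters have been sent to the output.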
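The four steps can be sketched in memory as follows. This is a simplified illustration under stated assumptions: the input arrives in exactly two chunks, the output buffer is large enough that overflow cannot occur, and the CoderResult values are ignored for brevity. StepwiseDecodeSketch and decodeInChunks are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

class StepwiseDecodeSketch {
    // Decodes UTF-8 bytes that arrive in two chunks. A multibyte character
    // may be split across the chunk boundary.
    static String decodeInChunks(byte[] first, byte[] second) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        int total = first.length + second.length;
        ByteBuffer in = ByteBuffer.allocate(total);
        // UTF-8 never yields more chars than bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(total);

        decoder.reset();                    // step 1: known starting state

        in.put(first);
        in.flip();
        decoder.decode(in, out, false);     // step 2: more input will follow
        in.compact();                       // keep any undecoded trailing bytes

        in.put(second);
        in.flip();
        decoder.decode(in, out, true);      // step 3: this is the last input
        decoder.flush(out);                 // step 4: emit any buffered state

        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // "Ma\u00f1ana" in UTF-8, split in the middle of the two-byte U+00F1.
        byte[] first = { 0x4D, 0x61, (byte) 0xC3 };
        byte[] second = { (byte) 0xB1, 0x61, 0x6E, 0x61 };
        System.out.println(decodeInChunks(first, second));
    }
}
```

The call to compact() is what carries the dangling first byte of the split character over to the second decode() call; production code would also check each returned CoderResult.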
Example 6-2 shows how to decode a byte stream that represents a charset encoding.
Example 6-2. Charset decoding
- package com.ronsoft.books.nio.charset;
- import java.nio.*;
- import java.nio.charset.*;
- import java.nio.channels.*;
- import java.io.*;
- /**
- * Test charset decoding.
- *
- * @author Ron Hitchens
- */
- public class CharsetDecode {
- /**
- * Test charset decoding in the general case, detecting and handling buffer
- * under/overflow and flushing the decoder state at end of input. This code
- * reads from stdin and decodes the ASCII-encoded byte stream to chars. The
- * decoded chars are written to stdout. This is effectively a 'cat' for
- * input ascii files, but another charset encoding could be used by simply
- * specifying it on the command line.
- */
- public static void main(String[] argv) throws IOException {
- // Default charset is ISO-8859-1 (Latin-1)
- String charsetName = "ISO-8859-1";
- // Charset name can be specified on the command line
- if (argv.length > 0) {
- charsetName = argv[0];
- }
- // Wrap a Channel around stdin, wrap a channel around stdout,
- // find the named Charset and pass them to the decode method.
- // If the named charset is not valid, an exception of type
- // UnsupportedCharsetException will be thrown.
- decodeChannel(Channels.newChannel(System.in), new OutputStreamWriter(
- System.out), Charset.forName(charsetName));
- }
- /**
- * General purpose static method which reads bytes from a Channel, decodes
- * them according to the given Charset, and writes the resulting chars to a Writer.
- *
- * @param source
- * A ReadableByteChannel object which will be read to EOF as a
- * source of encoded bytes.
- * @param writer
- * A Writer object to which decoded chars will be written.
- * @param charset
- * A Charset object, whose CharsetDecoder will be used to do the
- * character set decoding.
- */
- public static void decodeChannel(ReadableByteChannel source, Writer writer,
- Charset charset) throws UnsupportedCharsetException, IOException {
- // Get a decoder instance from the Charset
- CharsetDecoder decoder = charset.newDecoder();
- // Tell decoder to replace bad chars with default mark
- decoder.onMalformedInput(CodingErrorAction.REPLACE);
- decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
- // Allocate radically different input and output buffer sizes
- // for testing purposes
- ByteBuffer bb = ByteBuffer.allocateDirect(16 * 1024);
- CharBuffer cb = CharBuffer.allocate(57);
- // Buffer starts empty; indicate input is needed
- CoderResult result = CoderResult.UNDERFLOW;
- boolean eof = false;
- while (!eof) {
- // Input buffer underflow; decoder wants more input
- if (result == CoderResult.UNDERFLOW) {
- // decoder consumed all input, prepare to refill
- bb.clear();
- // Fill the input buffer; watch for EOF
- eof = (source.read(bb) == -1);
- // Prepare the buffer for reading by decoder
- bb.flip();
- }
- // Decode input bytes to output chars; pass EOF flag
- result = decoder.decode(bb, cb, eof);
- // If output buffer is full, drain output
- if (result == CoderResult.OVERFLOW) {
- drainCharBuf(cb, writer);
- }
- }
- // Flush any remaining state from the decoder, being careful
- // to detect output buffer overflow(s)
- while (decoder.flush(cb) == CoderResult.OVERFLOW) {
- drainCharBuf(cb, writer);
- }
- // Drain any chars remaining in the output buffer
- drainCharBuf(cb, writer);
- // Close the channel; push out any buffered data to stdout
- source.close();
- writer.flush();
- }
- /**
- * Helper method to drain the char buffer and write its content to the given
- * Writer object. Upon return, the buffer is empty and ready to be refilled.
- *
- * @param cb
- * A CharBuffer containing chars to be written.
- * @param writer
- * A Writer object to consume the chars in cb.
- */
- static void drainCharBuf(CharBuffer cb, Writer writer) throws IOException {
- cb.flip(); // Prepare buffer for draining
- // This writes the chars contained in the CharBuffer but
- // doesn't actually modify the state of the buffer.
- // If the char buffer was being drained by calls to get( ),
- // a loop might be needed here.
- if (cb.hasRemaining()) {
- writer.write(cb.toString());
- }
- cb.clear(); // Prepare buffer to be filled again
- }
- }
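The charset service provider interface: the pluggable SPI architecture is used throughout the Java environment in many different contexts. In JDK 1.4, eight packages are named spi, and several others go by other names. Pluggability is a powerful design technique and one of the cornerstones of Java's portability and adaptability.
Before looking at the API, a little explanation of how the Charset SPI works is in order. The java.nio.charset.spi package contains only one abstract class, CharsetProvider. Concrete implementations of this class supply information about the Charset objects they provide. To define a custom charset, you must first create concrete implementations of Charset, CharsetEncoder, and CharsetDecoder from the java.nio.charset package. You then create a custom subclass of CharsetProvider, which makes those classes available to the JVM.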
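At a minimum, you must create a subclass of java.nio.charset.Charset, provide concrete implementations of its three abstract methods, and supply a constructor. The Charset class has no default, no-argument constructor, which means your custom charset class must declare a constructor of its own, even if that constructor takes no arguments. This is because you must invoke Charset's constructor at instantiation time (by calling super() at the start of your constructor) to supply it with your charset's canonical name and aliases. Doing so lets the methods of the Charset class handle the name-related work for you, so it is a good thing.
Likewise, you need to provide concrete implementations of CharsetEncoder and CharsetDecoder. Recall that a charset is the combination of a coded character set and an encoding/decoding scheme. As we have seen, encoding and decoding are nearly symmetrical at the API level. What follows is a brief discussion of what is needed to implement an encoder; the same applies to building a decoder.
Like Charset, CharsetEncoder has no default constructor, so you need to call super() in your concrete class's constructor and supply the required arguments.
To supply your own CharsetEncoder implementation you must, at a minimum, provide a concrete encodeLoop() method. For simple encoding algorithms, the default implementations of the other methods should work fine. Note that encodeLoop() takes arguments similar to those of encode(), minus the boolean flag. The encode() method delegates the actual encoding to encodeLoop(), which needs only to consume characters from the CharBuffer argument and write the encoded bytes to the supplied ByteBuffer.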
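Now that we have seen how to implement a custom charset, including its associated encoder and decoder, let's see how to hook it into the JVM so that running code can make use of it.
To make your own Charset implementations available to the JVM runtime environment, you must create concrete subclasses of java.nio.charset.spi.CharsetProvider, each with a no-argument constructor. The no-argument constructor matters because your CharsetProvider class is located by reading its fully qualified name from a configuration file. That class-name string is then loaded and instantiated with Class.newInstance(), which works only through a no-argument constructor.
The configuration file the JVM reads to locate charset providers is named java.nio.charset.spi.CharsetProvider. It lives in a resource directory (META-INF/services) on the JVM classpath. Every Java Archive (JAR) file has a META-INF directory that can hold information about the classes and resources in that JAR. A directory named META-INF can also be placed at the top of a regular directory on the JVM classpath.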
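One way to check whether a provider configuration file has been picked up is to list the providers visible to the running JVM. The sketch below assumes Java 6 or later, where java.util.ServiceLoader reads the same META-INF/services convention; whether the JVM's own charset lookup goes through ServiceLoader is an implementation detail not confirmed by these notes. ProviderLookupSketch, installedProviders, and isAvailable are made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.spi.CharsetProvider;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class ProviderLookupSketch {
    // Lists the class names of the CharsetProvider implementations that
    // ServiceLoader can discover. The list may be empty.
    static List<String> installedProviders() {
        List<String> names = new ArrayList<String>();
        for (CharsetProvider provider : ServiceLoader.load(CharsetProvider.class)) {
            names.add(provider.getClass().getName());
        }
        return names;
    }

    // Reports whether the JVM can resolve a charset with the given name,
    // whether built in or supplied by a provider.
    static boolean isAvailable(String charsetName) {
        return Charset.isSupported(charsetName);
    }

    public static void main(String[] args) {
        System.out.println("Providers: " + installedProviders());
        System.out.println("UTF-8 available: " + isAvailable("UTF-8"));
        // True only if a provider for X-ROT13 is on the classpath.
        System.out.println("X-ROT13 available: " + isAvailable("X-ROT13"));
    }
}
```

For the provider shown in Example 6-4, the configuration file would contain the single line com.ronsoft.books.nio.charset.RonsoftCharsetProvider.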
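The CharsetProvider API is almost trivial. The real work of providing a custom charset happens in creating the custom Charset, CharsetEncoder, and CharsetDecoder classes. CharsetProvider is merely a facilitator that connects your charset to the runtime environment.
Example 6-3 demonstrates an implementation of a custom Charset and CharsetProvider, with sample code illustrating charset use, encoding and decoding, and the Charset SPI. Example 6-3 implements a custom Charset.
Example 6-3. The custom Rot13 charset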
- package com.ronsoft.books.nio.charset;
- import java.nio.CharBuffer;
- import java.nio.ByteBuffer;
- import java.nio.charset.Charset;
- import java.nio.charset.CharsetEncoder;
- import java.nio.charset.CharsetDecoder;
- import java.nio.charset.CoderResult;
- import java.util.Map;
- import java.util.Iterator;
- import java.io.BufferedReader;
- import java.io.FileReader;
- import java.io.InputStreamReader;
- import java.io.PrintStream;
- /**
- * A Charset implementation which performs Rot13 encoding. Rot13 encoding is a
- * simple text obfuscation algorithm which shifts alphabetical characters by 13
- * so that 'a' becomes 'n', 'o' becomes 'b', etc. This algorithm was popularized
- * by the Usenet discussion forums many years ago to mask naughty words, hide
- * answers to questions, and so on. The Rot13 algorithm is symmetrical, applying
- * it to text that has been scrambled by Rot13 will give you the original
- * unscrambled text.
- *
- * Applying this Charset encoding to an output stream will cause everything you
- * write to that stream to be Rot13 scrambled as it's written out. And applying
- * it to an input stream causes data read to be Rot13 descrambled as it's read.
- *
- * @author Ron Hitchens
- */
- public class Rot13Charset extends Charset {
- // the name of the base charset encoding we delegate to
- private static final String BASE_CHARSET_NAME = "UTF-8";
- // Handle to the real charset we'll use for transcoding between
- // characters and bytes. Doing this allows us to apply the Rot13
- // algorithm to multibyte charset encodings. But only the
- // ASCII alpha chars will be rotated, regardless of the base encoding.
- Charset baseCharset;
- /**
- * Constructor for the Rot13 charset. Call the superclass constructor to
- * pass along the name(s) we'll be known by. Then save a reference to the
- * delegate Charset.
- */
- protected Rot13Charset(String canonical, String[] aliases) {
- super(canonical, aliases);
- // Save the base charset we're delegating to
- baseCharset = Charset.forName(BASE_CHARSET_NAME);
- }
- // ----------------------------------------------------------
- /**
- * Called by users of this Charset to obtain an encoder. This implementation
- * instantiates an instance of a private class (defined below) and passes it
- * an encoder from the base Charset.
- */
- public CharsetEncoder newEncoder() {
- return new Rot13Encoder(this, baseCharset.newEncoder());
- }
- /**
- * Called by users of this Charset to obtain a decoder. This implementation
- * instantiates an instance of a private class (defined below) and passes it
- * a decoder from the base Charset.
- */
- public CharsetDecoder newDecoder() {
- return new Rot13Decoder(this, baseCharset.newDecoder());
- }
- /**
- * This method must be implemented by concrete Charsets. We always say no,
- * which is safe.
- */
- public boolean contains(Charset cs) {
- return (false);
- }
- /**
- * Common routine to rotate all the ASCII alpha chars in the given
- * CharBuffer by 13. Note that this code explicitly compares for upper and
- * lower case ASCII chars rather than using the methods
- * Character.isLowerCase and Character.isUpperCase. This is because the
- * rotate-by-13 scheme only works properly for the alphabetic characters of
- * the ASCII charset and those methods can return true for non-ASCII Unicode
- * chars.
- */
- private void rot13(CharBuffer cb) {
- for (int pos = cb.position(); pos < cb.limit(); pos++) {
- char c = cb.get(pos);
- char a = '\u0000';
- // Is it lowercase alpha?
- if ((c >= 'a') && (c <= 'z')) {
- a = 'a';
- }
- // Is it uppercase alpha?
- if ((c >= 'A') && (c <= 'Z')) {
- a = 'A';
- }
- // If either, roll it by 13
- if (a != '\u0000') {
- c = (char) ((((c - a) + 13) % 26) + a);
- cb.put(pos, c);
- }
- }
- }
- // --------------------------------------------------------
- /**
- * The encoder implementation for the Rot13 Charset. This class, and the
- * matching decoder class below, should also override the "impl" methods,
- * such as implOnMalformedInput( ) and make passthrough calls to the
- * baseEncoder object. That is left as an exercise for the hacker.
- */
- private class Rot13Encoder extends CharsetEncoder {
- private CharsetEncoder baseEncoder;
- /**
- * Constructor, call the superclass constructor with the Charset object
- * and the encodings sizes from the delegate encoder.
- */
- Rot13Encoder(Charset cs, CharsetEncoder baseEncoder) {
- super(cs, baseEncoder.averageBytesPerChar(), baseEncoder
- .maxBytesPerChar());
- this.baseEncoder = baseEncoder;
- }
- /**
- * Implementation of the encoding loop. First, we apply the Rot13
- * scrambling algorithm to the CharBuffer, then reset the encoder for
- * the base Charset and call its encode( ) method to do the actual
- * encoding. This may not work properly for non-Latin charsets. The
- * CharBuffer passed in may be read-only or re-used by the caller for
- * other purposes so we duplicate it and apply the Rot13 encoding to the
- * copy. We DO want to advance the position of the input buffer to
- * reflect the chars consumed.
- */
- protected CoderResult encodeLoop(CharBuffer cb, ByteBuffer bb) {
- CharBuffer tmpcb = CharBuffer.allocate(cb.remaining());
- while (cb.hasRemaining()) {
- tmpcb.put(cb.get());
- }
- tmpcb.rewind();
- rot13(tmpcb);
- baseEncoder.reset();
- CoderResult cr = baseEncoder.encode(tmpcb, bb, true);
- // If error or output overflow, we need to adjust
- // the position of the input buffer to match what
- // was really consumed from the temp buffer. If
- // underflow (all input consumed), this is a no-op.
- cb.position(cb.position() - tmpcb.remaining());
- return (cr);
- }
- }
- // --------------------------------------------------------
- /**
- * The decoder implementation for the Rot13 Charset.
- */
- private class Rot13Decoder extends CharsetDecoder {
- private CharsetDecoder baseDecoder;
- /**
- * Constructor, call the superclass constructor with the Charset object
- * and pass along the chars/byte values from the delegate decoder.
- */
- Rot13Decoder(Charset cs, CharsetDecoder baseDecoder) {
- super(cs, baseDecoder.averageCharsPerByte(), baseDecoder
- .maxCharsPerByte());
- this.baseDecoder = baseDecoder;
- }
- /**
- * Implementation of the decoding loop. First, we reset the decoder for
- * the base charset, then call it to decode the bytes into characters,
- * saving the result code. The CharBuffer is then de-scrambled with the
- * Rot13 algorithm and the result code is returned. This may not work
- * properly for non-Latin charsets.
- */
- protected CoderResult decodeLoop(ByteBuffer bb, CharBuffer cb) {
- baseDecoder.reset();
- CoderResult result = baseDecoder.decode(bb, cb, true);
- rot13(cb);
- return (result);
- }
- }
- // --------------------------------------------------------
- /**
- * Unit test for the Rot13 Charset. This main( ) will open and read an input
- * file if named on the command line, or stdin if no args are provided, and
- * write the contents to stdout via the X-ROT13 charset encoding. The
- * "encryption" implemented by the Rot13 algorithm is symmetrical. Feeding
- * in a plain-text file, such as Java source code for example, will output a
- * scrambled version. Feeding the scrambled version back in will yield the
- * original plain-text document.
- */
- public static void main(String[] argv) throws Exception {
- BufferedReader in;
- if (argv.length > 0) {
- // Open the named file
- in = new BufferedReader(new FileReader(argv[0]));
- } else {
- // Wrap a BufferedReader around stdin
- in = new BufferedReader(new InputStreamReader(System.in));
- }
- // Create a PrintStream that uses the Rot13 encoding
- PrintStream out = new PrintStream(System.out, false, "X-ROT13");
- String s = null;
- // Read all input and write it to the output.
- // As the data passes through the PrintStream,
- // it will be Rot13-encoded.
- while ((s = in.readLine()) != null) {
- out.println(s);
- }
- out.flush();
- }
- }
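For this Charset and its encoder and decoder to be usable, they must be made available to the Java runtime environment. This is done with a CharsetProvider class (Example 6-4).
Example 6-4. The custom charset provider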
- package com.ronsoft.books.nio.charset;
- import java.nio.charset.Charset;
- import java.nio.charset.spi.CharsetProvider;
- import java.util.HashSet;
- import java.util.Iterator;
- /**
- * A CharsetProvider class which makes available the charsets provided by
- * Ronsoft. Currently there is only one, namely the X-ROT13 charset. This is
- * not a registered IANA charset, so its name begins with "X-" to avoid name
- * clashes with official charsets.
- *
- * To activate this CharsetProvider, it's necessary to add a file to the
- * classpath of the JVM runtime at the following location:
- * META-INF/services/java.nio.charset.spi.CharsetProvider
- *
- * That file must contain a line with the fully qualified name of this class on
- * a line by itself: com.ronsoft.books.nio.charset.RonsoftCharsetProvider
- *
- * See the javadoc page for java.nio.charset.spi.CharsetProvider for full
- * details.
- *
- * @author Ron Hitchens
- */
- public class RonsoftCharsetProvider extends CharsetProvider {
- // the name of the charset we provide
- private static final String CHARSET_NAME = "X-ROT13";
- // a handle to the Charset object
- private Charset rot13 = null;
- /**
- * Constructor, instantiate a Charset object and save the reference.
- */
- public RonsoftCharsetProvider() {
- this.rot13 = new Rot13Charset(CHARSET_NAME, new String[0]);
- }
- /**
- * Called by Charset static methods to find a particular named Charset. If
- * it's the name of this charset (we don't have any aliases) then return the
- * Rot13 Charset, else return null.
- */
- public Charset charsetForName(String charsetName) {
- if (charsetName.equalsIgnoreCase(CHARSET_NAME)) {
- return (rot13);
- }
- return (null);
- }
- /**
- * Return an Iterator over the set of Charset objects we provide.
- *
- * @return An Iterator object containing references to all the Charset
- * objects provided by this class.
- */
- public Iterator<Charset> charsets() {
- HashSet<Charset> set = new HashSet<Charset>(1);
- set.add(rot13);
- return (set.iterator());
- }
- }
Adding X-ROT13 to the charset list in Example 6-1 produces this additional output:
- Charset: X-ROT13
- Input: żMańana?
- Encoded:
- 0: c2 (Ż)
- 1: bf (ż)
- 2: 5a (Z)
- 3: 6e (n)
- 4: c3 (Ă)
- 5: b1 (±)
- 6: 6e (n)
- 7: 61 (a)
- 8: 6e (n)
- 9: 3f (?)
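Summary: many Java programmers will never need to deal with character set encoding conversion, and most will never create a custom charset. But for those who do, the classes in java.nio.charset and java.nio.charset.spi provide a powerful and flexible mechanism for character handling.
CharsetProvider SPI
Locates Charset implementations through the service provider mechanism and makes them available for use in the runtime environment.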