What is the Unicode character 首 (U+9996), and how do Java/MySQL handle it and its friends?

I have a Java String that contains the Unicode character U+9996 (that's what I get if I do codePointAt()).

If I look at it in the debugger expressions panel (in Eclipse) then all is well and it looks like "首". However, if I print it out to the console I simply get "?". The font doesn't seem to be the problem, as I've tried setting that differently.

My real problem is that I'm trying to put the string into a MySQL database (with utf8 encoding). Lots of other wide characters show up fine in the db but, again, this one and some others like it show up as "?". All of which leads me to believe that the problem is on the Java side.

In chasing down this bug I've learnt a little about Unicode Normalization and java.text.Normalizer which looks like it might be relevant in this case. I've learnt that U+9996 is the canonical version of U+2FB8. U+2FB8 has exactly the same problems above though as regards display and anyway why would I want to transform to a non-canonical representation (even if I could, which I don't think I can)?
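
For reference, here is a minimal sketch of what java.text.Normalizer does with these two code points. As far as the Unicode data goes, U+2FB8 (KANGXI RADICAL HEAD) maps to U+9996 through a compatibility decomposition rather than a canonical one, so only the NFKC/NFKD forms fold it. The class and method names are made up for illustration.

```java
import java.text.Normalizer;

final class NormalizeRadicalDemo {
    // Compatibility normalization (NFKC): folds compatibility characters
    // such as the Kangxi radicals onto their unified ideographs.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Canonical normalization (NFC): leaves compatibility characters alone.
    static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String radical = "\u2FB8"; // KANGXI RADICAL HEAD
        System.out.println(Integer.toHexString(canonical(radical).codePointAt(0))); // 2fb8
        System.out.println(Integer.toHexString(fold(radical).codePointAt(0)));      // 9996
    }
}
```

Either way, a string that already contains U+9996 is unchanged by all four normalization forms, so normalization is unlikely to be the cause of the "?".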

Anyway, there's one potential clue I've found which I've been unable to comprehend. This page contains the words "U+9996 is not a valid unicode character" with no further explanation. It then proceeds to show how to encode this supposedly non-valid unicode character in various unicode encodings. So my question is this basically: WTF?

UPDATES

I'm on a Mac.

I'm talking about the Eclipse console.
- I set the console encoding to UTF-8 under Run > Common
- I added -Dfile.encoding=UTF-8 to the JVM arguments (the default was MacRoman)
- The console (Eclipse and Terminal.app) now show the right chars. Hooray!
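
The effect of those settings can be sketched in plain Java. When a string is encoded with a charset that has no mapping for a character, the encoder substitutes "?", which is presumably what happened with the MacRoman default. ISO-8859-1 stands in for MacRoman here because every JVM is guaranteed to ship it; the class and method names are hypothetical.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

final class ConsoleEncodingDemo {
    // Encodes s with the named charset and returns the bytes as upper-case hex.
    static String hexBytes(String s, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String head = "\u9996";
        System.out.println(hexBytes(head, "UTF-8"));      // E9A696
        System.out.println(hexBytes(head, "ISO-8859-1")); // 3F, i.e. '?'

        // Wrapping System.out with an explicit encoding avoids depending on
        // the platform default (file.encoding) at all.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(head);
    }
}
```

The last line only displays correctly if the console itself interprets the bytes as UTF-8, which is what the Run > Common setting takes care of.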

I'm mostly interested in the data getting into the database correctly though of course I'd like to get a total understanding of what's going on here.

I think I've fixed the database problem. I forgot to set the encoding on the connection. Now I don't understand why some asian characters were getting through and not others.
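
For anyone hitting the same thing, setting the encoding on the connection usually means adding the MySQL Connector/J properties useUnicode and characterEncoding to the JDBC URL. A rough sketch, where the host, database name, and helper class are placeholders rather than details from my setup:

```java
final class Utf8JdbcUrl {
    // Builds a MySQL JDBC URL that asks Connector/J to exchange text as UTF-8.
    // host and database are hypothetical placeholders.
    static String build(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        String url = build("localhost:3306", "mydb");
        System.out.println(url);
        // The URL would then be passed to
        // java.sql.DriverManager.getConnection(url, user, password).
    }
}
```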

Phew, * moves fast. It's hard to keep up. Thanks people.

3 Answers

#1

Have you verified that the value that gets stored in the database is actually U+003F (question mark)? There are all sorts of conventions for how to display characters that don't exist in the chosen font, and displaying them as '?' is fairly common.

So most likely, the character gets stored correctly, and for whatever reasons, simply gets displayed as '?'. Basically, ignore how it gets rendered, and look at what codepoint gets stored in the database. Is it U+9996 or U+003f (or something else entirely)? Don't blindly assume that just because it gets rendered as a question mark, it is actually a question mark that is stored in the database.
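
One way to do that check from Java, sketched with a hypothetical helper: read the value back and print its code points instead of its glyphs, so the font and console are taken out of the picture.

```java
import java.util.stream.Collectors;

final class StoredValueCheck {
    // Renders every code point of s as U+XXXX, separated by spaces.
    static String toCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // In practice these strings would come from ResultSet.getString(...).
        System.out.println(toCodePoints("\u9996")); // U+9996: stored correctly
        System.out.println(toCodePoints("?"));      // U+003F: really a question mark
    }
}
```

If the output is U+003F, the character was lost before or while it was stored; if it is U+9996, only the display is at fault.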

#2

I don't know about the problems, but it's definitely a valid Unicode character (and has been since Unicode 1.1).
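
Java itself agrees, for what it's worth. A small sketch using only java.lang.Character (the wrapper class and method name are invented for the example):

```java
final class IsValidCodePoint {
    // True if cp lies in the CJK Unified Ideographs block.
    static boolean isCjkUnifiedIdeograph(int cp) {
        return Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static void main(String[] args) {
        int cp = 0x9996;
        System.out.println(Character.isDefined(cp));     // true: an assigned code point
        System.out.println(Character.isIdeographic(cp)); // true: a CJK ideograph
        System.out.println(isCjkUnifiedIdeograph(cp));   // true
        System.out.println(isCjkUnifiedIdeograph(0x2FB8)); // false: Kangxi Radicals block
    }
}
```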

#3

1. What O/S is this running on?
2. What console application is it (xterm, cmd.exe, etc.)?
3. Is the console application set for UTF-8 output?

Regarding 3 above, which is probably the important one: I've seen similar issues when using e.g. PuTTY to talk to a Linux box, where the Linux box thought I was on UTF-8 but the PuTTY session itself was set to ISO-Latin-1 (8859-1).
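
That kind of mismatch is easy to reproduce in Java. The sketch below (class and method names are illustrative only) encodes the character as UTF-8 and then decodes the bytes as ISO-8859-1, the way a misconfigured terminal would:

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Simulates a receiver that reads UTF-8 bytes as ISO-8859-1.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake: recover the original bytes, then decode as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u9996");
        System.out.println(garbled.length());                       // 3
        System.out.println(repair(garbled).equals("\u9996"));       // true
    }
}
```

One character turns into three unrelated ones, which looks different from a plain "?" and is a useful hint about which side has the wrong setting.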
