Why use Unicode if your program is English-only?

So I've read Joel's article, and looked through SO, and it seems the only reason to switch from ASCII to Unicode is for internationalization. The company I work for, as a policy, will only release software in English, even though we have customers throughout the world. Since all of our customers are scientists, they have functional enough English to use our software as a non-native speaker. Or so the logic goes. Because of this policy, there is no pressing need to switch to Unicode to support other languages.

However, I'm starting a new project and wanted to use Unicode (because that is what a responsible programmer is supposed to do, right?). In order to do so, we would have to start converting all of the libraries we've written into Unicode. This is no small task.

If internationalization of the programs themselves is not considered a valid reason, how would one justify all the time spent recoding libraries and programs to make the switch to Unicode?

21 answers

#1

This obviously depends on what your app actually does, but the fact that you only have an English version in no way means that internationalization is not an issue.

What if I want to store a customer name that uses non-English characters? Or the name of a place in another country?

An added bonus (since you say you're targeting scientists) is that all sorts of scientific symbols and notations are supported as part of Unicode.

Ultimately, I find it much easier to be consistent. Unicode behaves the same no matter whose computer you run the app on. Non-Unicode means that you use some locale-dependent character set or code page by default, and so text that looks fine on your computer may be full of garbage characters on someone else's.

Apart from that, you probably don't need to translate all your libraries to Unicode in one go. Write wrappers as needed to convert between Unicode and whichever encoding you use otherwise.

If you use UTF-8 for your Unicode text, you even get the ability to read plain ASCII strings, which should save you some conversion headaches.

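A minimal sketch of that ASCII compatibility, in Python (the sample strings are made up for illustration). It only shows that pure-ASCII bytes decode identically under both codecs; it says nothing about data that is already in some other 8-bit encoding.

```python
# Any pure-ASCII byte string is also valid UTF-8 and decodes to the same text.
legacy_bytes = b"Wavelength (nm)"  # hypothetical output of an ASCII-only library
assert legacy_bytes.decode("ascii") == legacy_bytes.decode("utf-8")

# Going the other way, text limited to ASCII encodes to identical bytes.
label = "Wavelength (nm)"
assert label.encode("utf-8") == label.encode("ascii")

# Only characters outside ASCII need more than one byte in UTF-8.
utf8_len = len("\u03bb".encode("utf-8"))  # Greek small letter lambda
print(utf8_len)
```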

#2

They say they will always ship it in English now, but you admit you have worldwide clients. If a client comes in and says internationalization is a deal breaker, will they really turn that client down?

To clarify the point I'm trying to make: you say that they will not accept this reasoning, but it is sound.

Always better to be safe than sorry, IMO.

#3

The extended Scientific, Technical and Mathematical character set rules.

Where else can you say ⟦∀c∣c∈Unicode⟧ and similar technical stuff?

#4

Characters beyond the 7-bit ASCII range are useful in English as well. Does anyone using your software ever need to write the € sign? Or £? How about distinguishing "résumé" from "resume"? You say it's used by scientists around the world, who may have names like "Jörg" or "Guðmundsdóttir". In a scientific setting, it is useful to talk about wavelengths like λ, units like Å, or angles like Θ, even in English.

Some of these characters, like "ö", "£", and "€" may be available in 8-bit encodings like ISO-8859-1 or Windows-1252, so it may seem like you could just use those encodings and be done with it. The problem is that there are characters outside of those ranges that many people use very frequently, and so lots of existing data is encoded in UTF-8. If your software doesn't understand that when importing data, it may interpret the "£" character in UTF-8 as a sequence of 2 Windows-1252 characters, and render it as "Â£". If this sort of error goes undetected for long enough, you can start to get your data seriously garbled, as multiple passes of misinterpretation alter your data more and more until it becomes unrecoverable.

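The misinterpretation described above is easy to reproduce. The following Python sketch (variable names are illustrative) decodes UTF-8 bytes as Windows-1252, shows how a second pass makes things worse, and shows that a single misread can still be reversed as long as no bytes were lost along the way.

```python
original = "\u00a3"                       # the pound sign
utf8_bytes = original.encode("utf-8")     # b'\xc2\xa3'

# Software that assumes Windows-1252 sees two characters instead of one.
misread = utf8_bytes.decode("cp1252")     # 'Â£'

# If the garbled text is saved as UTF-8 and misread again, it grows further.
double = misread.encode("utf-8").decode("cp1252")

# One level of damage can be undone by reversing the wrong step.
repaired = misread.encode("cp1252").decode("utf-8")

print(misread, double, repaired)
```

Repair like this is only possible while every intermediate character survives; once a system drops or replaces the characters it cannot store, the original is gone.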

And it's good to think about these issues early on in the design of your program. Since strings tend to be a very low-level concept that is threaded throughout your entire program, with lots of assumptions about how they work implicit in how they are used, it can be very difficult and expensive to add Unicode support to a program later on if you have never even thought about the issue to begin with.

My recommendation is to always use Unicode-capable string types and libraries wherever possible, and to make sure that any tests you have that deal with strings (whether unit, integration, regression, or any other sort) pass some Unicode strings through your system to ensure that they work and come through unscathed.

If you don't handle Unicode, then I would recommend ensuring that all data accepted by the system is 7-bit clean (that is, there are no characters beyond the 7-bit US-ASCII range). This will help avoid problems with incompatibilities between 8-bit legacy encodings like the ISO-8859 family and UTF-8.

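A 7-bit-clean check of this kind could be sketched as follows in Python. The function names `is_seven_bit_clean` and `accept_input` are hypothetical, and a real system would decide for itself whether to reject, log, or quarantine offending input.

```python
def is_seven_bit_clean(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit US-ASCII range (0x00-0x7F)."""
    return all(byte < 0x80 for byte in data)


def accept_input(data: bytes) -> str:
    """Hypothetical input gate: refuse anything that is not plain ASCII."""
    if not is_seven_bit_clean(data):
        raise ValueError("input contains bytes outside 7-bit US-ASCII")
    return data.decode("ascii")


print(accept_input(b"resume"))
print(is_seven_bit_clean("r\u00e9sum\u00e9".encode("utf-8")))    # UTF-8 bytes
print(is_seven_bit_clean("r\u00e9sum\u00e9".encode("latin-1")))  # ISO-8859-1 bytes
```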

#5

Suppose your program allows me to put my name in it, on a form, a dialog, whatever, and my name can't be written with ASCII characters... Even though your program is in English, the data may be in another language...

#6

It doesn't matter that your software is not translated: if your users use international characters, then you need to support Unicode to be able to do correct capitalization, sorting, etc.

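As a rough illustration in Python (the names and the `sort_key` helper are made up for this sketch): sorting by raw code point puts accented names after "z", and upper-casing can change a string's length. The accent-stripping key below is only a crude approximation; proper locale-aware collation needs a dedicated library, which is assumed to be out of scope here.

```python
import unicodedata

names = ["Zoz", "Zo\u00eb", "Zoe"]  # "Zo\u00eb" is Zoë

# Naive sort compares code points, so "ë" (U+00EB) lands after "z".
by_code_point = sorted(names)


def sort_key(name: str):
    """Crude key: compare base letters first, then the original spelling."""
    decomposed = unicodedata.normalize("NFD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), name)


by_base_letter = sorted(names, key=sort_key)

# Case mapping is not always one character to one character.
shouted = "stra\u00dfe".upper()  # German "straße"

print(by_code_point, by_base_letter, shouted)
```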

#7

Well, for one, your users might know and understand English, but they can still have 'local' names. If you allow your users to do any kind of input to your application, they might want to use characters that are not part of ASCII. If you don't support Unicode, you will have no way of allowing these names. You'd be forcing your users to adopt a simpler name just because the application isn't smart enough to handle special characters.

Another thing is, even if the standard right now is that the app will only be released in English, you are also blocking the possibility of internationalization with ASCII, adding to the work that needs to be done when the company policy decides that translations are a good thing. Company policy is good, but has also been known to change.

#8

If you have no business need to switch to Unicode, then don't do it. I'm basing this on the fact that you thought you'd need to change code unrelated to the component you already need to change to make it all work with Unicode. If you can make the component/feature you're working on "Unicode ready" without spreading code churn to lots of other components (especially other components without good test coverage), then go ahead and make it Unicode ready. But don't go churn your whole codebase without a business need.

If the business need arises later, address it then. Otherwise, you aren't going to need it.

People in this thread may suppose scenarios where it becomes a business requirement. Run those scenarios by your product managers before considering them scenarios worth addressing. Make sure they know the cost of addressing them when you ask.

#9

The company I work for, **as a policy**, will only release software in English, even though we have customers throughout the world.

One reason only: policies change, and when they change, they will break your existing code. Period.

Design for evil, and you have a chance of not breaking your code so soon. In this case, use Unicode. This happened to me on a Brazil-specific legacy stock-market system.

#10

I'd say this attitude expressed naïveté, but I wouldn't be able to spell naïveté in ASCII-only.

ASCII still works for some computer-only codes, but is no good for the façade between machine and user.

Even without the New Yorker's old-fashioned style of coöperation, how would some poor woman called Zoë cope if her employers used such a system?

Alas, she wouldn't even seek other employment, as updating her résumé would be impossible, and she'd have to resume instead. How's she going to explain that to her fiancée?

#11

Many languages (Java [and thus most JVM-based language implementations], C# [and thus most .NET-based language implementations], Objective-C, Python 3, ...) support Unicode strings by preference or even (nearly) exclusively (you have to go out of your way to work with "strings" of bytes rather than of Unicode characters).

If the company you work for ever intends to use any of these languages and platforms, it would therefore be quite advisable to start planning a Unicode-support strategy; a pilot project in particular might not be a bad idea.

#12

That's a really good question. The only reason I can think of that has nothing to do with I18n or non-English text is that Unicode is particularly suited to being what might be called a hub character set. If you think of your system as a hub with its external dependencies as spokes, you want to isolate character encoding conversions to the spokes, so that your hub system works consistently with your chosen encoding. What makes Unicode an ideal character set for the hub of your system is that it acknowledges the existence of other character sets, it defines equivalences between its own characters and characters in those external character sets, and there's an ongoing process where it extends itself to keep up with the innovation and evolution of external character sets. There are all sorts of weird encodings out there: even when the documentation assures you that the external system or library is using plain ASCII, it often turns out to be some variant like IBM775 or HPRoman8, and the nice thing about Unicode is that no matter what encoding is thrown at you, there's a good chance that there's a table on unicode.org that defines exactly how to convert that data into Unicode and back out again without losing information. Then again, equivalents of a-z are fairly well-defined in every character set, so if your data really is restricted to the standard English alphabet, ASCII may do just as well as a hub character set.

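The hub-and-spoke idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `from_spoke` and `to_spoke` helpers, the Latin-1 input feed, and the UTF-8 output are stand-ins for whatever external systems you actually talk to.

```python
def from_spoke(raw: bytes, encoding: str) -> str:
    """Decode once, at the edge, using the encoding that spoke really uses."""
    return raw.decode(encoding)


def to_spoke(text: str, encoding: str) -> bytes:
    """Encode once, at the edge, on the way out to another spoke."""
    return text.encode(encoding)


incoming = b"J\xf6rg"                    # "Jörg" from a hypothetical Latin-1 feed
name = from_spoke(incoming, "latin-1")   # inside the hub, text is always str
greeting = f"Hello, {name}"              # hub logic never touches encodings
outgoing = to_spoke(greeting, "utf-8")   # hypothetical UTF-8 consumer

print(outgoing)
```

The hub code in the middle stays the same when a new spoke with yet another encoding is added; only the edge calls change.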

A decision on encoding is a decision on two things - what set of characters are permitted and how those characters are represented. Unicode permits you to use pretty much any character ever invented, but you may have your own reasons not to want or need such a wide choice. You might still restrict usernames, for example, to combinations of a-z and underscore, maybe because you have to put them into an external LDAP system whose own character set is restricted, maybe because you need to print them out using a font that doesn't cover all of Unicode, maybe because it closes off the security problems opened up by lookalike characters. If you're using something like ASCII or ISO8859-1, the storage/transmission layer implements a lot of those restrictions; with Unicode the storage layer doesn't restrict anything so you might have to implement your own rules at the application layer. This is more work - more programming, more testing, more possible system states. The tradeoff for that extra work is more flexibility, application-level rules being easier to change than system encodings.

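An application-layer rule of the kind described above might look like this in Python. The policy (lowercase a-z and underscore only) comes from the example in the text; the names `USERNAME_RE` and `is_valid_username` are hypothetical.

```python
import re

# Hypothetical policy: usernames are one or more of a-z or underscore.
USERNAME_RE = re.compile(r"[a-z_]+")


def is_valid_username(name: str) -> bool:
    """Enforce the character restriction in the application, not the encoding."""
    return USERNAME_RE.fullmatch(name) is not None


print(is_valid_username("j_smith"))
print(is_valid_username("j\u00f6rg"))   # contains "ö"
print(is_valid_username("p\u0430ul"))   # Cyrillic "а" that looks like Latin "a"
```

Because the rule lives in one function, loosening it later is a code change rather than a change of storage encoding.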

#13

The reason to use Unicode is to respect proper abstractions in your design.

Just get used to treating the concept of text properly. It is not hard. There's no reason to create a broken design even if your users are English.

#14

Just think of a customer wanting to use names like "Schrödinger's Cat" for files he saved using your software. Or imagine a localized Windows whose translation of My Documents uses non-ASCII characters. That is internationalization that affects your software even though you don't support internationalization at all.

Also, having the option of supporting internationalization later is always a good thing.

#15

Internationalization is so much more than just text in different languages. I bet it's the niche of the future in the IT world. Heck, it already is. A lot has already been said; I just thought I would add a small thing. Even though your customers right now are satisfied with English, that might change in the future. And the longer you wait, the harder it will be to convert your code base. Even today they might have problems with, e.g., file names or other types of data you save/load in your application.

#16

Unicode is like cooties. Once it "infects" one area, it's usually hard to contain, given the interconnectedness of dependencies. Sooner or later, you'll probably have to tie in a library that is Unicode-compliant and thus uses wchar_t or the like. Instead of marshaling between character types, it's nice to have consistent strings throughout.

Thus, it's nice to be consistent. Otherwise you'll end up with something similar to the Windows API, which has an "A" version and a "W" version for most APIs since they weren't consistent to start with. (And in some cases, Microsoft has abandoned creating "A" versions altogether.)

#17

You haven't said what language you're using. In some languages, changing from ASCII to Unicode may be pretty easy, whereas in others (which don't support Unicode) it might be pretty darn hard.

That said, maybe in your situation you shouldn't support Unicode: you can't think of a compelling reason why you should, and there are some reasons (e.g. the cost of changing your existing libraries) which argue against it. I mean, perhaps 'ideally' you should, but in practice there might be some other, more important or more urgent, thing to spend your time and effort on at the moment.

#18

If a program takes text input from the user, it should use Unicode; you never know what language the user is going to use.

#19

Using Unicode leaves the door open for internationalization if requirements ever change and you are required to handle text in languages other than English.

Also, in your new project you could always just write wrappers for the libraries that internally convert between ASCII and Unicode and vice-versa.

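Such a wrapper might be sketched as follows in Python. `legacy_title_case` is a hypothetical stand-in for one of your existing byte-oriented library routines, and `LEGACY_ENCODING` is an assumption about what that library expects.

```python
LEGACY_ENCODING = "ascii"  # assumption: the old library only understands ASCII


def legacy_title_case(raw: bytes) -> bytes:
    """Stand-in for an existing byte-oriented library routine (hypothetical)."""
    return raw.title()


def title_case(text: str) -> str:
    """Unicode-facing wrapper: encode on the way in, decode on the way out."""
    raw = text.encode(LEGACY_ENCODING)  # raises UnicodeEncodeError on non-ASCII
    return legacy_title_case(raw).decode(LEGACY_ENCODING)


print(title_case("sample report"))
```

Letting the wrapper raise on non-ASCII input makes the limits of the old library visible at the boundary, instead of silently corrupting text further in.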

#20

Your potential client may already be running a non-Unicode application in a language other than English, and won't be able to run your program without switching the Windows system locale (the language for non-Unicode programs) back and forth, which will be a big pain.

#21

Because the internet overwhelmingly uses Unicode. Web pages use Unicode. Text files, including your customers' documents, and the data on their clipboards are Unicode.

Secondly, Windows is natively Unicode, and the ANSI APIs are legacy.

Modern applications should use Unicode where applicable, which is almost everywhere.
