There are the standard A-Z, a-z characters, but also there are hyphens, em dashes, quotes, etc.
Plus, there are all of the international characters, like umlauts, etc.
So, for an English-based system, what's the complete set? What about sets for other languages? What about UTF8, UTF16, etc?
Bonus question: How many name fields are needed, and what are their maximum lengths?
EDIT: There are definitely two different types of characters involved in people's names, those that are there as part of the context, and those that are there for structural reasons. I don't want to limit or interfere with the context characters, but I do need to deal with the structural ones.
For example, I had a name come in that was separated by an em dash, but it was hard to distinguish that from the minus character. To make the system easier to search, I want to take all five different types of dashes and map them onto one unique character (minus), so that the searcher doesn't need to know specifically which symbol was initially entered.
The problem exists for dashes and probably for quotes as well, but how many other symbols are affected?
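The dash mapping described above can be sketched as follows. This is only a minimal illustration, not a complete answer to the question: the function name `normalize_dashes` is hypothetical, and it assumes that "dash" means every character in the Unicode general category Pd (dash punctuation) plus the mathematical minus sign U+2212, which is a wider net than the five dashes mentioned. The name as entered would still be stored unchanged; the normalized form is meant for the search index only.

```python
import unicodedata

MINUS_SIGN = "\u2212"  # mathematical minus, category Sm rather than Pd


def normalize_dashes(name: str, replacement: str = "-") -> str:
    """Return name with every dash-like character replaced by one character.

    Assumption: a character counts as a dash if its Unicode general
    category is Pd (hyphen, en dash, em dash, figure dash, ...) or if it
    is the minus sign U+2212.
    """
    return "".join(
        replacement
        if unicodedata.category(ch) == "Pd" or ch == MINUS_SIGN
        else ch
        for ch in name
    )


# The same name typed with an em dash, an en dash, or a plain hyphen
# ends up with the same search form.
variants = ["Smith\u2014Jones", "Smith\u2013Jones", "Smith-Jones"]
search_forms = {normalize_dashes(v) for v in variants}
```

A similar table could be built for quotation marks (categories Pi and Pf), but which quote characters are structural and which belong to the name is a judgment call that this sketch does not try to make.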
10 Answers
#1
There's a good article by the W3C called "Personal names around the world" that explains the problems (and possible solutions) pretty well. It was originally a two-part blog post by Richard Ishida (part 1 and part 2).
Personally I'd say: support every printable Unicode character and, to be safe, provide just a single field "name" that contains the full, formatted name. This way you can store pretty much every form of name. You might need more structured storage, but then don't expect to be able to store every single combination in a structured form, as there are simply too many different ones.
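As a rough sketch of what "every printable Unicode character" could mean in practice, the check below accepts any non-empty string that contains no control, surrogate, private-use, or unassigned code points. The function name `is_acceptable_name` and the exact set of rejected categories are assumptions made for illustration. Format characters (category Cf) are deliberately allowed, because joiners such as U+200C and U+200D are needed to write names correctly in some scripts.

```python
import unicodedata

# Assumption: these general categories are treated as "not printable".
REJECTED_CATEGORIES = {"Cc", "Cs", "Co", "Cn"}


def is_acceptable_name(raw: str) -> bool:
    """Accept any non-empty name made only of assigned, non-control characters."""
    # NFC keeps the text as entered but gives one canonical form for storage.
    name = unicodedata.normalize("NFC", raw).strip()
    if not name:
        return False
    return all(unicodedata.category(ch) not in REJECTED_CATEGORIES for ch in name)
```

The result is a single free-form value, so a one-word name, a name in any script, and a long multi-part name all pass the same check.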
#2
Whitelisting characters that could appear in a person's name is the wrong way to go, if you ask me. Sure, [A-Za-z] is a fair starting point, but, as you said, you get problems with "European" names. So you map all the umlauts, circumflexes and the like. What about Chinese names? Japanese? Indian? Hebrew? You're tilting at windmills.
If you absolutely must check the validity of someone's name, I'd suggest doing a modest blacklist of certain characters. Braces, mathematical characters, some punctuation and such might be safe to ignore. But I'd be cautious, if I were you.
It might be best to just accept whatever comes in. Unicode, stored as UTF-16 or UTF-8, is today's overkill character set and should be adequate for some years to come.
Edit: As for your question about name length and the number of names: if you really want people to write their real and complete names, I guess the only foolproof answer to both of those questions would be "infinite". I can't whip out any real examples for human beings, but surely there are people whose names are analogous to the native name for the city of Bangkok.
#3
I don't think there's a definitive answer. After all, some people have names that can't even be expressed in UTF-16...
There are some odd people out there, who will give their kids the craziest of names, including putting in weird punctuation, accents that don't exist in their own language, etc.
However, you can place arbitrary restrictions on your database. If you want to, you can insist on 7-bit ASCII names. It's slightly rude to users, but they'll live with it. It certainly makes searching easier.
My colleague's daughter is named Amélie. But even some (not all!) official British government web sites ("Please enter the name exactly as shown on the birth certificate") won't accept the accented character, so he has to use 'Amelie' instead.
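One way to get the easier searching without forcing users into ASCII is to keep the name exactly as entered and store a second, folded key for lookups. The sketch below is an illustration under that assumption; the function name `search_key` is hypothetical. It decomposes the text with NFKD, drops combining marks, and case-folds the result.

```python
import unicodedata


def search_key(name: str) -> str:
    """Build an accent-insensitive, case-insensitive key for searching."""
    decomposed = unicodedata.normalize("NFKD", name)
    # unicodedata.combining() is non-zero for combining marks such as U+0301.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


# The stored name keeps its accent; only the search key is folded.
stored_name = "Am\u00e9lie"
key = search_key(stored_name)
```

Letters that have no decomposition (for example "ø") pass through unchanged, so this only covers accents that Unicode represents as a base letter plus a combining mark.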
#4
Any character that can be represented by any multiple of eight bits (greater than zero) is a possible character for a person's name. Lengths of both names and encodings are arbitrary, so no upper bound should be considered.
Just make sure you sanitize your database inputs so little Bobby Drop-tables doesn't get ya.
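A minimal sketch of that advice, using Python's built-in sqlite3 module with a bound parameter so the name is always treated as data rather than SQL. The table name `students` and the helper `add_student` are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT NOT NULL)")


def add_student(connection: sqlite3.Connection, name: str) -> None:
    """Insert a name using a ? placeholder instead of string concatenation."""
    connection.execute("INSERT INTO students (name) VALUES (?)", (name,))


# The hostile-looking name is stored verbatim and the table survives.
add_student(conn, "Robert'); DROP TABLE students;--")
rows = conn.execute("SELECT name FROM students").fetchall()
```

With placeholders there is no need to strip apostrophes, hyphens, or semicolons from names before storing them.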
#5
On the issue of name fields, the WRONG answer is first name, middle initial, last name, etc. for many reasons.
- Many people are known by their middle name, and formally use a first initial, middle name, last name format.
- In some cultures, the surname is the first name, and the given name is the last name.
- Multiple first and/or middle given names are becoming more common. As @Dour High Arch points out, the other extreme is people with only one word in their name.
In an object-oriented database, you would store a Name object with methods to return a directory-style or signature-style name; and the backing store would contain whatever data was necessary to support those methods.
I haven't yet seen a relational database model that improves on the model of two variable-length strings for directory-style and signature-style names.
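The two-string model can be sketched as a small value object. This is only an illustration of the idea, not a recommended schema: the class name `Name`, its field names, and the sample people are all invented here. Both strings are supplied by the person, so nothing has to be derived from first/middle/last parts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    directory_style: str  # form used in sorted listings, e.g. "Public, J. Quincy"
    signature_style: str  # form the person writes, e.g. "J. Quincy Public"

    def directory(self) -> str:
        return self.directory_style

    def signature(self) -> str:
        return self.signature_style


people = [
    Name("Zhang Wei", "Zhang Wei"),                # family name already comes first
    Name("Public, J. Quincy", "J. Quincy Public"),  # known by middle name
    Name("Teller", "Teller"),                       # only one word in the name
]

# Listings sort on the directory form; letters are addressed with the signature form.
listing = sorted(people, key=lambda person: person.directory())
signatures = [person.signature() for person in listing]
```

In a relational table this corresponds to two variable-length text columns, one per style.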
#6
It really depends on what the app is supposed to be used for.
Sure, in theory it's great if you allow every script on God's green earth to be used, but if the DB is also used by support staff, are they going to be able to handle names in Japanese, Hebrew and Thai script? Can your printer, if it's used to print postage labels?
You might add an extra field "Latin Transcription", but IMO it's really OK to restrict it to ISO-8859-1 characters - People who don't use Latin characters are by now so used to having to use a transcription that they don't mind it anymore, unless they're hardcore nationalists.
#7
I'm making software for driving schools in the USA, so what matters most to me is what the state DMVs accept as a proper name on a driver's license. In my case, it would cause problems to allow names beyond what the DMV allows, even if such names were legal, because the same name must later be used for a driver's license.
From *, I still hadn't confirmed the answer I needed. And I happen to know that in my state (Calif) they're using AS400's with software probably written in COBOL, and to the best of my knowledge, those only support an 8-bit character set. (Is it EBCDIC?) Anyway... Ugh.
So, I called the California DMV... Sure enough, their system allows A-Z and spaces and absolutely nothing else. Not even hyphens are allowed -- Hyphens are replaced with spaces. In fact, apparently just to be difficult, they only use capitals. And names such as "O'Malley" must be replaced with OMALLEY.
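The rules described on that call (capitals only, hyphens become spaces, apostrophes are dropped, nothing but A-Z and spaces survives) could be approximated like this. The function name `dmv_name` is hypothetical, and collapsing repeated spaces is my own assumption. How the DMV treats accented letters was not mentioned, so this sketch simply drops anything outside A-Z.

```python
import re


def dmv_name(name: str) -> str:
    """Reduce a name to capitals A-Z and single spaces, per the rules above."""
    text = name.upper()
    text = text.replace("-", " ")          # hyphens are replaced with spaces
    text = re.sub(r"[^A-Z ]", "", text)    # everything else is removed
    return re.sub(r" +", " ", text).strip()  # assumption: collapse extra spaces


example = dmv_name("Anne-Marie O'Malley")
```

Running the same function on the school's records before submission would at least show students the form their name is likely to take on the license.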
Leave it to government. I must say I'm thrilled not to be a developer working for DMV. (Although I could really use that kind of salary.)
#8
UTF-8 should be good enough. As for name fields, you'll want at minimum a first name and a last name.
#9
What do you do when you have "The Artist Formerly Known as Prince"? That symbol he used is not a character in the Unicode set (AFAIK).
It's some levity, but at the same time, names are a rather broad concept that doesn't lend itself well to a structured format. In this case, something free-form might be most appropriate.
#10
Depending on the complexity of your name structure I could see:
- First Name
- Middle Initial/Middle Name
- Last Name
- Suffix (Jr. Sr. II, III, IV, etc.)
- Prefix (Mr., Mrs., Ms., etc.)