I'm trying to gather a Unicode list of all the 'o'-like shapes in the Hindi character set. In fact, a list of any characters (in any language) that use separate characters to indicate an accent would be even better.
I intend to use this Unicode list in a RegExp.
I've been trying to edit a list of character ranges by outputting them in an input TextField, but editing this text causes weird issues (the keyboard cursor isn't placed on the correct character, selections suddenly disappear or warp incorrectly... in other words... HINDI HELL!)
I've tried this with Notepad++ too, but although it was more responsive, it eventually crapped out on me like it did in the Flash Player textfield. This seems to occur especially while removing the [] block (nulls?) characters. Some of them trigger odd behaviors.
Anyway, all I want is a list of the accents. A few examples are in the image below (but I would need ALL accents):
Thanks!
3 Answers
#1
4
You can find PDFs containing lists of Unicode ranges, grouped by script, here: http://unicode.org/charts/
For Hindi, you probably want Devanagari or Devanagari Extended.
#2
3
Here is the character class for Devanagari combining marks:
[\u0901\u0902\u0903\u093c\u093e\u093f\u0940\u0941\u0942\u0943
\u0944\u0945\u0946\u0947\u0948\u0949\u094a\u094b\u094c\u094d
\u0951\u0952\u0953\u0954\u0962\u0963]
This is only the basic Devanagari block (not Devanagari Extended).
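A list like this can also be generated rather than typed by hand. Below is a minimal Python sketch, assuming that "combining mark" means any character whose General_Category starts with M (Mn, Mc, Me). The helper names `devanagari_marks` and `to_char_class` are made up for illustration. The result depends on the Unicode version bundled with your Python, so it may contain a few more marks than the class above.

```python
import unicodedata

# Basic Devanagari block only; Devanagari Extended is not covered here.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def devanagari_marks():
    """Return code points in the block whose General_Category is a Mark (Mn, Mc, Me)."""
    return [
        cp
        for cp in range(DEVANAGARI_START, DEVANAGARI_END + 1)
        if unicodedata.category(chr(cp)).startswith("M")
    ]


def to_char_class(code_points):
    """Format code points as a character class with four-digit \\uXXXX escapes."""
    return "[" + "".join(f"\\u{cp:04X}" for cp in code_points) + "]"


if __name__ == "__main__":
    # Paste the printed class into the RegExp instead of editing the raw characters.
    print(to_char_class(devanagari_marks()))
```

Working with escapes only means the combining characters never have to be displayed or edited in a text field.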
#3
0
If you want the complete set (for all languages), you can do it programmatically. Start from the Unicode data file at ftp://ftp.unicode.org/Public/6.1.0/ucd/UnicodeData.txt, described by TR-44 (http://unicode.org/reports/tr44/#Property_Definitions).
You can use the Canonical_Combining_Class field (see http://unicode.org/reports/tr44/#Canonical_Combining_Class_Values) to filter the exact characters you want. I can't be more precise, because "accent" is a bit vague :-) You might also have to look at General_Category to get the filter right (and exclude certain marks, symbols, or punctuation).
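To make the two fields concrete, here is a rough Python sketch of filtering UnicodeData.txt records. In that file each record is a semicolon-separated line where field 0 is the code point, field 2 is General_Category and field 3 is Canonical_Combining_Class. The sample records are written out in that layout for illustration (check them against the real file), and `is_mark` is only one possible reading of "accent", not the definitive one.

```python
# Sample records in the UnicodeData.txt layout (semicolon-separated fields).
SAMPLE = """\
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
0915;DEVANAGARI LETTER KA;Lo;0;L;;;;;N;;;;;
093E;DEVANAGARI VOWEL SIGN AA;Mc;0;L;;;;;N;;;;;
094D;DEVANAGARI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
"""


def parse_record(line):
    """Return (code point, General_Category, Canonical_Combining_Class) for one record."""
    fields = line.split(";")
    return int(fields[0], 16), fields[2], int(fields[3])


def is_mark(general_category, combining_class):
    """Assumed filter: any Mark category (Mn, Mc, Me) or any non-zero combining class."""
    return general_category.startswith("M") or combining_class != 0


def select_marks(text):
    """Return the code points of all records that pass is_mark."""
    selected = []
    for line in text.splitlines():
        if not line.strip():
            continue
        code_point, category, ccc = parse_record(line)
        if is_mark(category, ccc):
            selected.append(code_point)
    return selected


if __name__ == "__main__":
    print([f"U+{cp:04X}" for cp in select_marks(SAMPLE)])
```

Note that U+093E has a combining class of 0, so a filter on Canonical_Combining_Class alone would miss it. That is why General_Category is worth checking as well.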
And a script doing this would definitely be better than trying to mess with text editors. One of the characteristics of combining characters is that they combine :-) So you might get all kinds of puzzling results (like this: http://www.siao2.com/2006/02/17/533929.aspx :-)
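As a sketch of what such a script could look like in Python: it uses the built-in `unicodedata` module instead of downloading UnicodeData.txt, so the result reflects whatever Unicode version your Python ships with. It assumes "accent" means General_Category Mn, Mc or Me, and it stops at U+FFFF on the assumption that the target RegExp only understands four-digit \uXXXX escapes. All function names are invented for the example.

```python
import unicodedata


def all_marks(limit=0x10000):
    """Code points below `limit` whose General_Category is Mn, Mc, or Me."""
    return [cp for cp in range(limit) if unicodedata.category(chr(cp)).startswith("M")]


def to_ranges(code_points):
    """Collapse sorted code points into inclusive (start, end) runs."""
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            # Extend the current run by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def ranges_to_char_class(ranges):
    """Format runs as a character class, e.g. [\\u0901-\\u0903\\u093C]."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(f"\\u{start:04X}")
        else:
            parts.append(f"\\u{start:04X}-\\u{end:04X}")
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    print(ranges_to_char_class(to_ranges(all_marks())))
```

Collapsing consecutive code points into ranges keeps the class short, and since the output is plain ASCII escapes it can be pasted into source code without the editor trying to combine anything.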