Regular Expressions for Chinese Text

Date: 2022-05-19 06:46:44

Regex for Chinese: [\u2E80-\uFE4F]+


The following two patterns are the ones currently circulating on the web:
/^[\u0391-\uFFE5]+$/ 
/^[\u4E00-\u9FA5]+$/ 

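The second range is clearly the smaller of the two, and testing shows that it is incomplete: '\u9FA6', the first code point past its upper bound, is itself a Han ideograph (the CJK Unified Ideographs block runs to U+9FFF), so the second pattern does not cover everything it needs to.
The last character of the first pattern, '\uFFE5', is the fullwidth yen sign '￥', and the next code point, '\uFFE6', is the fullwidth won sign '￦'. So I think the first pattern is roughly right at the upper end. Its lower bound '\u0391', however, is 'Α', the Greek capital letter alpha, which is neither the ASCII halfwidth 'A' nor the fullwidth 'Ａ'. So I think the first range is somewhat too wide, especially at the low end.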
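The two observations above can be checked directly. The following is a minimal sketch in Python rather than JavaScript (the `\uXXXX` escapes are written the same way in Python's `re` module); the names `WIDE`, `NARROW` and `matches` are made up for this illustration.

```python
import re

# The two widely circulated patterns, copied from the text above.
WIDE = re.compile(r'^[\u0391-\uFFE5]+$')
NARROW = re.compile(r'^[\u4E00-\u9FA5]+$')

def matches(pattern, text):
    """Return True if the whole string is accepted by the pattern."""
    return pattern.match(text) is not None

samples = [
    "\u4e2d\u6587",   # "中文", ordinary Han characters
    "\u9fa6",         # first code point past the narrow pattern's upper bound
    "\u0391\u0392",   # Greek capital alpha and beta
    "abc",            # plain ASCII letters
]

for text in samples:
    print(ascii(text), "wide:", matches(WIDE, text), "narrow:", matches(NARROW, text))
```

The narrow pattern rejects U+9FA6 even though it is a Han ideograph, and the wide pattern accepts a string made only of Greek letters.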
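So I went to check the Unicode block table.
It turns out that Han characters are laid out in an unusual way. They do not sit in one contiguous range the way Hebrew does (U+0590 -- U+05FF). Instead they are split across many smaller blocks, and because many of the characters are shared by Chinese, Japanese and Korean, the blocks labelled CJK in the Unicode table are generally the Han character blocks.
Going through the table, the first block labelled CJK starts at U+2E80 and the last one ends at U+FE4F. Note that this span also takes in the non-Han blocks that lie between those endpoints, such as Hiragana, Katakana and Hangul Syllables, so it is a broad superset rather than an exact match for Han characters. The final conclusion is therefore: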

/^[\u2E80-\uFE4F]+$/ 
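As a rough check of what this pattern accepts, here is a small Python sketch. `CJK_BROAD` is the pattern above. `HAN_ONLY` is not from the original post: it is a hypothetical narrower alternative built only from the three ideograph blocks in the table below (CJK Unified Ideographs Extension A, CJK Unified Ideographs, CJK Compatibility Ideographs), shown for comparison.

```python
import re

# The pattern arrived at above: everything from the first CJK block to the last.
CJK_BROAD = re.compile(r'^[\u2E80-\uFE4F]+$')

# Assumption for illustration only: restrict to the ideograph blocks themselves.
HAN_ONLY = re.compile(r'^[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]+$')

samples = {
    "Han": "\u6c49\u5b57",        # "汉字"
    "Hiragana": "\u304b\u306a",   # "かな"
    "Hangul": "\ud55c\uae00",     # "한글"
    "ASCII": "abc",
    "Fullwidth comma": "\uff0c",  # U+FF0C lies above U+FE4F
}

for label, text in samples.items():
    broad = CJK_BROAD.match(text) is not None
    han = HAN_ONLY.match(text) is not None
    print(f"{label}: broad={broad} han_only={han}")
```

The broad pattern accepts kana and Hangul as well as Han characters, while fullwidth punctuation in the U+FF00 block falls outside it. Which behaviour is wanted depends on whether the goal is "East Asian text" or strictly Han characters.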

Finally, for reference, here is the Unicode block table:
U+0000 -- U+007F: Basic Latin 
U+0080 -- U+00FF: Latin-1 Supplement 
U+0100 -- U+017F: Latin Extended-A 
U+0180 -- U+024F: Latin Extended-B 
U+0250 -- U+02AF: IPA Extensions 
U+02B0 -- U+02FF: Spacing Modifier Letters 
U+0300 -- U+036F: Combining Diacritical Marks 
U+0370 -- U+03FF: Greek and Coptic 
U+0400 -- U+04FF: Cyrillic 
U+0500 -- U+052F: Cyrillic Supplement 
U+0530 -- U+058F: Armenian 
U+0590 -- U+05FF: Hebrew 
U+0600 -- U+06FF: Arabic 
U+0700 -- U+074F: Syriac 
U+0750 -- U+077F: Arabic Supplement 
U+0780 -- U+07BF: Thaana 
U+07C0 -- U+07FF: NKo 
U+0900 -- U+097F: Devanagari 
U+0980 -- U+09FF: Bengali 
U+0A00 -- U+0A7F: Gurmukhi 
U+0A80 -- U+0AFF: Gujarati 
U+0B00 -- U+0B7F: Oriya 
U+0B80 -- U+0BFF: Tamil 
U+0C00 -- U+0C7F: Telugu 
U+0C80 -- U+0CFF: Kannada 
U+0D00 -- U+0D7F: Malayalam 
U+0D80 -- U+0DFF: Sinhala 
U+0E00 -- U+0E7F: Thai 
U+0E80 -- U+0EFF: Lao 
U+0F00 -- U+0FFF: Tibetan
U+1000 -- U+109F: Myanmar 
U+10A0 -- U+10FF: Georgian 
U+1100 -- U+11FF: Hangul Jamo 
U+1200 -- U+137F: Ethiopic 
U+1380 -- U+139F: Ethiopic Supplement 
U+13A0 -- U+13FF: Cherokee 
U+1400 -- U+167F: Unified Canadian Aboriginal Syllabics 
U+1680 -- U+169F: Ogham 
U+16A0 -- U+16FF: Runic 
U+1700 -- U+171F: Tagalog 
U+1720 -- U+173F: Hanunoo 
U+1740 -- U+175F: Buhid 
U+1760 -- U+177F: Tagbanwa 
U+1780 -- U+17FF: Khmer 
U+1800 -- U+18AF: Mongolian
U+1900 -- U+194F: Limbu 
U+1950 -- U+197F: Tai Le 
U+1980 -- U+19DF: New Tai Lue 
U+19E0 -- U+19FF: Khmer Symbols 
U+1A00 -- U+1A1F: Buginese 
U+1B00 -- U+1B7F: Balinese 
U+1D00 -- U+1D7F: Phonetic Extensions 
U+1D80 -- U+1DBF: Phonetic Extensions Supplement 
U+1DC0 -- U+1DFF: Combining Diacritical Marks Supplement 
U+1E00 -- U+1EFF: Latin Extended Additional 
U+1F00 -- U+1FFF: Greek Extended 
U+2000 -- U+206F: General Punctuation 
U+2070 -- U+209F: Superscripts and Subscripts 
U+20A0 -- U+20CF: Currency Symbols 
U+20D0 -- U+20FF: Combining Diacritical Marks for Symbols 
U+2100 -- U+214F: Letterlike Symbols 
U+2150 -- U+218F: Number Forms 
U+2190 -- U+21FF: Arrows 
U+2200 -- U+22FF: Mathematical Operators 
U+2300 -- U+23FF: Miscellaneous Technical 
U+2400 -- U+243F: Control Pictures 
U+2440 -- U+245F: Optical Character Recognition 
U+2460 -- U+24FF: Enclosed Alphanumerics 
U+2500 -- U+257F: Box Drawing 
U+2580 -- U+259F: Block Elements 
U+25A0 -- U+25FF: Geometric Shapes 
U+2600 -- U+26FF: Miscellaneous Symbols 
U+2700 -- U+27BF: Dingbats 
U+27C0 -- U+27EF: Miscellaneous Mathematical Symbols-A 
U+27F0 -- U+27FF: Supplemental Arrows-A 
U+2800 -- U+28FF: Braille Patterns 
U+2900 -- U+297F: Supplemental Arrows-B 
U+2980 -- U+29FF: Miscellaneous Mathematical Symbols-B 
U+2A00 -- U+2AFF: Supplemental Mathematical Operators 
U+2B00 -- U+2BFF: Miscellaneous Symbols and Arrows 
U+2C00 -- U+2C5F: Glagolitic 
U+2C60 -- U+2C7F: Latin Extended-C 
U+2C80 -- U+2CFF: Coptic 
U+2D00 -- U+2D2F: Georgian Supplement 
U+2D30 -- U+2D7F: Tifinagh 
U+2D80 -- U+2DDF: Ethiopic Extended 
U+2E00 -- U+2E7F: Supplemental Punctuation 
U+2E80 -- U+2EFF: CJK Radicals Supplement 
U+2F00 -- U+2FDF: Kangxi Radicals 
U+2FF0 -- U+2FFF: Ideographic Description Characters 
U+3000 -- U+303F: CJK Symbols and Punctuation 
U+3040 -- U+309F: Hiragana 
U+30A0 -- U+30FF: Katakana 
U+3100 -- U+312F: Bopomofo 
U+3130 -- U+318F: Hangul Compatibility Jamo 
U+3190 -- U+319F: Kanbun 
U+31A0 -- U+31BF: Bopomofo Extended 
U+31C0 -- U+31EF: CJK Strokes 
U+31F0 -- U+31FF: Katakana Phonetic Extensions 
U+3200 -- U+32FF: Enclosed CJK Letters and Months 
U+3300 -- U+33FF: CJK Compatibility 
U+3400 -- U+4DBF: CJK Unified Ideographs Extension A 
U+4DC0 -- U+4DFF: Yijing Hexagram Symbols 
U+4E00 -- U+9FFF: CJK Unified Ideographs 
U+A000 -- U+A48F: Yi Syllables 
U+A490 -- U+A4CF: Yi Radicals 
U+A700 -- U+A71F: Modifier Tone Letters 
U+A720 -- U+A7FF: Latin Extended-D 
U+A800 -- U+A82F: Syloti Nagri 
U+A840 -- U+A87F: Phags-pa 
U+AC00 -- U+D7AF: Hangul Syllables 
U+D800 -- U+DB7F: High Surrogates 
U+DB80 -- U+DBFF: High Private Use Surrogates 
U+DC00 -- U+DFFF: Low Surrogates 
U+E000 -- U+F8FF: Private Use Area 
U+F900 -- U+FAFF: CJK Compatibility Ideographs 
U+FB00 -- U+FB4F: Alphabetic Presentation Forms 
U+FB50 -- U+FDFF: Arabic Presentation Forms-A 
U+FE00 -- U+FE0F: Variation Selectors 
U+FE10 -- U+FE1F: Vertical Forms 
U+FE20 -- U+FE2F: Combining Half Marks 
U+FE30 -- U+FE4F: CJK Compatibility Forms 
U+FE50 -- U+FE6F: Small Form Variants 
U+FE70 -- U+FEFF: Arabic Presentation Forms-B 
U+FF00 -- U+FFEF: Halfwidth and Fullwidth Forms 
U+FFF0 -- U+FFFF: Specials
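
To see which block a given character belongs to, the table can be turned into a small lookup. This is only a sketch: `BLOCKS` holds a handful of rows copied from the table above rather than the full list, and `block_of` is a hypothetical helper name.

```python
from bisect import bisect_right

# A few rows from the table above: (first code point, last code point, block name).
# The list must stay sorted by first code point for the binary search to work.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x2E80, 0x2EFF, "CJK Radicals Supplement"),
    (0x3040, 0x309F, "Hiragana"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0xFF00, 0xFFEF, "Halfwidth and Fullwidth Forms"),
]
STARTS = [start for start, _, _ in BLOCKS]

def block_of(ch):
    """Return the block name for one character, or None if it is not in this excerpt."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1   # last block starting at or before cp
    if i >= 0:
        _, end, name = BLOCKS[i]
        if cp <= end:
            return name
    return None

for ch in ["\u4e2d", "\u9fa6", "\u0391", "\uffe5"]:
    print(f"U+{ord(ch):04X}", block_of(ch))
```

Looking up the boundary characters discussed earlier this way shows that U+9FA6 sits in CJK Unified Ideographs, U+0391 in Greek and Coptic, and U+FFE5 in Halfwidth and Fullwidth Forms.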