How to specify a range of Unicode characters

Time: 2022-05-22 00:11:35

How do I specify a range of unicode characters from ' ' (space) to \u00D7FF?

I have a regular expression like r'[\u0020-\u00D7FF]' and it won't compile; the error says it's a bad range. I am new to Unicode regular expressions, so I haven't had this problem before.

Is there a way to make this compile or a regular expression that I'm forgetting or haven't learned yet?

2 answers

#1


25  

The syntax of your unicode range will not do what you expect.

  1. The raw r'' string prevents \u escapes from being parsed, and the regex engine (Python 2, as used here) will not parse them either. The only range in this set is 0-\u, which the engine reads as '0' (48) to 'u' (117):

    >>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
    in
      literal 117
      literal 48
      literal 48
      literal 50
      range (48, 117)
      literal 48
      literal 48
      literal 100
      literal 55
      literal 102
      literal 102
    
  2. Making it a Unicode literal (ur'') causes \u to be parsed while leaving other backslashes alone (not a concern here), but the leading zeroes break it. The syntax is \uxxxx (exactly four hex digits) or \Uxxxxxxxx (exactly eight), so \u00d7ff is parsed as "\u00d7, f, f".

    >>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
    in
      range (32, 215)
      literal 102
      literal 102
    
  3. Removing the leading zeroes or switching to \U0000d7ff will fix it:

    >>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
    in
      range (32, 55295)
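
The three steps above can be checked with a short sketch. The sessions in this answer are Python 2 (`ur''` is not valid in Python 3), so this version assumes Python 3.4 or later and uses ordinary string literals; the names `too_short`, `four_digit`, `eight_digit` and `sample` are made up for illustration.

```python
import re

# \u takes exactly four hex digits, so this is the range
# U+0020..U+00D7 followed by two literal "f" characters.
too_short = re.compile('[\u0020-\u00d7ff]')

# The intended range U+0020..U+D7FF, written two equivalent ways.
four_digit = re.compile('[\u0020-\ud7ff]')
eight_digit = re.compile('[\u0020-\U0000d7ff]')

# U+4E2D lies inside U+0020..U+D7FF but above U+00D7.
sample = '\u4e2d'

print(too_short.fullmatch(sample) is None)        # True
print(four_digit.fullmatch(sample) is not None)   # True
print(eight_digit.fullmatch(sample) is not None)  # True
```

The first pattern still compiles; it simply describes a much smaller set than intended, which is why the mistake is easy to miss.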
    

#2


5  

If you're using Python 2.x, you should make sure you're specifying a unicode string (with u'', or the "unicode" built-in):

>>> r = re.compile(u'[\u0020-\uD7FF]')
>>> r.search(u'foo \uD7F0 bar')
<_sre.SRE_Match object at 0xb7084950>
>>> r.search(u' ')
<_sre.SRE_Match object at 0xb7084b48>

Using a raw string (as you are, with r'') gives you the (ASCII) string made up of a backslash, the letter "u", the digit 0, and so on, rather than the Unicode characters the escapes stand for.
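
To make the difference concrete, here is a small sketch, assuming Python 3.3 or later rather than the Python 2 used above; `raw`, `cooked` and `pattern` are illustrative names only. It also shows that on those newer versions the `re` module recognises `\uXXXX` by itself, so a raw pattern with a four-digit escape does work there.

```python
import re

raw = r'\u0020'    # six characters: backslash, 'u', '0', '0', '2', '0'
cooked = '\u0020'  # one character: U+0020, the space

print(len(raw), len(cooked))  # 6 1

# Python 3.3+ only: re parses \uXXXX itself, so the raw string is fine here.
pattern = re.compile(r'[\u0020-\ud7ff]')
print(pattern.fullmatch('\u4e2d') is not None)  # True
```

On Python 2 the raw form is never interpreted as an escape, which is why the answers above use `u''` or `ur''` instead.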
