How do I specify a Cyrillic character range in a Python 3.2 regex?

Time: 2022-02-27 10:46:39

Once upon a time, I found this question interesting.

Today I decided to play around with the text of that book.

I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.

#!/usr/bin/env python3.2
# coding=UTF-8

import sys, re

for file in sys.argv[1:]:
    f = open(file)
    fs = f.read()
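    # Raw string so the backslash escapes reach the regex engine unchanged
    regexnl = re.compile(r'[^\s\w.,?!:;-]')
    # Substitute on the text that was read (fs), not on the file object (f)
    rstuff = regexnl.sub('', fs)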
    f.close()
    print(rstuff)

Something very similar has already been done in this answer.

Basically, I just want to be able to specify a set of characters that are not alphabetic, alphanumeric, or punctuation or whitespace.

3 solutions

#1


7  

This doesn't exactly answer your question, but the third-party regex module has much better Unicode support than the built-in re module. For example, regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a huge number of other Unicode properties). It also handles Unicode case-insensitive matching correctly.
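
As a rough illustration: with the regex package installed, the call would look like regex.findall(r'\p{Cyrillic}+', text). Where installing it is not an option, the sketch below approximates the same idea using only the standard library, by checking each character's Unicode name. The helper names is_cyrillic and keep_cyrillic are made up for this example, and the name-prefix check is an assumption that only approximates the real script property.

```python
import unicodedata


def is_cyrillic(ch):
    """Rough stand-in for the Cyrillic script property (an approximation)."""
    # unicodedata.name() raises ValueError for unnamed characters unless a
    # default is supplied, so an empty string is passed as the fallback.
    return unicodedata.name(ch, "").startswith("CYRILLIC")


def keep_cyrillic(text):
    """Drop every character whose Unicode name does not start with CYRILLIC."""
    return "".join(ch for ch in text if is_cyrillic(ch))


if __name__ == "__main__":
    print(keep_cyrillic("Vladivostok (Russian: Владивосток)"))
```

Filtering character by character is slower than a compiled pattern, so this is probably only worth it for small inputs or one-off checks.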

#2


9  

You can specify the Unicode range pretty easily: \u0400-\u0500 (the basic Cyrillic block runs from U+0400 to U+04FF). See also here.

Here's an example with some text from the Russian Wikipedia, and also a sentence from the English Wikipedia containing a single word in Cyrillic.

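# coding=utf-8
import re

# Python 3 strings are Unicode already, so no u"" prefix is needed
# (Python 3.2 rejects the prefix outright).
ru = "Владивосток находится на одной широте с Сочи, однако имеет среднегодовую температуру почти на 10 градусов ниже."
en = "Vladivostok (Russian: Владивосток; IPA: [vlədʲɪvɐˈstok] ( listen); Chinese: 海參崴; pinyin: Hǎishēnwǎi) is a city and the administrative center of Primorsky Krai, Russia"

# Each match is a run of one or more characters in the Cyrillic range
cyril1 = re.findall("[\u0400-\u0500]+", en)
cyril2 = re.findall("[\u0400-\u0500]+", ru)

for x in cyril1:
    print(x)

print("------")  # separator between the two result lists

for x in cyril2:
    print(x)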

output:

Владивосток
------
Владивосток
находится
на
одной
широте
с
Сочи
однако
имеет
среднегодовую
температуру
почти
на
градусов
ниже

Addition:

Two other ways that should also work, and that are a bit less hackish than specifying a Unicode range:

  • re.findall("(?u)\w+", text) should match Cyrillic as well as Latin word characters.
  • re.findall("\w+", text, re.UNICODE) is equivalent.

So, more specifically for your problem: re.compile('[^\s\w.,?!:;-]', re.UNICODE) should do the trick.

See here (point 7)
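
A minimal Python 3 sketch of that pattern, applied to a string rather than a file. The function name strip_other_symbols and the sample text are assumptions made for illustration. In Python 3, str patterns are already Unicode-aware, so passing re.UNICODE is redundant there but harmless.

```python
import re

# Anything that is not whitespace, a word character, or the listed punctuation.
# The trailing '-' is literal because it is the last item in the class.
UNWANTED = re.compile(r"[^\s\w.,?!:;-]", re.UNICODE)


def strip_other_symbols(text):
    """Remove everything except word characters, whitespace and basic punctuation."""
    return UNWANTED.sub("", text)


if __name__ == "__main__":
    sample = "Привет, мир! «Hello» #42 @home"
    print(strip_other_symbols(sample))
```

With this pattern the Cyrillic letters survive because \w covers them, while the quotation marks, '#' and '@' in the sample are removed.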

#3


-1  

For practical reasons I suggest using the exact Modern Russian subset of glyphs instead of general Cyrillic. Russian websites never use the full Cyrillic range, which also includes Belarusian, Ukrainian, Slavonic and Macedonian glyphs. For historical reasons I am keeping "\u0463".

//Basic Cyr Unicode range for use on Russian websites. 0401,0406,0410,0411,0412,0413,0414,0415,0416,0417,0418,0419,041A,041B,041C,041D,041E,041F,0420,0421,0422,0423,0424,0425,0426,0427,0428,0429,042A,042B,042C,042D,042E,042F,0430,0431,0432,0433,0434,0435,0436,0437,0438,0439,043A,043B,043C,043D,043E,043F,0440,0441,0442,0443,0444,0445,0446,0447,0448,0449,044A,044B,044C,044D,044E,044F,0451,0462,0463

Using this subset on a multilingual website will save you 60% of bandwidth, in comparison to using the original full range, and will increase page loading speed accordingly.
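
In regex terms, the listed code points collapse to a compact character class, because 0410 through 044F are contiguous. The sketch below is one way to write that class in Python. The names RUSSIAN_CLASS and is_modern_russian are hypothetical, and the pattern is anchored with \Z rather than re.fullmatch so that it does not depend on a newer Python version.

```python
import re

# U+0401 (Ё), U+0406 (І), U+0410 to U+044F (А to я), U+0451 (ё),
# plus U+0462 and U+0463 (the historical letter yat, upper and lower case).
RUSSIAN_CLASS = "[\u0401\u0406\u0410-\u044F\u0451\u0462\u0463]"

# One or more characters from the subset, then end of string.
RUSSIAN_WORD = re.compile(RUSSIAN_CLASS + r"+\Z")


def is_modern_russian(word):
    """Return True if every character of word is in the listed subset."""
    return RUSSIAN_WORD.match(word) is not None


if __name__ == "__main__":
    for word in ["Владивосток", "ёлка", "їжак", "Vladivostok"]:
        print(word, is_modern_russian(word))
```

The Ukrainian word in the usage loop starts with U+0457, which is outside the subset, so it is rejected even though it is Cyrillic.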
