Python raw strings and unicode: how to use Web input as regexp patterns?

EDIT: This question doesn't really make sense once you understand what the "r" flag means. More details here. For people looking for a quick answer, I added one below.

If I enter a regexp manually in a Python script, I can use 4 combinations of flags for my pattern strings:

p1 = "pattern"
p2 = u"pattern"
p3 = r"pattern"
p4 = ru"pattern"

I have a bunch of unicode strings coming from a Web form input and want to use them as regexp patterns.

I want to know what processing I should apply to these strings so that I can expect the same results as with the hand-written forms above. Something like:

import re
assert re.match(p1, some_text) == re.match(someProcess1(web_input), some_text)
assert re.match(p2, some_text) == re.match(someProcess2(web_input), some_text)
assert re.match(p3, some_text) == re.match(someProcess3(web_input), some_text)
assert re.match(p4, some_text) == re.match(someProcess4(web_input), some_text)

What would someProcess1 to someProcessN be, and why?

I suppose that someProcess2 doesn't need to do anything while someProcess1 should do some unicode conversion to the local encoding. For the raw string literals, I am clueless.

3 Solutions

#1

Apart from possibly having to encode Unicode properly (in Python 2.*), no processing is needed because there is no specific type for "raw strings" -- it's just a syntax for literals, i.e. for string constants, and you don't have any string constants in your code snippet, so there's nothing to "process".
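
A minimal sketch of that point, assuming Python 3 (where every `str` is Unicode). The names `web_input` and `find_first` are hypothetical, standing in for your form value and your matching code:

```python
import re

# Hypothetical value standing in for text typed into a web form.
# The user typed these characters: \ d + - \ d +
# (the backslashes are doubled here only because this is a non-raw literal)
web_input = "\\d+-\\d+"

# What a programmer would have typed by hand in a script.
manual = r"\d+-\d+"

# Both names refer to equal strings; "raw" is not a property of the object.
assert web_input == manual


def find_first(pattern, text):
    """Return the first substring of text matching pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


result_manual = find_first(manual, "order 12-345 shipped")
result_web = find_first(web_input, "order 12-345 shipped")
```

The form value goes to `re` unchanged; there is no step that turns it "into" a raw string.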

#2

Note the following in your first example:

>>> p1 = "pattern"
>>> p2 = u"pattern"
>>> p3 = r"pattern"
>>> p4 = ur"pattern" # it's ur"", not ru"" btw
>>> p1 == p2 == p3 == p4
True

While these constructs look different, they all do the same thing: they create a string object (p1 and p3 a str, p2 and p4 a unicode object in Python 2.x) containing the value "pattern". The u, r and ur prefixes just tell the parser how to interpret the quoted string that follows, namely as unicode text (u) and/or as raw text (r), in which backslash escape sequences are not interpreted. In the end it doesn't matter how a string was created, from a raw literal or not: internally it is stored the same way.
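
To see where the prefix does make a difference, here is a small sketch (written for Python 3, which is an assumption; the variable names are made up for illustration). The prefix changes which characters end up in the string, not what kind of object you get:

```python
import re

plain = "a\tb"    # escape interpreted by the parser: a, TAB, b
raw = r"a\tb"     # backslash kept as-is: a, backslash, t, b

lengths = (len(plain), len(raw))          # different contents...
same_type = type(plain) is type(raw)      # ...but the same kind of object

# As regexp patterns both match "a<TAB>b": the first contains a literal TAB,
# the second contains the two characters \t, which the re module itself
# interprets as a TAB.
subject = "a\tb"
both_match = bool(re.match(plain, subject)) and bool(re.match(raw, subject))
```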

When you get unicode text as input, you have to distinguish (in Python 2.x) whether it is a unicode object or a str object. If you want to work with unicode content, you should work only with unicode objects internally and convert all str objects to unicode objects (either with str.decode() or with the u'text' syntax for hard-coded texts). If you instead encode it to your local encoding, you will run into problems with unicode characters that the encoding cannot represent.
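
A rough sketch of the "decode once, then stay in unicode" advice, written for Python 3, where the str/unicode split of Python 2 becomes bytes/str. The sample bytes and the UTF-8 encoding are assumptions; your framework may already hand you decoded text:

```python
import re

# Hypothetical raw form data, assumed to be UTF-8 encoded.
raw_bytes = "caf\u00e9\\s+\\w+".encode("utf-8")

# Decode once at the boundary, then use only text internally.
pattern_text = raw_bytes.decode("utf-8")
text = "caf\u00e9  noir"

match = re.match(pattern_text, text)
matched = match.group(0) if match else None

# Mixing an undecoded bytes pattern with text is rejected by Python 3.
try:
    re.match(raw_bytes, text)
    mixing_rejected = False
except TypeError:
    mixing_rejected = True
```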

A different approach would be using Python 3, whose str object supports unicode directly and stores everything as unicode, so you simply don't need to care about the encoding.

#3

"r" flags just prevent Python from interpreting "\" in a string. Since the Web doesn't care about what kind of data it carries, your web input will be a bunch of bytes you are free to interpret the way you want.

So to address this problem:

be sure you use Unicode (e.g. utf-8) all along the way
when you get the string, it will be Unicode, and "\n", "\t" and "\a" will be literal characters (a backslash followed by a letter), so you don't need to care about whether you need to escape them or not.
