For instance, say I wanted a function to escape a string for use in HTML (as in Django's escape filter):
def escape(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    return string.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')
This works, but it gets ugly quickly and appears to have poor algorithmic performance (in this example, the string is traversed five times in a row). What would be better is something like this:
def escape(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in
    # any order.
    return replace_multi(string.replace('&', '&amp;'),
                         {'<': '&lt;', '>': '&gt;',
                          "'": '&#39;', '"': '&quot;'})
Does such a function exist, or is the standard Python idiom to use what I wrote before?
9 Answers
#1
19
Do you have an application that is running too slow, and did you profile it to find that a line like this snippet is causing it to be slow? Bottlenecks occur in unexpected places.
The current snippet traverses the string 5 times, doing one thing each time. You are suggesting traversing it once, probably doing five things each time (or at least doing something each time). It isn't clear to me that this will automatically do a better job. Currently the algorithm used is O(nm) (assuming the length of the string is longer than the stuff in the rules), where n is the length of the string and m is the number of substitution rules. You could, I think, reduce the algorithmic complexity to something like O(n log m), and in the specific case we're in, where the original things are all only one character (but not in the case of multiple calls to replace in general), to O(n). But this doesn't matter, since m is 5 while n is unbounded.
If m is held constant, then, the complexity of both solutions really goes to O(n). It is not clear to me that trying to turn five simple passes into one complex one is going to be a worthwhile task, the actual time of which I cannot guess at the moment. If there were something about it that could make it scale better, I would have thought it a much more worthwhile task.
Doing everything in one pass rather than in consecutive passes also demands that questions be answered about what to do with conflicting rules and how they are applied. The resolution to these questions is clear with a chain of replace calls.
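To make the rule-ordering point concrete, here is a tiny illustration (mine, not the answerer's) of what can go wrong when the chain is ordered carelessly: if '&' is escaped after the other characters, it re-escapes the ampersands that the earlier substitutions introduced.

def escape_wrong_order(s):
    # '<' first, '&' last: the '&' inside '&lt;' gets escaped again.
    return s.replace('<', '&lt;').replace('&', '&amp;')

def escape_right_order(s):
    # '&' first, then '<': each input character is rewritten at most once.
    return s.replace('&', '&amp;').replace('<', '&lt;')

print escape_wrong_order('a < b')   # a &amp;lt; b   (double-escaped)
print escape_right_order('a < b')   # a &lt; b

With a chain of replace calls this ordering is at least explicit in the code and easy to audit, which is the point made above.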
#2
14
How about we just test various ways of doing this and see which comes out faster (assuming we only care about the fastest way to do it)?
def escape1(input):
    return input.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

translation_table = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    "'": '&#39;',
    '"': '&quot;',
}

def escape2(input):
    return ''.join(translation_table.get(char, char) for char in input)

import re
_escape3_re = re.compile(r'[&<>\'"]')

def _escape3_repl(x):
    s = x.group(0)
    return translation_table.get(s, s)

def escape3(x):
    return _escape3_re.sub(_escape3_repl, x)

def escape4(x):
    # Note: unicode.translate expects a mapping of Unicode *ordinals*, so
    # this str-keyed table never matches and the input comes back unchanged
    # (see the note at the end of this answer).
    return unicode(x).translate(translation_table)

test_strings = (
    'Nothing in there.',
    '<this is="not" a="tag" />',
    'Something & Something else',
    'This one is pretty long. ' * 50,
)

import time
for test_i, test_string in enumerate(test_strings):
    print repr(test_string)
    for func in escape1, escape2, escape3, escape4:
        start_time = time.time()
        for i in xrange(1000):
            x = func(test_string)
        # (Elapsed time is in seconds here, despite the 'ms' in the label.)
        print '\t%s done in %.3fms' % (func.__name__, (time.time() - start_time))
    print
Running this gives you:
'Nothing in there.'
escape1 done in 0.002ms
escape2 done in 0.009ms
escape3 done in 0.001ms
escape4 done in 0.005ms
'<this is="not" a="tag" />'
escape1 done in 0.002ms
escape2 done in 0.012ms
escape3 done in 0.009ms
escape4 done in 0.007ms
'Something & Something else'
escape1 done in 0.002ms
escape2 done in 0.012ms
escape3 done in 0.003ms
escape4 done in 0.007ms
'This one is pretty long. <snip>'
escape1 done in 0.008ms
escape2 done in 0.386ms
escape3 done in 0.011ms
escape4 done in 0.310ms
Looks like just replacing them one after another goes the fastest.
Edit: Running the tests again with 1000000 iterations gives the following for the first three strings (the fourth would take too long on my machine for me to wait =P):
'Nothing in there.'
escape1 done in 0.001ms
escape2 done in 0.008ms
escape3 done in 0.002ms
escape4 done in 0.005ms
'<this is="not" a="tag" />'
escape1 done in 0.002ms
escape2 done in 0.011ms
escape3 done in 0.009ms
escape4 done in 0.007ms
'Something & Something else'
escape1 done in 0.002ms
escape2 done in 0.011ms
escape3 done in 0.003ms
escape4 done in 0.007ms
The numbers are pretty much the same. In the first case they are actually even more consistent as the direct string replacement is fastest now.
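One caveat worth flagging (my note, not part of the answer): unicode.translate expects a mapping of Unicode ordinals to replacement strings, so the str-keyed translation_table above never matches anything and escape4 returns its input unchanged; its timings measure a translation that never fires. A minimal corrected sketch, assuming the translation_table from the code above:

ordinal_table = dict((ord(key), unicode(value))
                     for key, value in translation_table.items())

def escape4_fixed(x):
    # unicode.translate looks characters up by ordinal, so this table
    # actually takes effect, still in a single pass over the string.
    return unicode(x).translate(ordinal_table)

assert escape4_fixed(u'a < b') == u'a &lt; b'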
#3
13
I prefer something clean like:
substitutions = [
    # Ampersands first, so later rules don't get double-escaped.
    ('&', '&amp;'),
    ('<', '&lt;'),
    ('>', '&gt;'),
    ("'", '&#39;'),
    ('"', '&quot;'),
]
for search, replacement in substitutions:
    string = string.replace(search, replacement)
#4
7
That's what Django does:
def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))
#5
4
You can use reduce:
reduce(lambda s, r: s.replace(*r),
       [('&', '&amp;'),
        ('<', '&lt;'),
        ('>', '&gt;'),
        ("'", '&#39;'),
        ('"', '&quot;')],
       string)
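For completeness, a small wrapper around the reduce idiom (my sketch, not part of the answer); note that in Python 3 reduce has to be imported from functools:

ESCAPE_PAIRS = [('&', '&amp;'), ('<', '&lt;'), ('>', '&gt;'),
                ("'", '&#39;'), ('"', '&quot;')]

def escape_reduce(string):
    # Fold the pair list over the string, one replace per pair, '&' first.
    return reduce(lambda s, r: s.replace(*r), ESCAPE_PAIRS, string)

print escape_reduce('"a" & <b>')  # &quot;a&quot; &amp; &lt;b&gt;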
#6
4
In accordance with bebraw's suggestion, here is what I ended up using (in a separate module, of course):
import re

class Subs(object):
    """
    A container holding strings to be searched for and replaced in
    replace_multi().

    Holds little relation to the sandwich.
    """
    def __init__(self, needles_and_replacements):
        """
        Returns a new instance of the Subs class, given a dictionary holding
        the keys to be searched for and the values to be used as replacements.
        """
        self.lookup = needles_and_replacements
        self.regex = re.compile('|'.join(map(re.escape,
                                             needles_and_replacements)))

def replace_multi(string, subs):
    """
    Replaces given items in string efficiently in a single pass.

    "string" should be the string to be searched.

    "subs" can be either:

        A.) a dictionary containing as its keys the items to be
            searched for and as its values the replacements,

        or B.) a pre-compiled instance of the Subs class from this module
               (which may have slightly better performance if this is
               called often).
    """
    if not isinstance(subs, Subs):  # Assume dictionary if not our class.
        subs = Subs(subs)
    lookup = subs.lookup
    return subs.regex.sub(lambda match: lookup[match.group(0)], string)
Example usage:
def escape(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in
    # any order. (As written, this rebuilds the Subs instance, and recompiles
    # its regex, on every call.)
    escape.subs = Subs({'<': '&lt;', '>': '&gt;', "'": '&#39;', '"': '&quot;'})
    return replace_multi(string.replace('&', '&amp;'), escape.subs)
Much better :). Thanks for the help.
Edit
Nevermind, Mike Graham was right. I benchmarked it and the replacement ends up actually being much slower.
Code:
from urllib2 import urlopen
import timeit

def escape1(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    return string.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

def escape2(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in
    # any order.
    escape2.subs = Subs({'<': '&lt;', '>': '&gt;', "'": '&#39;', '"': '&quot;'})
    return replace_multi(string.replace('&', '&amp;'), escape2.subs)

# An example test on the Stack Overflow homepage.
request = urlopen('http://stackoverflow.com')
test_string = request.read()
request.close()

test1 = timeit.Timer('escape1(test_string)',
                     setup='from __main__ import escape1, test_string')
test2 = timeit.Timer('escape2(test_string)',
                     setup='from __main__ import escape2, test_string')
print 'multi-pass:', test1.timeit(2000)
print 'single-pass:', test2.timeit(2000)
Output:
multi-pass: 15.9897229671
single-pass: 66.5422530174
So much for that.
#7
3
Apparently it's pretty common to implement that via regex. You can find an example of this at ASPN and here.
#8
1
If you work with non-Unicode strings and Python < 3.0, try an alternate translate method:
# Python < 3.0
import itertools

def escape(a_string):
    # Start from an identity table for all 256 byte values...
    replacer = dict((chr(c), chr(c)) for c in xrange(256))
    # ...then override the five characters that need escaping.
    replacer.update(
        {'&': '&amp;',
         '<': '&lt;',
         '>': '&gt;',
         '"': '&quot;',
         "'": '&#39;'}
    )
    return ''.join(itertools.imap(replacer.__getitem__, a_string))

if __name__ == "__main__":
    print escape('''"Hello"<i> to George's friend&co.''')
$ python so2484156.py
&quot;Hello&quot;&lt;i&gt; to George&#39;s friend&amp;co.
This is closer to a "single scan" of the input string, as per your wish.
EDIT
My intention was to create a unicode.translate equivalent that was not restricted to single-character replacements, so I came up with the answer above; I got a comment by user "flow" that was almost completely out of context, with a single correct point: the code above, as is, is intended to work with byte strings and not unicode strings. There is an obvious update (i.e. unichr() … xrange(sys.maxunicode+1)) which I strongly dislike, so I came up with another function that works on both unicode and byte strings, given that Python guarantees:
all( (chr(i)==unichr(i) and hash(chr(i))==hash(unichr(i)))
for i in xrange(128)) is True
The new function follows:
def escape(a_string):
    replacer = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;',
    }
    return ''.join(
        itertools.starmap(
            replacer.get,  # .setdefault *might* be faster
            itertools.izip(a_string, a_string)
        )
    )
Notice the use of starmap with a sequence of tuples: for any character not in the replacer dict, return said character.
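For comparison, here is what is arguably the idiomatic single-pass spelling of the same idea (my addition, not part of the answer): an ordinal-keyed table fed to unicode.translate, which, unlike the 8-bit table above, supports multi-character replacement values directly and carries over to str.translate in Python 3:

replacer_by_ordinal = dict(
    (ord(key), value) for key, value in
    {'&': u'&amp;', '<': u'&lt;', '>': u'&gt;',
     '"': u'&quot;', "'": u'&#39;'}.items())

def escape_translate(a_string):
    # Characters without an entry in the table pass through unchanged.
    return unicode(a_string).translate(replacer_by_ordinal)

print escape_translate(u'"Hello"<i> to George\'s friend&co.')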
#9
1
ok so i sat down and did the math. pls do not get mad at me; i answer specifically discussing ΤΖΩΤΖΙΟΥ's solution, but this would be somewhat hard to shoehorn inside a comment, so let me do it this way. i will, in fact, also air some considerations that are relevant to the OP's question.
first up, i have been discussing with ΤΖΩΤΖΙΟΥ the elegance, correctness, and viability of his approach. it turns out that the proposal, while it does use an (inherently unordered) dictionary as a register to store the substitution pairs, does in fact consistently return correct results, where i had claimed it wouldn't. this is because the call to itertools.starmap() in line 11, below, gets as its second argument an iterator over pairs of single characters/bytes (more on that later) that looks like [ ( 'h', 'h', ), ( 'e', 'e', ), ( 'l', 'l', ), ... ]. these pairs of characters/bytes are what the first argument, replacer.get, is repeatedly called with. there is no chance to run into a situation where first '>' is transformed into '&gt;' and then inadvertently escaped again into '&amp;gt;', because each character/byte is considered only once for substitution. so this part is in principle fine and algorithmically correct.
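(a quick illustration of that pair stream, added here for clarity; not part of the original answer:)

>>> list( zip( 'he>', 'he>' ) )
[('h', 'h'), ('e', 'e'), ('>', '>')]
>>> import itertools
>>> list( itertools.starmap( {'>': '&gt;'}.get, zip( 'he>', 'he>' ) ) )
['h', 'e', '&gt;']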
the next question is viability, and that would include a look at performance. if a vital task gets correctly completed in 0.01s using awkward code but in 1s using awesome code, then awkward might be considered preferable in practice (but only if the 1 second loss is in fact intolerable). here is the code i used for testing; it includes a number of different implementations. it is written in python 3.1 so we can use unicode greek letters for identifiers, which in itself is awesome (zip in py3k returns the same as itertools.izip in py2):
import itertools                                                      #01
                                                                      #02
_replacements = {                                                     #03
    '&': '&amp;',                                                     #04
    '<': '&lt;',                                                      #05
    '>': '&gt;',                                                      #06
    '"': '&quot;',                                                    #07
    "'": '&#39;', }                                                   #08
                                                                      #09
def escape_ΤΖΩΤΖΙΟΥ( a_string ):                                      #10
    return ''.join(                                                   #11
        itertools.starmap(                                            #12
            _replacements.get,                                        #13
            zip( a_string, a_string ) ) )                             #14
                                                                      #15
def escape_SIMPLE( text ):                                            #16
    return ''.join( _replacements.get( chr, chr ) for chr in text )  #17
                                                                      #18
def escape_SIMPLE_optimized( text ):                                  #19
    get = _replacements.get                                           #20
    return ''.join( get( chr, chr ) for chr in text )                 #21
                                                                      #22
def escape_TRADITIONAL( text ):                                       #23
    return ( text.replace('&', '&amp;').replace('<', '&lt;')          #24
        .replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;') )  #25
these are the timing results:
escaping with SIMPLE took 5.74664253sec for 100000 items
escaping with SIMPLE_optimized took 5.11457801sec for 100000 items
escaping TRADITIONAL in-situ took 0.57543013sec for 100000 items
escaping with TRADITIONAL took 0.62347413sec for 100000 items
escaping a la ΤΖΩΤΖΙΟΥ took 2.66592320sec for 100000 items
it turns out the original poster's concern that the 'traditional' method gets 'ugly quickly and appears to have poor algorithmic performance' appears partially unwarranted when put into this context. it actually performs best; when stashed away into a function call, we do get to see an 8% performance penalty ('calling methods is expensive', but in general you should still do it). in comparison, ΤΖΩΤΖΙΟΥ's implementation takes around 5 times as long as the traditional method, which, given its higher complexity that has to compete with python's long-honed, optimized string methods, is no surprise.
there is yet another algorithm here, the SIMPLE one. as far as i can see, this very much does exactly what ΤΖΩΤΖΙΟΥ's method does: it iterates over the characters/bytes in the text and performs a lookup for each, then joins all the characters/bytes together and returns the resulting escaped text. you can see that where one way to do that involves a fairly lengthy and mysterious formulation, the SIMPLE implementation is actually understandable at a glance.
what really trips me up here, though, is how bad the SIMPLE approach is performance-wise: it is around 10 times as slow as the traditional one, and also twice as slow as ΤΖΩΤΖΙΟΥ's method. i am completely at a loss here; maybe someone can come up with an idea why this should be so. it uses only the most basic building blocks of python and works with two implicit iterations, so it avoids building throw-away lists and everything, but it is still slow, and i don't know why.
let me conclude this code review with a remark on the merit of ΤΖΩΤΖΙΟΥ's solution. i have made it sufficiently clear that i find the code hard to read and too overblown for the task at hand. more critical than that, however, i find the way he treats characters, making sure that for a given small range of characters they will behave in a byte-like fashion, a little irritating. sure it works for the task at hand, but as soon as i iterate e.g. over the bytestring 'ΤΖΩΤΖΙΟΥ', what i do is iterate over adjacent bytes representing single characters. in most situations this is exactly what you should avoid; this is precisely the reason why in py3k 'strings' are now the 'unicode objects' of old, and the 'strings' of old have become 'bytes' and 'bytearray'. if i were to nominate the one feature of py3k that could warrant a possibly expensive migration of code from the 2 series to the 3 series, it would be this single property of py3k. 98% of all my encoding issues have just dissolved ever since, period, and there is no clever hack that could have me seriously doubt my move. said algorithm is not 'conceptually 8bit-clean and unicode safe', which to me is a serious shortcoming, given this is 2010.