Note: I'm a Ruby developer trying to find my way in Python.
When I wanted to figure out why some scripts use mylist[:] instead of list(mylist) to duplicate lists, I made a quick benchmark of the various methods of duplicating range(10) (see code below).
EDIT: I updated the tests to use Python's timeit as suggested below. This makes it impossible to compare directly with Ruby, because timeit doesn't account for the looping while Ruby's Benchmark does, so the Ruby code is for reference only.
Python 2.7.2
Array duplicating. Tests run 50000000 times
list(a) 18.7599430084
copy(a) 59.1787488461
a[:] 9.58828091621
a[0:len(a)] 14.9832749367
For reference, I wrote the same script in Ruby too:
Ruby 1.9.2p0
Array duplicating. Tests 50000000 times
user system total real
Array.new(a) 14.590000 0.030000 14.620000 ( 14.693033)
Array[*a] 18.840000 0.060000 18.900000 ( 19.156352)
a.take(a.size) 8.780000 0.020000 8.800000 ( 8.805700)
a.clone 16.310000 0.040000 16.350000 ( 16.384711)
a[0,a.size] 8.950000 0.020000 8.970000 ( 8.990514)
Question 1: What is mylist[:] doing differently that makes it 25% faster than even mylist[0:len(mylist)]? Does it copy the memory directly, or what?
Question 2: EDIT: the updated benchmarks no longer show huge differences between Python and Ruby. Originally: did I implement the tests in some obviously inefficient way, so that the Ruby code is so much faster than Python's?
Now the code listings:
Python:
import timeit
COUNT = 50000000
print "Array duplicating. Tests run", COUNT, "times"
setup = 'a = range(10); import copy'
print "list(a)\t\t", timeit.timeit(stmt='list(a)', setup=setup, number=COUNT)
print "copy(a)\t\t", timeit.timeit(stmt='copy.copy(a)', setup=setup, number=COUNT)
print "a[:]\t\t", timeit.timeit(stmt='a[:]', setup=setup, number=COUNT)
print "a[0:len(a)]\t", timeit.timeit(stmt='a[0:len(a)]', setup=setup, number=COUNT)
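(A side note not in the original question: on Python 3, range() no longer returns a list, and lists gained a .copy() method in 3.3, so a roughly equivalent benchmark might look like the following sketch. The reduced COUNT is my choice for a quicker run.)

```python
import timeit

COUNT = 1_000_000  # scaled down from the original 50_000_000 for a quick run

# In Python 3, range() is lazy, so build the list explicitly in setup.
setup = 'import copy; a = list(range(10))'

print("Array duplicating. Tests run", COUNT, "times")
for label, stmt in [
    ("list(a)", "list(a)"),
    ("copy.copy(a)", "copy.copy(a)"),
    ("a[:]", "a[:]"),
    ("a[0:len(a)]", "a[0:len(a)]"),
    ("a.copy()", "a.copy()"),  # list.copy() was added in Python 3.3
]:
    print("%-14s %f" % (label, timeit.timeit(stmt, setup=setup, number=COUNT)))
```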
Ruby:
require 'benchmark'
a = (0...10).to_a
COUNT = 50_000_000
puts "Array duplicating. Tests #{COUNT} times"
Benchmark.bm(16) do |x|
x.report("Array.new(a)") {COUNT.times{ Array.new(a) }}
x.report("Array[*a]") {COUNT.times{ Array[*a] }}
x.report("a.take(a.size)") {COUNT.times{ a.take(a.size) }}
x.report("a.clone") {COUNT.times{ a.clone }}
x.report("a[0,a.size]"){COUNT.times{ a[0,a.size] }}
end
2 Answers
#1
Use the timeit module in Python for testing timings.
from copy import *

a = range(1000)

def cop():
    b = copy(a)

def func1():
    b = list(a)

def slice():
    b = a[:]

def slice_len():
    b = a[0:len(a)]

if __name__ == "__main__":
    import timeit
    print "copy(a)", timeit.timeit("cop()", setup="from __main__ import cop")
    print "list(a)", timeit.timeit("func1()", setup="from __main__ import func1")
    print "a[:]", timeit.timeit("slice()", setup="from __main__ import slice")
    print "a[0:len(a)]", timeit.timeit("slice_len()", setup="from __main__ import slice_len")
Results:
copy(a) 3.98940896988
list(a) 2.54542589188
a[:] 1.96630120277 #winner
a[0:len(a)] 10.5431251526
Surely the extra steps involved in a[0:len(a)] are the reason for its slowness.
Here's the bytecode comparison of the two:
In [19]: dis.dis(func1)
2 0 LOAD_GLOBAL 0 (range)
3 LOAD_CONST 1 (100000)
6 CALL_FUNCTION 1
9 STORE_FAST 0 (a)
3 12 LOAD_FAST 0 (a)
15 SLICE+0
16 STORE_FAST 1 (b)
19 LOAD_CONST 0 (None)
22 RETURN_VALUE
In [20]: dis.dis(func2)
2 0 LOAD_GLOBAL 0 (range)
3 LOAD_CONST 1 (100000)
6 CALL_FUNCTION 1
9 STORE_FAST 0 (a)
3 12 LOAD_FAST 0 (a) #same up to here
15 LOAD_CONST 2 (0) #loads 0
18 LOAD_GLOBAL 1 (len) # loads the builtin len(),
# so it might take some lookup time
21 LOAD_FAST 0 (a)
24 CALL_FUNCTION 1
27 SLICE+3
28 STORE_FAST 1 (b)
31 LOAD_CONST 0 (None)
34 RETURN_VALUE
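The LOAD_GLOBAL for len above suggests a small follow-up experiment (my addition, not part of the answer): a name defined in timeit's setup string becomes a local inside the generated timing function, so pre-binding len as _len should replace the global/builtin lookup with a fast local load:

```python
import timeit

setup = 'a = list(range(10))'

# Plain version: len is looked up as a global/builtin on every call.
slow = timeit.timeit('a[0:len(a)]', setup=setup, number=1000000)

# Pre-bound version: _len is defined in setup, so inside timeit's
# generated timing function it is accessed with LOAD_FAST instead.
fast = timeit.timeit('a[0:_len(a)]', setup=setup + '; _len = len',
                     number=1000000)

print(slow, fast)
```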
#2
I can't comment on the Ruby timing vs. the Python timing, but I can comment on list vs. slice. Here's a quick inspection of the bytecode:
>>> import dis
>>> a = range(10)
>>> def func(a):
... return a[:]
...
>>> def func2(a):
... return list(a)
...
>>> dis.dis(func)
2 0 LOAD_FAST 0 (a)
3 SLICE+0
4 RETURN_VALUE
>>> dis.dis(func2)
2 0 LOAD_GLOBAL 0 (list)
3 LOAD_FAST 0 (a)
6 CALL_FUNCTION 1
9 RETURN_VALUE
Notice that list requires a LOAD_GLOBAL to find the function list. Looking up globals (and calling functions) in Python is relatively slow. This would explain why a[0:len(a)] is also slower. Also remember that list needs to be able to handle arbitrary iterators, whereas slicing doesn't. This means that list needs to allocate a new list, pack elements into that list as it iterates over the input, and resize when necessary. There are a few things in here which are expensive: resizing if necessary and iterating (effectively in Python, not C). With the slicing method, you can calculate the size of the memory you'll need, so you can probably avoid resizing, and the iteration can be done completely in C (probably with a memcpy or something).
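As an illustration of the "arbitrary iterators" point (my example, not the answerer's): list() will happily consume a one-shot generator, while slicing requires a real sequence:

```python
# list() accepts any iterable, including a one-shot generator ...
gen = (x * x for x in range(5))
print(list(gen))  # -> [0, 1, 4, 9, 16]

# ... whereas slicing goes through the sequence protocol; a generator
# has no __getitem__, so slicing it raises TypeError.
try:
    (x * x for x in range(5))[:]
except TypeError as exc:
    print("slicing a generator fails:", exc)
```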
Disclaimer: I'm not a Python dev, so I don't know for sure how the internals of list() are implemented. I'm just speculating based on what I know of the specification.
EDIT: So I've looked at the source (with a little guidance from Martijn). The relevant code is in listobject.c. list calls list_init, which then calls listextend at line 799. That function has some checks to see if it can use a fast branch if the object is a list or a tuple (line 812). Finally, the heavy lifting is done starting at line 834:
src = PySequence_Fast_ITEMS(b);
dest = self->ob_item + m;
for (i = 0; i < n; i++) {
    PyObject *o = src[i];
    Py_INCREF(o);
    dest[i] = o;
}
Compare that to the slice version, which I think is defined in list_subscript (line 2544). That calls list_slice (line 2570), where the heavy lifting is done by the following loop (line 486):
src = a->ob_item + ilow;
dest = np->ob_item;
for (i = 0; i < len; i++) {
    PyObject *v = src[i];
    Py_INCREF(v);
    dest[i] = v;
}
They're pretty much the same code, so it's not surprising that the performance is almost the same for large lists (where the overhead of small stuff like unpacking slices, looking up global variables, etc. becomes less important).
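One way to watch that overhead get amortized (my sketch, not part of the answer) is to time both spellings at a small and a large list size and compare the ratios:

```python
import timeit

# As the list grows, the fixed per-call overhead of list(a) (name
# lookup, call machinery) is amortized over the C-level element copy,
# so the list(a) / a[:] ratio should shrink toward 1.
for n in (10, 100000):
    setup = 'a = list(range(%d))' % n
    t_list = timeit.timeit('list(a)', setup=setup, number=1000)
    t_slice = timeit.timeit('a[:]', setup=setup, number=1000)
    print('n=%-6d list(a)=%.6f a[:]=%.6f ratio=%.2f'
          % (n, t_list, t_slice, t_list / t_slice))
```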
Here's how I would run the Python tests (and the results for my Ubuntu system):
$ python -m timeit -s 'a=range(30)' 'list(a)'
1000000 loops, best of 3: 0.39 usec per loop
$ python -m timeit -s 'a=range(30)' 'a[:]'
10000000 loops, best of 3: 0.183 usec per loop
$ python -m timeit -s 'a=range(30)' 'a[0:len(a)]'
1000000 loops, best of 3: 0.254 usec per loop