I have a very large dictionary with thousands of elements. I need to execute a function with this dictionary as a parameter. Instead of passing the whole dictionary in a single call, I want to execute the function in batches, with x key-value pairs of the dictionary at a time.
I am doing the following:
mydict = ...  # some large hash
x = ...       # batch size

def some_func(data):
    ...  # do something on data

temp = {}
for key, value in mydict.iteritems():
    if len(temp) != 0 and len(temp) % x == 0:
        some_func(temp)
        temp = {}
        temp[key] = value
    else:
        temp[key] = value

if temp != {}:
    some_func(temp)
This looks very hackish to me. I want to know if there is an elegant/better way of doing this.
3 Answers
#1
6
I often use this little utility:
import itertools

def chunked(it, size):
    it = iter(it)
    while True:
        p = tuple(itertools.islice(it, size))
        if not p:
            break
        yield p
For your use case:
for chunk in chunked(big_dict.iteritems(), batch_size):
    func(chunk)
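The answer above targets Python 2, where `dict.iteritems()` exists. A minimal sketch of the same utility on Python 3, where `dict.items()` plays that role (the `big_dict` contents here are made up for illustration):

```python
import itertools

def chunked(it, size):
    """Yield tuples of up to `size` items from any iterable."""
    it = iter(it)
    while True:
        p = tuple(itertools.islice(it, size))
        if not p:
            break
        yield p

# dict.items() replaces Python 2's iteritems() on Python 3.
big_dict = {i: i * i for i in range(10)}
batches = list(chunked(big_dict.items(), 4))
print([len(b) for b in batches])  # batch sizes: [4, 4, 2]
```

Each batch is a tuple of `(key, value)` pairs; wrap it in `dict(...)` if the called function expects a dictionary.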
#2
1
Here are two solutions adapted from earlier answers of mine.
Either you can get the list of items from the dictionary and create new dicts from slices of that list. This is not optimal, though, as it does a lot of copying of that huge dictionary.
def chunks(dictionary, size):
    items = dictionary.items()
    return (dict(items[i:i+size]) for i in range(0, len(items), size))
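Note that on Python 3, `dict.items()` returns a view that cannot be sliced; materializing it with `list()` first keeps this approach working. A sketch, assuming Python 3 and a small made-up dictionary:

```python
def chunks(dictionary, size):
    # list() materializes the items view so it supports slicing (Python 3).
    items = list(dictionary.items())
    return (dict(items[i:i + size]) for i in range(0, len(items), size))

d = {chr(65 + i): i for i in range(7)}  # {'A': 0, ..., 'G': 6}
parts = list(chunks(d, 3))
print([len(p) for p in parts])  # [3, 3, 1]
```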
Alternatively, you can use some of the itertools module's functions to yield (generate) new sub-dictionaries as you loop. This is similar to @georg's answer, just using a for loop.
from itertools import chain, islice

def chunks(dictionary, size):
    iterator = dictionary.iteritems()
    for first in iterator:
        yield dict(chain([first], islice(iterator, size - 1)))
Example usage, for both cases:
mydict = {i+1: chr(i+65) for i in range(26)}
for sub_d in chunks(mydict, 10):
    some_func(sub_d)
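A quick check that the generator version visits every pair exactly once, sketched for Python 3 (`dictionary.iteritems()` becomes `iter(dictionary.items())`):

```python
from itertools import chain, islice

def chunks(dictionary, size):
    iterator = iter(dictionary.items())  # iteritems() on Python 2
    for first in iterator:
        # islice pulls up to size-1 more pairs from the shared iterator.
        yield dict(chain([first], islice(iterator, size - 1)))

mydict = {i + 1: chr(i + 65) for i in range(26)}
seen = {}
for sub_d in chunks(mydict, 10):
    seen.update(sub_d)
print(len(seen))  # 26: every pair appears in exactly one sub-dict
```

Because all sub-dicts draw from one shared iterator, no slice of the full item list is ever copied.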
#3
0
From more-itertools:
from itertools import izip_longest  # zip_longest on Python 3

_marker = object()  # unique sentinel used as fill padding

def chunked(iterable, n):
    """Break an iterable into lists of a given length::

        >>> list(chunked([1, 2, 3, 4, 5, 6, 7], 3))
        [[1, 2, 3], [4, 5, 6], [7]]

    If the length of ``iterable`` is not evenly divisible by ``n``, the last
    returned list will be shorter.

    This is useful for splitting up a computation on a large number of keys
    into batches, to be pickled and sent off to worker processes. One example
    is operations on rows in MySQL, which does not implement server-side
    cursors properly and would otherwise load the entire dataset into RAM on
    the client.
    """
    # Doesn't seem to run into any number-of-args limits.
    for group in (list(g) for g in izip_longest(*[iter(iterable)] * n,
                                                fillvalue=_marker)):
        if group[-1] is _marker:
            # If this is the last group, shuck off the padding:
            del group[group.index(_marker):]
        yield group
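Applied to the original dict-batching question, a sketch assuming a Python 3 port (where `izip_longest` is named `zip_longest`):

```python
from itertools import zip_longest  # izip_longest on Python 2

_marker = object()  # unique sentinel used as fill padding

def chunked(iterable, n):
    # zip_longest over n copies of one iterator groups items n at a time.
    for group in (list(g) for g in zip_longest(*[iter(iterable)] * n,
                                               fillvalue=_marker)):
        if group[-1] is _marker:
            # Last group: trim the sentinel padding.
            del group[group.index(_marker):]
        yield group

mydict = {i: str(i) for i in range(7)}
batches = [dict(batch) for batch in chunked(mydict.items(), 3)]
print([len(b) for b in batches])  # [3, 3, 1]
```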