I have a list of dictionaries that all have the same structure within the list. For example:
test_data = [{'id':1, 'value':'one'}, {'id':2, 'value':'two'}, {'id':3, 'value':'three'}]
What I need to do is compare each of these dictionaries and return the "similar" dictionaries based on a key/value pair. For example, given the key value and the search value oen, I want to find all dictionaries whose value is a close match to oen, which in this case would be [{'id':1, 'value':'one'}].
The difflib module has a function, get_close_matches, which is close to what I need. I'm able to extract the values of the specific key using a list comprehension and then compare those values to my search:
from difflib import get_close_matches
values = [item['value'] for item in test_data]
found_vals = get_close_matches('oen', values)  # returns ['one']
What I need this to do is go one step further and tie everything back together with the original dictionary:
In [1]: get_close_dicts('oen', test_data, 'value')
Out [1]: [{'id':1, 'value':'one'}]
Note: The list of dictionaries is quite large, and therefore I'm hoping to be as efficient/fast as possible.
3 solutions
#1
2
You can create a reverse lookup dict prior to running get_close_dicts on your data, so that once you have a set of values returned, you can use them to look up the relevant dict(s).
If you're guaranteed to have unique values across your dicts for the 'value' key, then you can do:
reverselookup = {thedict['value']:thedict for thedict in test_data}
If, however, you need to handle the case where multiple dicts will have the same value for the 'value' key, then you need to map all of them (this will give you a dict where the key is the value in 'value' and the value is the list of dicts that have that value):
from collections import defaultdict
reverselookup = defaultdict(list)
for testdict in test_data:
    reverselookup[testdict['value']].append(testdict)
For example, if your test data had an extra dict in it like this:
>>> test_data = [{'id':1, 'value':'one'}, {'id':2, 'value':'two'},
{'id':3, 'value':'three'}, {'id':4, 'value':'three'}]
Then the above reverse lookup construction would give you this:
{
    "three": [
        {
            "id": 3,
            "value": "three"
        },
        {
            "id": 4,
            "value": "three"
        }
    ],
    "two": [
        {
            "id": 2,
            "value": "two"
        }
    ],
    "one": [
        {
            "id": 1,
            "value": "one"
        }
    ]
}
Then, after you have your values, just retrieve the dicts. If you built the list-of-lists lookup, chain the results together; with the first, unique-value lookup there is nothing to chain:
from itertools import chain
chain(*[reverselookup[val] for val in found_vals])
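Putting the pieces of this answer together, here is a minimal sketch of what get_close_dicts could look like. The helper name build_reverse_lookup and the n and cutoff parameters (passed straight through to difflib.get_close_matches) are my own additions for illustration, not something from the question.

```python
from collections import defaultdict
from difflib import get_close_matches
from itertools import chain


def build_reverse_lookup(dicts, key):
    """Map each distinct value under `key` to the list of dicts that carry it."""
    lookup = defaultdict(list)
    for d in dicts:
        lookup[d[key]].append(d)
    return lookup


def get_close_dicts(word, dicts, key, n=3, cutoff=0.6):
    """Return the dicts whose `key` value is a close match to `word`."""
    lookup = build_reverse_lookup(dicts, key)
    # Matching runs over the distinct values only, so duplicates cost nothing extra.
    found_vals = get_close_matches(word, list(lookup), n=n, cutoff=cutoff)
    # Flatten the per-value lists back into one list of dicts.
    return list(chain.from_iterable(lookup[val] for val in found_vals))


test_data = [{'id': 1, 'value': 'one'}, {'id': 2, 'value': 'two'},
             {'id': 3, 'value': 'three'}, {'id': 4, 'value': 'three'}]
print(get_close_dicts('oen', test_data, 'value'))
```

If you query the same list many times, it is probably worth calling build_reverse_lookup once and reusing the result rather than rebuilding it on every call as this sketch does.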
#2
0
You could:
return [d for d in test_data if get_close_matches('oen', [d['value']])]
Note that get_close_matches can return more than one result.
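As a rough sketch, the one-liner above can be wrapped in a function. The name filter_close_dicts and the cutoff parameter are hypothetical; cutoff is simply forwarded to difflib.get_close_matches. Because each dict is tested on its own, this version is not limited by the default n=3 cap that applies when you pass the whole list of values at once.

```python
from difflib import get_close_matches


def filter_close_dicts(word, dicts, key, cutoff=0.6):
    """Keep every dict whose `key` value scores at least `cutoff` against `word`."""
    # A one-element candidate list yields either [value] or [], so the
    # return value of get_close_matches doubles as a truth test.
    return [d for d in dicts
            if get_close_matches(word, [d[key]], n=1, cutoff=cutoff)]


test_data = [{'id': 1, 'value': 'one'}, {'id': 2, 'value': 'two'},
             {'id': 3, 'value': 'three'}]
print(filter_close_dicts('oen', test_data, 'value'))
```

The trade-off is one get_close_matches call per dict, so the order of the original list is preserved but nothing is ranked by similarity.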
#3
0
No matter what, you're going to end up iterating through every dictionary at some point. There's no getting around that. What you can do is get all the work done in a preprocessing phase, to make your actual calls to the function immediate.
As ValAyal mentioned, a reverse lookup dictionary is a good idea here. I'm imagining a dictionary value_dict, where the key is the value from the first dictionary, and the value contains both exact and similar value matches. Take this example with d1 and d2, which are in the list that you want to search. If
d1 = {'id':1, 'value':'one'}
d2 = {'id':3, 'value':'oen'}
Then:
value_dict["one"] = {"exact": [d1], "close": [d2]}
value_dict["oen"] = {"exact": [d2], "close": [d1]}
Whenever you insert a dictionary that has an already-seen value, you can immediately determine all the exact and close matches (just by looking up that value), and add to the various lists accordingly. If you have a new value that hasn't been seen before, you'd have to compare it to all the values currently in the value_dict. For example, if you wanted to add
d3 = {'id':5, 'value':'one'}
You'd look up value_dict["one"] and get both the exact and close lists. These lists include all of the other value_dict entries you need to modify. You'd need to add to the exact matches of one and the close matches of oen; both these values you can get from the returned lists. You end up with
value_dict["one"] = {"exact": [d1, d3], "close": [d2]}
value_dict["oen"] = {"exact": [d2], "close": [d1, d3]}
So once all that preprocessing is done, your function becomes simpler: something like get_close_dicts(val) (I don't know what the third argument does in your example) can just do return value_dict[val]["exact"] + value_dict[val]["close"]. You now have a function that gives an immediate answer.
The preprocessing step is pretty complex, but the resulting speedup in get_close_dicts will hopefully make up for it. I can elaborate on this more when I get back from work, if you want to know how to implement this. Hopefully this can give you a good idea of a helpful data structure, and I didn't horrendously overthink this.
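For what it's worth, here is one way the preprocessing could be sketched. Everything beyond the "exact" and "close" lists is an assumption of mine: the is_close test (a difflib.SequenceMatcher ratio with a 0.6 cutoff), the extra "neighbours" list that records which other values are close, and the add_dict name. get_close_dicts here also takes value_dict as an explicit argument.

```python
from difflib import SequenceMatcher


def is_close(a, b, cutoff=0.6):
    """Assumed similarity test: difflib ratio at or above `cutoff`."""
    # ratio() can depend on argument order, so sort to keep the test symmetric.
    a, b = sorted((a, b))
    return SequenceMatcher(None, a, b).ratio() >= cutoff


def add_dict(value_dict, d, key, cutoff=0.6):
    """Insert dict `d` into `value_dict`, keeping the exact/close lists in sync."""
    val = d[key]
    if val not in value_dict:
        # New value: compare it once against every value seen so far.
        entry = {"exact": [], "close": [], "neighbours": []}
        for other, other_entry in value_dict.items():
            if is_close(val, other, cutoff):
                entry["neighbours"].append(other)
                other_entry["neighbours"].append(val)
                entry["close"].extend(other_entry["exact"])
        value_dict[val] = entry
    entry = value_dict[val]
    entry["exact"].append(d)
    # Every value close to `val` gains `d` as a close match.
    for other in entry["neighbours"]:
        value_dict[other]["close"].append(d)


def get_close_dicts(value_dict, val):
    """Constant-time lookup; only works for values already in `value_dict`."""
    entry = value_dict.get(val)
    return entry["exact"] + entry["close"] if entry else []


d1 = {'id': 1, 'value': 'one'}
d2 = {'id': 3, 'value': 'oen'}
d3 = {'id': 5, 'value': 'one'}
value_dict = {}
for d in (d1, d2, d3):
    add_dict(value_dict, d, 'value')
print(get_close_dicts(value_dict, 'oen'))
```

Note that the lookup only answers immediately for search terms that already occur as a value in the data; a term that has never been seen would still need a pass over the distinct values.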