在Python中跨多个列表查找最相似的数字

时间:2022-09-30 04:20:24

In Python, I have 3 lists of floating-point numbers (angles), in the range 0-360, and the lists are not the same length. I need to find the triplet (with 1 number from each list) in which the numbers are the closest. (It's highly unlikely that any of the numbers will be identical, since this is real-world data.) I was thinking of using a simple lowest-standard-deviation method to measure agreement, but I'm not sure of a good way to implement this. I could loop through each list, comparing the standard deviation of every possible combination using nested for loops, and have a temporary variable save the indices of the triplet that agrees the best, but I was wondering if anyone had a better or more elegant way to do something like this. Thanks!

在Python中,我有3个浮点数(角度)列表,范围为0-360,而且列表的长度不相同。我需要找到一个三元组(每个列表中有一个数字),其中的数字是最接近的。(因为这是真实世界的数据,所以任何数字都不可能是相同的。)我正在考虑使用一个简单的最低标准偏差方法来度量协议,但是我不确定是否有一个好的方法来实现它。我可以遍历每个列表,比较所有可能的组合的标准差使用嵌套循环,和有一个临时变量保存指数的三联体同意最好的,但我想知道如果任何人有更好或更优雅的方式这样做。谢谢!

1 个解决方案

#1


6  

I wouldn't be surprised if there is an established algorithm for doing this, and if so, you should use it. But I don't know of one, so I'm going to speculate a little.

如果有现成的算法,我不会感到惊讶,如果有,你应该使用它。但是我不知道,所以我要做一些推测。

If I had to do it, the first thing I would try would be just to loop through all possible combinations of all the numbers and see how long it takes. If your data set is small enough, it's not worth the time to invent a clever algorithm. To demonstrate the setup, I'll include the sample code:

如果我必须这样做的话,我首先要做的就是循环遍历所有可能的数字组合,看看需要多长时间。如果你的数据集足够小,那就不值得花时间去发明一个聪明的算法。为了演示设置,我将包含示例代码:

# setup
def distance(nplet):
    '''Takes a pair or triplet (an "n-plet") as a list, and returns its distance.
    A smaller return value means better agreement.'''
    # your choice of implementation here. Example:
    return variance(nplet)

# algorithm
def brute_force(*lists):
    return min(itertools.product(*lists), key = distance)

For a large data set, I would try something like this: first create one triplet for each number in the first list, with its first entry set to that number. Then go through this list of partially-filled triplets and for each one, pick the number from the second list that is closest to the number from the first list and set that as the second member of the triplet. Then go through the list of triplets and for each one, pick the number from the third list that is closest to the first two numbers (as measured by your agreement metric). Finally, take the best of the bunch. This sample code demonstrates how you could try to keep the runtime linear in the length of the lists.

对于大型数据集,我将尝试这样的操作:首先为第一个列表中的每个数字创建一个三元组,第一个条目设置为该数字。然后浏览这个部分填充的三胞胎列表,对于每个三胞胎,从第二个列表中选择与第一个列表中最接近的数字,并将其设置为三胞胎的第二个成员。然后遍历三胞胎的列表,并对每一个,从最接近前两个数字的第三个列表中选出最接近前两个数字的数字(根据你的协议度量)。最后,要充分利用这群人。这个示例代码演示了如何在列表的长度中保持运行时的线性。

def item_selection(listA, listB, listC):
    # make the list of partially-filled triplets
    triplets = [[a] for a in listA]
    iT = 0
    iB = 0
    while iT < len(triplets):
        # make iB the index of a value in listB closes to triplets[iT][0]
        while iB < len(listB) and listB[iB] < triplets[iT][0]:
            iB += 1
        if iB == 0:
            triplets[iT].append(listB[0])
        elif iB == len(listB)
            triplets[iT].append(listB[-1])
        else:
            # look at the values in listB just below and just above triplets[iT][0]
            # and add the closer one as the second member of the triplet
            dist_lower = distance([triplets[iT][0], listB[iB]])
            dist_upper = distance([triplets[iT][0], listB[iB + 1]])
            if dist_lower < dist_upper:
                triplets[iT].append(listB[iB])
            elif dist_lower > dist_upper:
                triplets[iT].append(listB[iB + 1])
            else:
                # if they are equidistant, add both
                triplets[iT].append(listB[iB])
                iT += 1
                triplets[iT:iT] = [triplets[iT-1][0], listB[iB + 1]]
        iT += 1
    # then another loop while iT < len(triplets) to add in the numbers from listC
    return min(triplets, key = distance)

The thing is, I can imagine situations where this wouldn't actually find the best triplet, for instance if a number from the first list is close to one from the second list but not at all close to anything in the third list. So something you could try is to run this algorithm for all 6 possible orderings of the lists. I can't think of a specific situation where that would fail to find the best triplet, but one might still exist. In any case the algorithm will still be O(N) if you use a clever implementation, assuming the lists are sorted.

问题是,我可以想象这样的情况,如果第一个列表中的一个数字与第二个列表中的一个数字接近,但与第三个列表中的任何一个数字都不接近的话,它实际上不会找到最好的三元组。所以你可以试着运行这个算法对列表中所有6种可能的排序。我想不出一个具体的情况,那就是找不到最好的三连音,但可能仍然存在。在任何情况下,如果使用一个聪明的实现,假设列表是排序的,那么算法仍然是O(N)。

def symmetrized_item_selection(listA, listB, listC):
    best_results = []
    for ordering in itertools.permutations([listA, listB, listC]):
        best_results.extend(item_selection(*ordering))
    return min(best_results, key = distance)

Another option might be to compute all possible pairs of numbers between list 1 and list 2, between list 1 and list 3, and between list 2 and list 3. Then sort all three lists of pairs together, from best to worst agreement between the two numbers. Starting with the closest pair, go through the list pair by pair and any time you encounter a pair which shares a number with one you've already seen, merge them into a triplet. For a suitable measure of agreement, once you find your first triplet, that will give you a maximum pair distance that you need to iterate up to, and once you get up to it, you just choose the closest triplet of the ones you've found. I think that should consistently find the best possible triplet, but it will be O(N^2 log N) because of the requirement for sorting the lists of pairs.

另一个选项可能是计算列表1和列表2之间、列表1和列表3之间、列表2和列表3之间的所有可能的数字对。然后把这三对组合在一起,从最好的到最差的两个数字之间的一致。从最接近的一对开始,逐一检查列表对,当你遇到与你已经见过的数字相同的一对时,将它们合并成一个三元组。为了得到一个合适的一致性度量,一旦找到了第一个三元组,就会得到需要迭代的最大一对距离,一旦找到了,就只选择最近的三元组。我认为应该坚持找到最好的三重态,但这将是O(N ^ 2 O(log N)),因为要求排序的列表对。

def pair_sorting(listA, listB, listC):
    # make all possible pairs of values from two lists
    # each pair has the structure ((number, origin_list),(number, origin_list))
    # so we know which lists the numbers came from
    all_pairs = []
    all_pairs += [((nA,0), (nB,1)) for (nA,nB) in itertools.product(listA,listB)]
    all_pairs += [((nA,0), (nC,2)) for (nA,nC) in itertools.product(listA,listC)]
    all_pairs += [((nB,1), (nC,2)) for (nB,nC) in itertools.product(listB,listC)]
    all_pairs.sort(key = lambda p: distance(p[0][0], p[1][0]))
    # make a dict to track which (number, origin_list)s we've already seen
    pairs_by_number_and_list = collections.defaultdict(list)
    min_distance = INFINITY
    min_triplet = None
    # start with the closest pair
    for pair in all_pairs:
        # for the first value of the current pair, see if we've seen that particular
        # (number, origin_list) combination before
        for pair2 in pairs_by_number_and_list[pair[0]]:
            # if so, that means the current pair shares its first value with
            # another pair, so put the 3 unique values together to make a triplet
            this_triplet = (pair[1][0], pair2[0][0], pair2[1][0])
            # check if the triplet agrees more than the previous best triplet
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # do the same thing but checking the second element of the current pair
        for pair2 in pairs_by_number_and_list[pair[1]]:
            this_triplet = (pair[0][0], pair2[0][0], pair2[1][0])
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # finally, add the current pair to the list of pairs we've seen
        pairs_by_number_and_list[pair[0]].append(pair)
        pairs_by_number_and_list[pair[1]].append(pair)
    return min_triplet

N.B. I've written all the code samples in this answer out a little more explicitly than you'd do it in practice to help you to understand how they work. But when doing it for real, you'd use more list comprehensions and such things.

N.B.我在这个答案中编写了所有的代码示例,比您在实践中为帮助您理解它们的工作方式而编写的代码示例更明确一些。但是当你真正做的时候,你会用到更多的列表理解之类的东西。

N.B.2. No guarantees that the code works :-P but it should get the rough idea across.

N.B.2。不能保证代码可以工作:-P,但是它应该能让大致的想法得到理解。

#1


6  

I wouldn't be surprised if there is an established algorithm for doing this, and if so, you should use it. But I don't know of one, so I'm going to speculate a little.

如果有现成的算法,我不会感到惊讶,如果有,你应该使用它。但是我不知道,所以我要做一些推测。

If I had to do it, the first thing I would try would be just to loop through all possible combinations of all the numbers and see how long it takes. If your data set is small enough, it's not worth the time to invent a clever algorithm. To demonstrate the setup, I'll include the sample code:

如果我必须这样做的话,我首先要做的就是循环遍历所有可能的数字组合,看看需要多长时间。如果你的数据集足够小,那就不值得花时间去发明一个聪明的算法。为了演示设置,我将包含示例代码:

# setup
def distance(nplet):
    '''Takes a pair or triplet (an "n-plet") as a list, and returns its distance.
    A smaller return value means better agreement.'''
    # your choice of implementation here. Example:
    return variance(nplet)

# algorithm
def brute_force(*lists):
    return min(itertools.product(*lists), key = distance)

For a large data set, I would try something like this: first create one triplet for each number in the first list, with its first entry set to that number. Then go through this list of partially-filled triplets and for each one, pick the number from the second list that is closest to the number from the first list and set that as the second member of the triplet. Then go through the list of triplets and for each one, pick the number from the third list that is closest to the first two numbers (as measured by your agreement metric). Finally, take the best of the bunch. This sample code demonstrates how you could try to keep the runtime linear in the length of the lists.

对于大型数据集,我将尝试这样的操作:首先为第一个列表中的每个数字创建一个三元组,第一个条目设置为该数字。然后浏览这个部分填充的三胞胎列表,对于每个三胞胎,从第二个列表中选择与第一个列表中最接近的数字,并将其设置为三胞胎的第二个成员。然后遍历三胞胎的列表,并对每一个,从最接近前两个数字的第三个列表中选出最接近前两个数字的数字(根据你的协议度量)。最后,要充分利用这群人。这个示例代码演示了如何在列表的长度中保持运行时的线性。

def item_selection(listA, listB, listC):
    # make the list of partially-filled triplets
    triplets = [[a] for a in listA]
    iT = 0
    iB = 0
    while iT < len(triplets):
        # make iB the index of a value in listB closes to triplets[iT][0]
        while iB < len(listB) and listB[iB] < triplets[iT][0]:
            iB += 1
        if iB == 0:
            triplets[iT].append(listB[0])
        elif iB == len(listB)
            triplets[iT].append(listB[-1])
        else:
            # look at the values in listB just below and just above triplets[iT][0]
            # and add the closer one as the second member of the triplet
            dist_lower = distance([triplets[iT][0], listB[iB]])
            dist_upper = distance([triplets[iT][0], listB[iB + 1]])
            if dist_lower < dist_upper:
                triplets[iT].append(listB[iB])
            elif dist_lower > dist_upper:
                triplets[iT].append(listB[iB + 1])
            else:
                # if they are equidistant, add both
                triplets[iT].append(listB[iB])
                iT += 1
                triplets[iT:iT] = [triplets[iT-1][0], listB[iB + 1]]
        iT += 1
    # then another loop while iT < len(triplets) to add in the numbers from listC
    return min(triplets, key = distance)

The thing is, I can imagine situations where this wouldn't actually find the best triplet, for instance if a number from the first list is close to one from the second list but not at all close to anything in the third list. So something you could try is to run this algorithm for all 6 possible orderings of the lists. I can't think of a specific situation where that would fail to find the best triplet, but one might still exist. In any case the algorithm will still be O(N) if you use a clever implementation, assuming the lists are sorted.

问题是,我可以想象这样的情况,如果第一个列表中的一个数字与第二个列表中的一个数字接近,但与第三个列表中的任何一个数字都不接近的话,它实际上不会找到最好的三元组。所以你可以试着运行这个算法对列表中所有6种可能的排序。我想不出一个具体的情况,那就是找不到最好的三连音,但可能仍然存在。在任何情况下,如果使用一个聪明的实现,假设列表是排序的,那么算法仍然是O(N)。

def symmetrized_item_selection(listA, listB, listC):
    best_results = []
    for ordering in itertools.permutations([listA, listB, listC]):
        best_results.extend(item_selection(*ordering))
    return min(best_results, key = distance)

Another option might be to compute all possible pairs of numbers between list 1 and list 2, between list 1 and list 3, and between list 2 and list 3. Then sort all three lists of pairs together, from best to worst agreement between the two numbers. Starting with the closest pair, go through the list pair by pair and any time you encounter a pair which shares a number with one you've already seen, merge them into a triplet. For a suitable measure of agreement, once you find your first triplet, that will give you a maximum pair distance that you need to iterate up to, and once you get up to it, you just choose the closest triplet of the ones you've found. I think that should consistently find the best possible triplet, but it will be O(N^2 log N) because of the requirement for sorting the lists of pairs.

另一个选项可能是计算列表1和列表2之间、列表1和列表3之间、列表2和列表3之间的所有可能的数字对。然后把这三对组合在一起,从最好的到最差的两个数字之间的一致。从最接近的一对开始,逐一检查列表对,当你遇到与你已经见过的数字相同的一对时,将它们合并成一个三元组。为了得到一个合适的一致性度量,一旦找到了第一个三元组,就会得到需要迭代的最大一对距离,一旦找到了,就只选择最近的三元组。我认为应该坚持找到最好的三重态,但这将是O(N ^ 2 O(log N)),因为要求排序的列表对。

def pair_sorting(listA, listB, listC):
    # make all possible pairs of values from two lists
    # each pair has the structure ((number, origin_list),(number, origin_list))
    # so we know which lists the numbers came from
    all_pairs = []
    all_pairs += [((nA,0), (nB,1)) for (nA,nB) in itertools.product(listA,listB)]
    all_pairs += [((nA,0), (nC,2)) for (nA,nC) in itertools.product(listA,listC)]
    all_pairs += [((nB,1), (nC,2)) for (nB,nC) in itertools.product(listB,listC)]
    all_pairs.sort(key = lambda p: distance(p[0][0], p[1][0]))
    # make a dict to track which (number, origin_list)s we've already seen
    pairs_by_number_and_list = collections.defaultdict(list)
    min_distance = INFINITY
    min_triplet = None
    # start with the closest pair
    for pair in all_pairs:
        # for the first value of the current pair, see if we've seen that particular
        # (number, origin_list) combination before
        for pair2 in pairs_by_number_and_list[pair[0]]:
            # if so, that means the current pair shares its first value with
            # another pair, so put the 3 unique values together to make a triplet
            this_triplet = (pair[1][0], pair2[0][0], pair2[1][0])
            # check if the triplet agrees more than the previous best triplet
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # do the same thing but checking the second element of the current pair
        for pair2 in pairs_by_number_and_list[pair[1]]:
            this_triplet = (pair[0][0], pair2[0][0], pair2[1][0])
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # finally, add the current pair to the list of pairs we've seen
        pairs_by_number_and_list[pair[0]].append(pair)
        pairs_by_number_and_list[pair[1]].append(pair)
    return min_triplet

N.B. I've written all the code samples in this answer out a little more explicitly than you'd do it in practice to help you to understand how they work. But when doing it for real, you'd use more list comprehensions and such things.

N.B.我在这个答案中编写了所有的代码示例,比您在实践中为帮助您理解它们的工作方式而编写的代码示例更明确一些。但是当你真正做的时候,你会用到更多的列表理解之类的东西。

N.B.2. No guarantees that the code works :-P but it should get the rough idea across.

N.B.2。不能保证代码可以工作:-P,但是它应该能让大致的想法得到理解。