Edit distance between all columns of a pandas DataFrame

Time: 2020-12-20 15:21:17

I am interested in calculating the edit distances across all the columns of a given pandas DataFrame. Say we have a 3×5 DataFrame (3 rows, 5 columns): I want to output a column-by-column matrix of distance scores, something like this:

      col1  col2  col3  col4  col5
col1
col2
col3
col4
col5

I want each element of a column to be compared with every element of the other columns. So each cell, for example col1*col2, is the sum of all the scores from the nested loop over col1 and col2.

I would highly appreciate any help in this regard. Thanks in advance.


   INSPECTION_ID  STRUCTURE_ID  RELOCATE_FID  HECO_ID  HECO_ID_TAG_NOT_FOUND  \
0            100         95308           NaN    18/29                    0.0
1            101         95346           NaN   Nov-29                    0.0
2            102      50008606           NaN    25/29                    0.0
3            103         95310           NaN   Dec-29                    0.0
4            104         95286           NaN    17/29                    0.0

   OSMOSE_POLE_ID  ALTERNATE_ID  STREET_NBR  STREET_DIRECTIONAL  STREET_NAME  \
0             NaN           NaN        1888                 NaN    KAIKUNANE
1             NaN           NaN        1731                 NaN    MAKUAHINE
2             NaN           NaN        1862                 NaN    MAKUAHINE
3             NaN           NaN        1825                 NaN    KAIKUNANE
4             NaN           NaN        1816                 NaN    KAIKUNANE

The rest of the data looks similar; the full dataset has shape (191795, 58). My objective is to find the edit distance between each pair of columns in the dataset, so as to understand the patterns between them, if any.

For instance, I want INSPECTION_ID 100 to be checked against all the values of column STRUCTURE_ID, and so on. I understand that this needs an optimized iterator. Please point me in some direction to solve this problem. Thanks in advance.

1 Solution

#1


0  

A very naive solution (assuming you already have an edit distance function), but it might just work for small datasets:

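import pandas as pd

# Placeholder data; replace with your own dataset.
df = pd.DataFrame({"col1": ["abc", "abd"], "col2": ["abx", "xyz"]})

def edit_distance(s1, s2):
    # Simple Levenshtein distance; swap in your own function if you have one.
    s1, s2 = str(s1), str(s2)
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (c1 != c2)))
        previous = current
    return previous[-1]

# For each row, compare every item with every item in the same row.
df_distances = []
for i, row in df.iterrows():
    row_distances = []
    for item in row:
        for item2 in row:
            row_distances.append(edit_distance(item, item2))
    df_distances.append(row_distances)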

I haven't tested this solution, so there might be bugs, but the general principle should work. If you don't have an edit distance function, you can use this implementation: https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python or one of the many others freely available.
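For the column-by-column matrix the question describes (each cell is the sum of the edit distances over the nested loop of two columns), a rough sketch might look like the following. It is only one possible approach, not something from the original answer. The names `column_distance_matrix` and `sample` are made up for illustration, and the sketch assumes that missing values should be skipped and that every value can be compared as a string. To cut down the work, it counts each distinct value once per column and weights each distance by how often the two values occur, which gives the same sum as the full nested loop.

```python
from collections import Counter
from itertools import combinations_with_replacement

import pandas as pd


def edit_distance(s1, s2):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def column_distance_matrix(df):
    """Hypothetical helper: column-by-column sums of pairwise edit distances."""
    # Assumption: NaN/None values are dropped and everything else is compared as str.
    counts = {col: Counter(df[col].dropna().astype(str)) for col in df.columns}
    result = pd.DataFrame(0, index=df.columns, columns=df.columns, dtype="int64")
    for a, b in combinations_with_replacement(df.columns, 2):
        # Each distinct pair (x, y) occurs nx * ny times in the full nested loop.
        total = sum(
            edit_distance(x, y) * nx * ny
            for x, nx in counts[a].items()
            for y, ny in counts[b].items()
        )
        result.loc[a, b] = total
        result.loc[b, a] = total  # edit distance is symmetric
    return result


# Made-up sample data, only to show the call.
sample = pd.DataFrame({
    "col1": ["abc", "abd"],
    "col2": ["abc", "xyz"],
    "col3": ["a", None],
})
print(column_distance_matrix(sample))
```

This only helps when columns contain many repeated values. For columns of mostly unique values on a (191795, 58) dataset, the number of comparisons is still very large, so sampling rows or using a compiled edit distance library would probably be needed.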
