Pandas: combining columns without duplicates / finding unique words

Date: 2022-04-01 22:55:34

I have a dataframe where I would like to concatenate certain columns.

My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.

For example, if I had a data frame such as:

pd.read_csv("animal.csv")

  animal1         animal2        label  
1 cat dog         dolphin        19
2 dog cat         cat            72
3 pilchard 26     koala          26
4 newt bat 81     bat            81

I want to combine the columns but retain only unique information from each of the strings.

You can see that in row 2, 'cat' is contained in both columns 'animal1' and 'animal2'. In row 3, the number 26 is in both column 'animal1' and 'label', whereas in row 4, the information in columns 'animal2' and 'label' is already contained, in order, in 'animal1'.

I combine the columns by doing the following:

animals["detail"] = animals["animal1"].map(str) + ' ' + animals["animal2"].map(str) + ' ' + animals["label"].map(str)

  animal1         animal2        label        detail  
1 cat dog         dolphin        19           cat dog dolphin 19
2 dog cat         cat            72           dog cat cat 72
3 pilchard 26     koala          26           pilchard 26 koala 26
4 newt bat 81     bat            81           newt bat 81 bat 81

Row 1 is fine, but the other rows, of course, contain duplicates as described above.

The output I would like is:

  animal1         animal2        label        detail  
1 cat dog         dolphin        19           cat dog dolphin 19
2 dog cat         cat            72           dog cat 72
3 pilchard 26     koala          26           pilchard koala 26
4 newt bat 81     bat            81           newt bat 81

Alternatively, if I could retain only the first instance of each word or number per row in the detail column, that would also be suitable, i.e.:

  detail 
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81

I've looked at how to do this for a single string in Python (e.g. "How can I remove duplicate words in a string with Python?", "How to get all the unique words in the data frame?" and "show distinct column values in pyspark dataframe: python"), but I can't figure out how to apply it to individual rows within the detail column. I've also tried splitting the text after combining the columns and then using apply with a lambda, but haven't got this to work yet. Is there perhaps a way to do it while combining the columns?

I have the solution in R but want to recode it in Python.

I would greatly appreciate any help or advice. I'm currently using Spyder (Python 3.5).

3 answers

#1


4  

You can add a custom function that first splits on whitespace, then gets the unique values with pandas.unique, and finally joins them back into a string:

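# parentheses allow the expression to continue across lines
animals["detail"] = (animals["animal1"].map(str) + ' ' +
                     animals["animal2"].map(str) + ' ' +
                     animals["label"].map(str))

# split on whitespace, keep the first occurrence of each word, then re-join
animals["detail"] = animals["detail"].apply(lambda x: ' '.join(pd.unique(x.split())))
print(animals)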
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81
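
Note that row 3 above comes out as "pilchard 26 koala", while the first table in the question asks for "pilchard koala 26". One way to read that table is that the last occurrence of each word is kept rather than the first. The following is a minimal sketch under that assumption; the helper name dedupe_keep_last is made up for illustration and is not part of pandas.

```python
def dedupe_keep_last(text):
    """Drop repeated whitespace-separated words, keeping each word's LAST occurrence."""
    seen = set()
    kept = []
    # Walk right-to-left so the first time a word is met is its last occurrence.
    for word in reversed(text.split()):
        if word not in seen:
            seen.add(word)
            kept.append(word)
    # kept was built right-to-left, so reverse it to restore reading order.
    return " ".join(reversed(kept))


rows = [
    "cat dog dolphin 19",
    "dog cat cat 72",
    "pilchard 26 koala 26",
    "newt bat 81 bat 81",
]
for row in rows:
    print(dedupe_keep_last(row))
```

On a DataFrame this could be applied to the combined column with animals["detail"].apply(dedupe_keep_last).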

It is also possible to join the values inside apply:

animals["detail"] = (animals.astype(str)
                            .apply(lambda x: ' '.join(pd.unique(' '.join(x).split())), axis=1))
print(animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81

A solution with set, but it does not preserve the word order:

animals["detail"] = (animals.astype(str)
                            .apply(lambda x: ' '.join(set(' '.join(x).split())), axis=1))
print(animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dolphin 19 dog
2      dog cat      cat     72          cat dog 72
3  pilchard 26    koala     26   26 pilchard koala
4  newt bat 81      bat     81         bat 81 newt

#2


2  

If you want to keep the order in which the words appear, you can first split the words in each column, merge them, remove duplicates and finally join them together into a new column.

df['detail'] = (df.astype(str).T.apply(lambda x: x.str.split())
                  .apply(lambda x: ' '.join(pd.Series(sum(x, [])).drop_duplicates())))

df
Out[46]: 
         animal1   animal2   label                 detail
0      1 cat dog   dolphin       19  1 cat dog dolphin 19
1      2 dog cat       cat       72          2 dog cat 72
2  3 pilchard 26     koala       26   3 pilchard 26 koala
3  4 newt bat 81       bat       81         4 newt bat 81
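
The split, merge and de-duplicate steps described in this answer can also be written as a plain function that works on the cell values of one row, which may be easier to read and test. This is only a sketch: combine_unique is a hypothetical helper name, and it assumes that words are separated by whitespace and that non-string cells (such as the numeric label) should be compared by their string form.

```python
from itertools import chain


def combine_unique(*cells):
    """Join the words of several cells, keeping only the first occurrence of each word."""
    # str() lets numeric cells take part; split() breaks each cell on whitespace.
    words = chain.from_iterable(str(cell).split() for cell in cells)
    seen = set()
    kept = []
    for word in words:
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


print(combine_unique("dog cat", "cat", 72))
print(combine_unique("pilchard 26", "koala", 26))
```

With a DataFrame, it could be called once per row, for example df[["animal1", "animal2", "label"]].apply(lambda row: combine_unique(*row), axis=1), so the de-duplication happens while the columns are being combined.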

#3


1  

I'd suggest removing the duplicates at the end of the process by using a Python set.

Here is an example function to do so:

def dedup(value):
    words = set(value.split(' '))
    return ' '.join(words)

That works like this:

val = 'dog cat cat 81'
print(dedup(val))

81 dog cat

In case you want the words kept in order, you can use OrderedDict from collections or pd.unique instead of set.

Then just apply it (similar to map) to your detail column for the desired result:

animals.detail = animals.detail.apply(dedup)
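
The ordered variant mentioned in this answer might look like the sketch below. The function name dedup_ordered is made up for illustration. It uses OrderedDict from collections because the question mentions Python 3.5; on Python 3.7 and later a plain dict.fromkeys keeps insertion order as well.

```python
from collections import OrderedDict


def dedup_ordered(value):
    """Remove repeated words while keeping the order in which they first appear."""
    # OrderedDict.fromkeys keeps one key per distinct word, in first-seen order.
    words = OrderedDict.fromkeys(value.split())
    return " ".join(words)


val = "dog cat cat 81"
print(dedup_ordered(val))  # dog cat 81
```

It can be applied in the same way as dedup, i.e. animals.detail = animals.detail.apply(dedup_ordered).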
