I have a dataframe where I would like to concatenate certain columns.
My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.
For example, if I had a data frame such as:
animals = pd.read_csv("animal.csv")
   animal1      animal2  label
1  cat dog      dolphin  19
2  dog cat      cat      72
3  pilchard 26  koala    26
4  newt bat 81  bat      81
I want to combine the columns but retain only unique information from each of the strings.
You can see that in row 2, 'cat' appears in both 'animal1' and 'animal2'. In row 3, the number 26 appears in both 'animal1' and 'label'. In row 4, the information in 'animal2' and 'label' is already contained, in the same order, in 'animal1'.
I combine the columns by doing the following:
animals["detail"] = animals["animal1"].map(str) + animals["animal2"].map(str) + animals["label"].map(str)
   animal1      animal2  label  detail
1  cat dog      dolphin  19     cat dog dolphin 19
2  dog cat      cat      72     dog cat cat 72
3  pilchard 26  koala    26     pilchard 26 koala 26
4  newt bat 81  bat      81     newt bat 81 bat 81
Row 1 is fine, but the other rows, of course, contain duplicates as described above.
The output I would like is:
   animal1      animal2  label  detail
1  cat dog      dolphin  19     cat dog dolphin 19
2  dog cat      cat      72     dog cat 72
3  pilchard 26  koala    26     pilchard koala 26
4  newt bat 81  bat      81     newt bat 81
Alternatively, if I could retain only the first instance of each word/number per row in the detail column, that would also be suitable, i.e.:
   detail
1  cat dog dolphin 19
2  dog cat 72
3  pilchard koala 26
4  newt bat 81
I've looked at how to do this for a single string in Python (e.g. "How can I remove duplicate words in a string with Python?", "How to get all the unique words in the data frame?", "show distinct column values in pyspark dataframe: python"), but I can't figure out how to apply it to the individual rows of the detail column. I've tried splitting the text after combining the columns and then using apply with a lambda, but haven't got this to work yet. Or is there perhaps a way to do it while combining the columns?
I have the solution in R but want to recode it in Python.
I would greatly appreciate any help or advice. I'm currently using Spyder (Python 3.5).
3 solutions
#1
4
You can add a custom function that first splits the string on whitespace, then gets the unique values with pandas.unique, and finally joins them back into a string:
animals["detail"] = (animals["animal1"].map(str) + ' ' +
                     animals["animal2"].map(str) + ' ' +
                     animals["label"].map(str))
animals["detail"] = animals["detail"].apply(lambda x: ' '.join(pd.unique(x.split())))
print(animals)
   animal1      animal2  label  detail
1  cat dog      dolphin  19     cat dog dolphin 19
2  dog cat      cat      72     dog cat 72
3  pilchard 26  koala    26     pilchard 26 koala
4  newt bat 81  bat      81     newt bat 81
It is also possible to join the values inside apply:
animals["detail"] = (animals.astype(str)
                     .apply(lambda x: ' '.join(pd.unique(' '.join(x).split())), axis=1))
print(animals)
   animal1      animal2  label  detail
1  cat dog      dolphin  19     cat dog dolphin 19
2  dog cat      cat      72     dog cat 72
3  pilchard 26  koala    26     pilchard 26 koala
4  newt bat 81  bat      81     newt bat 81
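For anyone who wants to try the row-wise approach without the CSV file, here is a minimal self-contained sketch. It assumes the sample data is built inline, uses a hypothetical helper named combine_unique, and restricts the operation to an explicit list of columns (cols), which the answer above does not do. As far as I know, newer pandas releases warn when pd.unique is given a plain list, so the sketch wraps the words in a Series first.

```python
import pandas as pd

# Inline copy of the question's sample data (assumption: the original reads animal.csv).
animals = pd.DataFrame(
    {
        "animal1": ["cat dog", "dog cat", "pilchard 26", "newt bat 81"],
        "animal2": ["dolphin", "cat", "koala", "bat"],
        "label": [19, 72, 26, 81],
    },
    index=[1, 2, 3, 4],
)


def combine_unique(row):
    """Join the row's values, then keep the first occurrence of each word."""
    words = " ".join(row).split()
    # Series.unique() returns values in order of first appearance.
    return " ".join(pd.Series(words).unique())


cols = ["animal1", "animal2", "label"]  # hypothetical choice of columns to combine
animals["detail"] = animals[cols].astype(str).apply(combine_unique, axis=1)
print(animals["detail"].tolist())
```

Selecting the columns explicitly keeps an existing detail column, or any unrelated column, out of the combined string.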
A solution with set also works, but it changes the order of the words:
animals["detail"] = (animals.astype(str)
                     .apply(lambda x: ' '.join(set(' '.join(x).split())), axis=1))
print(animals)
   animal1      animal2  label  detail
1  cat dog      dolphin  19     cat dolphin 19 dog
2  dog cat      cat      72     cat dog 72
3  pilchard 26  koala    26     26 pilchard koala
4  newt bat 81  bat      81     bat 81 newt
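If a set is still preferred for the deduplication step, one way to get the first-appearance order back is to sort the unique words by the position where each first occurs. This is only a sketch with a hypothetical helper name (dedup_keep_order); it is quadratic in the number of words, which should be fine for short cells like these.

```python
def dedup_keep_order(text):
    """Remove repeated words while keeping the order of first appearance."""
    words = text.split()
    # set() drops the duplicates; list.index gives each word's first position,
    # so sorting on it restores the original order.
    return " ".join(sorted(set(words), key=words.index))


print(dedup_keep_order("pilchard 26 koala 26"))  # pilchard 26 koala
```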
#2
2
If you want to keep the order in which the words appear, you can first split the words in each column, merge them, remove the duplicates, and finally join them into a new column.
df['detail'] = (df.astype(str).T.apply(lambda x: x.str.split())
                .apply(lambda x: ' '.join(pd.Series(sum(x, [])).drop_duplicates())))
df
Out[46]:
      animal1      animal2  label  detail
0  1  cat dog      dolphin  19     1 cat dog dolphin 19
1  2  dog cat      cat      72     2 dog cat 72
2  3  pilchard 26  koala    26     3 pilchard 26 koala
3  4  newt bat 81  bat      81     4 newt bat 81
#3
1
I'd suggest removing the duplicates at the end of the process by using a Python set.
Here is an example function that does so:
def dedup(value):
    words = set(value.split(' '))
    return ' '.join(words)
It works like this (the word order of a set is not guaranteed):
val = 'dog cat cat 81'
print(dedup(val))
81 dog cat
If you want the details kept in order, you can use OrderedDict from collections, or pd.unique, instead of set.
Then just apply it (similar to map) to your detail column for the desired result:
animals.detail = animals.detail.apply(dedup)
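The ordered variant mentioned above could look roughly like this. It is a sketch rather than the answerer's own code: the name dedup_ordered is made up for illustration, and it uses OrderedDict.fromkeys because plain dict ordering is not guaranteed on Python 3.5.

```python
from collections import OrderedDict


def dedup_ordered(value):
    """Drop repeated words, keeping each word at its first position."""
    # OrderedDict.fromkeys keeps one key per distinct word, in first-seen order.
    return " ".join(OrderedDict.fromkeys(value.split()))


print(dedup_ordered("dog cat cat 81"))  # dog cat 81
```

It can be applied to the combined column in the same way as dedup, i.e. animals.detail.apply(dedup_ordered).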