Pandas DataFrame: splitting one column into multiple columns

Time: 2022-10-28 15:38:37

I have a pandas DataFrame that looks like this:

date     | location              | occurance
-------------------------------------------
somedate | united_kingdom_london | 5
somedate | united_state_newyork  | 5

I want to transform it into:

date     | country        | city    | occurance
-----------------------------------------------
somedate | united kingdom | london  | 5
somedate | united state   | newyork | 5

I am new to Python. After some research I wrote the following code, but it does not seem to extract the country and city:

df.location= df.location.replace({'-': ' '}, regex=True)
df.location= df.location.replace({'_': ' '}, regex=True)

temp_location = df['location'].str.split(' ').tolist() 

location_data = pd.DataFrame(temp_location, columns=['country', 'city'])

I appreciate your response.

6 solutions

#1


3  

Starting with this:

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

Try this:

df['Country'] = df['location'].str.rpartition('_')[0].str.replace("_", " ")
df['City']    = df['location'].str.rpartition('_')[2]
df[['Date','Country', 'City', 'occurence']]

      Date        Country      City  occurence
0  somedate  united kingdom   london          5
1  somedate    united state  newyork          5

Borrowing an idea from @MaxU:

df[['Country'," " , 'City']] = (df.location.str.replace('_',' ').str.rpartition(' ', expand= True ))
df[['Date','Country', 'City','occurence' ]]

      Date        Country      City  occurence
0  somedate  united kingdom   london          5
1  somedate    united state  newyork          5
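
For reference, here is a minimal sketch of what `str.rpartition` returns for the same sample data. It shows why columns `0` and `2` are the ones picked above. The variable names `parts`, `country` and `city` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

# rpartition splits at the LAST '_' and returns three columns:
# 0 = text before the separator, 1 = the separator itself, 2 = text after it
parts = df['location'].str.rpartition('_')
print(parts)

# Underscores still left in column 0 belong to multi-word countries
country = parts[0].str.replace('_', ' ', regex=False)
city = parts[2]
print(country.tolist(), city.tolist())
```

Because the split happens at the last separator, this only gives the intended result when the city is a single word.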

#2


1  

Another solution uses str.rsplit, which works nicely as long as the city contains no _ (is a single word):

import pandas as pd

df = pd.DataFrame({'date': {0: 'somedate', 1: 'somedate', 2: 'somedate'}, 
                   'location': {0: 'slovakia_bratislava', 
                                1: 'united_kingdom_london', 
                                2: 'united_state_newyork'}, 
                   'occurance': {0: 5, 1: 5, 2: 5}})
print(df)
       date               location  occurance
0  somedate    slovakia_bratislava          5
1  somedate  united_kingdom_london          5
2  somedate   united_state_newyork          5

df[['country','city']] = df.location.str.replace('_', ' ').str.rsplit(n=1, expand=True)
# change the column order and drop the location column
cols = df.columns.tolist()
df = df[cols[:1] + cols[3:5] + cols[2:3]]
print(df)
       date         country        city  occurance
0  somedate        slovakia  bratislava          5
1  somedate  united kingdom      london          5
2  somedate    united state     newyork          5
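
A possible variant of the same idea, sketched under the assumption that the separator is always `_`: split on the underscore directly with `rsplit`, then select the final columns by name rather than by position. The plain `occurance` column name is used here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'date': ['somedate', 'somedate', 'somedate'],
                   'location': ['slovakia_bratislava',
                                'united_kingdom_london',
                                'united_state_newyork'],
                   'occurance': [5, 5, 5]})

# Split once, starting from the right, on the underscore itself
df[['country', 'city']] = df['location'].str.rsplit('_', n=1, expand=True)
# Any underscore left over belongs to a multi-word country
df['country'] = df['country'].str.replace('_', ' ', regex=False)

# Selecting by name avoids depending on the column order
df = df[['date', 'country', 'city', 'occurance']]
print(df)
```

Selecting by name is a little longer than slicing `cols`, but it keeps working if columns are added or reordered upstream.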

#3


0  

Consider splitting the column's string value using rfind()

import pandas as pd

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

df['country'] = df['location'].apply(lambda x: x[0:x.rfind('_')])
df['city'] = df['location'].apply(lambda x: x[x.rfind('_')+1:])

df = df[['Date', 'country', 'city', 'occurence']]
print(df)

#        Date         country     city  occurence
# 0  somedate  united_kingdom   london          5
# 1  somedate    united_state  newyork          5
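
Note that the output above still has underscores in the country names. A minimal sketch of one way to handle that, and the case where `rfind()` finds no underscore and returns -1, is shown below. `split_location` is a hypothetical helper, and returning an empty city when there is no separator is an assumption, not something the question specifies.

```python
import pandas as pd

def split_location(value):
    """Hypothetical helper: split at the last '_' and return (country, city)."""
    idx = value.rfind('_')
    if idx == -1:
        # Assumption: with no separator, the whole value is the country
        return value, ''
    return value[:idx].replace('_', ' '), value[idx + 1:]

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

# Call rfind() once per row instead of twice
pairs = [split_location(v) for v in df['location']]
df['country'] = [country for country, _ in pairs]
df['city'] = [city for _, city in pairs]

df = df[['Date', 'country', 'city', 'occurence']]
print(df)
```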

#4


0  

Try this:

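# Assumes the underscores in df['location'] were already replaced with spaces
temp_location = {}
splits = df['location'].str.split(' ')
# .str[...] indexes into each row's list; a plain splits[0:-1] would slice rows instead
temp_location['country'] = splits.str[:-1].str.join(' ').tolist()
temp_location['city'] = splits.str[-1].tolist()

location_data = pd.DataFrame(temp_location)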

If you want it back in the original df:

df['country'] = splits.str[:-1].str.join(' ')
df['city'] = splits.str[-1]

#5


0  

Something like this works

import pandas as pd

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

# Reverse each string, replace the first "_" (the last one in the original), reverse back
df.location = df.location.str[::-1].str.replace("_", " ", n=1).str[::-1]
# The only space is now the boundary between country and city
newcols = pd.DataFrame(df.location.str.split(" ").tolist(),
                       columns=["country", "city"])
newcols.country = newcols.country.str.replace("_", " ")
df = pd.concat([df, newcols], axis=1)
df.drop("location", axis=1, inplace=True)
print(df)

         Date  occurence         country     city
  0  somedate          5  united kingdom   london
  1  somedate          5    united state  newyork

You could use a regex in the replace for a more complicated pattern, but if you only need the word after the last _, I find it easier to reverse the string twice as a hack than to fiddle with regular expressions.
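
For comparison, the regex route mentioned above might look like the following sketch. The pattern `_(?=[^_]*$)` is one way (not the only one) to match just the last underscore, and `marked` is an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['somedate', 'somedate'],
                   'location': ['united_kingdom_london', 'united_state_newyork'],
                   'occurence': [5, 5]})

# '_(?=[^_]*$)' matches an underscore with no further underscore after it,
# i.e. only the last one in each string
marked = df['location'].str.replace(r'_(?=[^_]*$)', ' ', regex=True)

# The single space is now the country/city boundary
newcols = marked.str.split(' ', expand=True)
newcols.columns = ['country', 'city']
newcols['country'] = newcols['country'].str.replace('_', ' ', regex=False)

df = pd.concat([df.drop(columns='location'), newcols], axis=1)
print(df)
```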

#6


0  

I would use the .str.extract() method:

In [107]: df
Out[107]:
       Date               location  occurence
0  somedate  united_kingdom_london          5
1  somedate   united_state_newyork          5
2  somedate         germany_munich          5

In [108]: df[['country','city']] = (df.location.str.replace('_',' ')
   .....:                             .str.extract(r'(.*)\s+([^\s]*)', expand=True))

In [109]: df
Out[109]:
       Date               location  occurence         country     city
0  somedate  united_kingdom_london          5  united kingdom   london
1  somedate   united_state_newyork          5    united state  newyork
2  somedate         germany_munich          5         germany   munich

In [110]: df = df.drop('location', axis=1)

In [111]: df
Out[111]:
       Date  occurence         country     city
0  somedate          5  united kingdom   london
1  somedate          5    united state  newyork
2  somedate          5         germany   munich

P.S. Note that it is not possible to distinguish reliably between rows containing a two-word country plus a one-word city and rows containing a one-word country plus a two-word city, unless you have a full list of countries to check against.
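
A rough sketch of that lookup idea follows. The `known_countries` list and the `split_by_known_country` helper are hypothetical and deliberately tiny. A real list would need to cover every country in the data, and the fallback for unknown countries is an assumption.

```python
import pandas as pd

# Hypothetical lookup list, for illustration only
known_countries = ['united kingdom', 'united state', 'germany', 'usa']

def split_by_known_country(location):
    """Return (country, city), preferring the longest known country prefix."""
    text = location.replace('_', ' ')
    for country in sorted(known_countries, key=len, reverse=True):
        if text.startswith(country + ' '):
            return country, text[len(country) + 1:]
    # Assumed fallback for unknown countries: split at the last space
    country, _, city = text.rpartition(' ')
    return country, city

df = pd.DataFrame({'location': ['united_kingdom_london',
                                'usa_new_york',
                                'germany_munich']})

pairs = [split_by_known_country(v) for v in df['location']]
df['country'] = [country for country, _ in pairs]
df['city'] = [city for _, city in pairs]
print(df)
```

With the list in place, `usa_new_york` resolves to country `usa` and city `new york`, which a split at the last separator cannot do.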
