
Time: 2022-10-07 04:27:48

Essentially, I am trying to join Table_A to Table_B on a key, so that I can look up attribute columns in Table_B for the names present in Table_A.


Table_B can be thought of as the master name table that stores various attributes about a name. Table_A represents incoming data with information about a name.

Two columns represent a name: 'raw_name' and 'real_name'. The 'raw_name' is the real_name prefixed with a code and an underscore ("code_").


raw_name = CE993_VincentHanna

real_name = VincentHanna

Key = real_name, which exists in Table_A and Table_B
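If only raw_name were available, the key could be derived from it. The following is a minimal sketch, assuming the code part never contains an underscore (true for the examples in this post, but an assumption in general); the variable names are illustrative only.

```python
import pandas as pd

raw = pd.Series(['CE993_VincentHanna', 'AW103_Waingro'], name='raw_name')

# Split once on the first underscore: the part before it is the code,
# the part after it is the real name.
parts = raw.str.split('_', n=1, expand=True)

names = pd.DataFrame({'raw_name': raw, 'real_name': parts[1]})
print(names)
```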

Please see the MySQL tables and query here: http://sqlfiddle.com/#!9/65e13/1


For all real_names in Table_A that do NOT exist in Table_B, I want to store the raw_name/real_name pairs in an object so I can send an alert to the data-entry staff for manual insertion.

For all real_names in Table_A that DO exist in Table_B, we already know the name, so the new raw_name associated with that real_name can be added to our master Table_B.


In MySQL this is easy to do, as you can see in my sqlfiddle example. I join on real_name and collapse the result with GROUP BY a.real_name, since I don't care whether Table_B has multiple records for the same real_name.

All I want is to pull the attributes (stats1, stats2, stats3) so I can assign them to the newly discovered raw_name.


In the MySQL query result I can then separate the NULL records, which are sent for manual data entry, and automatically insert the remaining records into Table_B.


Now I am trying to do the same in pandas, but I am stuck at the groupby on real_name.


import pandas as pd

# Table_A: incoming data
e = {'raw_name': pd.Series(['AW103_Waingro', 'CE993_VincentHanna', 'EES43_NeilMcCauley', 'SME16_ChrisShiherlis',
                          'MEC14_MichaelCheritto', 'OTP23_RogerVanZant', 'MDU232_AlanMarciano']),
     'real_name': pd.Series(['Waingro', 'VincentHanna', 'NeilMcCauley', 'ChrisShiherlis', 'MichaelCheritto', 
                           'RogerVanZant', 'AlanMarciano'])}

# Table_B: master name table.
# NOTE: each list below is cut off in the source after its ninth entry; the
# brackets are closed here so the code runs. The output further down shows two
# Table_B rows for RogerVanZant, so one trailing entry per list is missing.
f = {'raw_name': pd.Series(['SME893_VincentHanna', 'TVA405_VincentHanna', 'MET783_NeilMcCauley', 
                            'CE321_NeilMcCauley', 'CIN453_NeilMcCauley', 'NIPS16_ChrisShiherlis',
                            'ALTW12_MichaelCheritto', 'NSP42_MichaelCheritto', 'CONS23_RogerVanZant']),
     'real_name': pd.Series(['VincentHanna', 'VincentHanna', 'NeilMcCauley', 'NeilMcCauley', 'NeilMcCauley',
                             'ChrisShiherlis', 'MichaelCheritto', 'MichaelCheritto', 'RogerVanZant']),
     'stats1': pd.Series(['meh1', 'meh1', 'yo1', 'yo1', 'yo1', 'hello1', 'bye1', 'bye1', 'namaste1']),
     'stats2': pd.Series(['meh2', 'meh2', 'yo2', 'yo2', 'yo2', 'hello2', 'bye2', 'bye2', 'namaste2']),
     'stats3': pd.Series(['meh3', 'meh3', 'yo3', 'yo3', 'yo3', 'hello3', 'bye3', 'bye3', 'namaste3'])}

df_e = pd.DataFrame(e)
df_f = pd.DataFrame(f)

# Left join keeps every Table_A row; unmatched names get NaN in the stats columns
df_new = pd.merge(df_e, df_f, how='left', on='real_name', suffixes=('_left', '_right'))

df_new_grouped = df_new.groupby('raw_name_left')

Now, how do I collapse the groups in df_new_grouped on real_name, like I did in MySQL?


Once I have an object with the collapsed results, I can slice the dataframe into the real_names we have no record of (NULL values) and those we already know, for which the newly discovered raw_name can be stored.
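The slicing step described above might look like the sketch below. It assumes a collapsed frame with one row per incoming raw_name, and it assumes that a name known to Table_B never has all three stats columns empty. The frame `collapsed` and the names `to_review` and `to_insert` are hypothetical.

```python
import pandas as pd

# Hypothetical collapsed result: one row per incoming raw_name.
collapsed = pd.DataFrame({
    'raw_name_left': ['AW103_Waingro', 'CE993_VincentHanna'],
    'real_name': ['Waingro', 'VincentHanna'],
    'stats1': [None, 'meh1'],
    'stats2': [None, 'meh2'],
    'stats3': [None, 'meh3'],
})

stat_cols = ['stats1', 'stats2', 'stats3']

# Rows where every stats column is missing had no match in Table_B.
is_unknown = collapsed[stat_cols].isna().all(axis=1)

# Pairs to send to the data-entry staff.
to_review = collapsed.loc[is_unknown, ['raw_name_left', 'real_name']]

# Rows whose attributes were found and can be inserted into Table_B.
to_insert = collapsed.loc[~is_unknown]
```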


2 Solutions


You can drop duplicates based on the raw_name_left column, and remove the raw_name_right column using drop.


In [99]: df_new.drop_duplicates('raw_name_left').drop(columns='raw_name_right')
            raw_name_left        real_name    stats1    stats2    stats3
0           AW103_Waingro          Waingro       NaN       NaN       NaN
1      CE993_VincentHanna     VincentHanna      meh1      meh2      meh3
3      EES43_NeilMcCauley     NeilMcCauley       yo1       yo2       yo3
6    SME16_ChrisShiherlis   ChrisShiherlis    hello1    hello2    hello3
7   MEC14_MichaelCheritto  MichaelCheritto      bye1      bye2      bye3
9      OTP23_RogerVanZant     RogerVanZant  namaste1  namaste2  namaste3
11    MDU232_AlanMarciano     AlanMarciano       NaN       NaN       NaN
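An alternative, not taken from the answer itself, is to collapse Table_B to one row per real_name before merging, so the join never fans out. This sketch assumes the stats values are identical across duplicates of a real_name, as in the sample data; `lookup` and `result` are illustrative names, and the frames are cut down to two names for brevity.

```python
import pandas as pd

df_e = pd.DataFrame({
    'raw_name': ['AW103_Waingro', 'CE993_VincentHanna'],
    'real_name': ['Waingro', 'VincentHanna'],
})
df_f = pd.DataFrame({
    'raw_name': ['SME893_VincentHanna', 'TVA405_VincentHanna'],
    'real_name': ['VincentHanna', 'VincentHanna'],
    'stats1': ['meh1', 'meh1'],
    'stats2': ['meh2', 'meh2'],
    'stats3': ['meh3', 'meh3'],
})

# One attribute row per real_name; the master raw_name is not needed here.
lookup = df_f.drop(columns='raw_name').drop_duplicates('real_name')

# validate='many_to_one' raises MergeError if lookup still has duplicate keys;
# indicator=True adds a '_merge' column marking unmatched rows as 'left_only'.
result = df_e.merge(lookup, how='left', on='real_name',
                    validate='many_to_one', indicator=True)
print(result)
```

The `_merge` column gives a way to separate unmatched names that does not rely on the stats columns being NaN.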


Just to be thorough: this can also be done using groupby, an approach I found on Wes McKinney's blog, although drop_duplicates is cleaner and more efficient.


# Take the first row label of each group, then select those rows
index = [gp_keys[0] for gp_keys in df_new_grouped.groups.values()]
unique_df = df_new.reindex(index)

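A shorter groupby variant, sketched here on a cut-down frame, is `head(1)`, which keeps the first row of each group along with its original index and row order. It is used instead of `first()` because `first()` returns the first non-null value per column rather than the first row.

```python
import pandas as pd

# Cut-down stand-in for the merged frame df_new.
df_new = pd.DataFrame({
    'raw_name_left': ['AW103_Waingro', 'CE993_VincentHanna', 'CE993_VincentHanna'],
    'real_name': ['Waingro', 'VincentHanna', 'VincentHanna'],
    'stats1': [None, 'meh1', 'meh1'],
})

# Keep only the first row for each incoming raw_name.
unique_df = df_new.groupby('raw_name_left', sort=False).head(1)
print(unique_df)
```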
