I have a dataframe, data, with lengthy and inconsistent free-text notes and matching IDs. My goal is to extract the sub-strings of interest using the list of sub-strings below and to create a new column holding the extracted sub-strings. I was told that regex is a good place to start, but I have yet to come up with a pattern that produces the matching result. I'm hoping someone sees this and points me in the right direction.
list = ['sentara williamsburg regional medical',
'shady grove adventist hospital',
'sibley memorial hospital',
'southern maryland hospital center',
'st. mary`s hospital',
'suburban hospital healthcare system',
'the cancer center at lake manassas',
'ucla medical center',
'united medical center- greater southeast community',
'univ of md charles regional medical ctr',
'university of maryland medical center',
'university of north carolina hospital',
'university of virginia health system',
'unknown facility',
'va medical center',
'virginia hospital center-arlington',
'walter reed army medical center',
'washington adventist hospital',
'washington hospital center',
'wellstar health system, inc',
'winchester medical center']
data:
ID Notes
530.0 Cancer is best diag @Wwashington Adventist Hospital
651.0 nan
692.0 GMC-009 can be accessed at ST. Mary`s but not in UCLA Med. Center
993.0 I'm not sure of Sibley; however, Shady Grove Adventist Hosp. is great hospital
044.0 nan
055.0 2015-01-20 was the day she visited WR Army Medical Center in WDC
476.0 nan
Expected output - case really does not matter!
data_out:
ID Notes
530.0 Washington Adventist Hospital
651.0 nan
692.0 ST. Mary`s Hospital, UCLA Medical Center
993.0 Sibley Memorial Hospital, Shady Grove Adventist Hospital
044.0 nan
055.0 Walter Reed Army Medical Center
476.0 nan
2 Answers
#1
0
Updated: This code runs through all the words of each note in the "Notes" column and compares them to the entries in "list". If a word from "Notes" also occurs in an entry of "list", that word is written to the new column "output". You will have to play around with regular expressions to get the desired result. Note: because the names in "list" can appear in the "Notes" column in a completely different form while meaning the same thing (abbreviations, spelling mistakes, case differences), it would be difficult to cover every case. It might therefore be useful to solve this problem with a "bag of words" approach...
import re

# Split each note into words; missing notes (NaN) become an empty list
newlist = [note.split() if isinstance(note, str) else [] for note in data["Notes"]]
# Create the new column "output" and default its values to those of "Notes"
data["output"] = data["Notes"]
# Compare every word of every note against every entry in "list"
for entry in list:
    for j, words in enumerate(newlist):
        for word in words:
            # re.escape treats the word literally; IGNORECASE because case does not matter
            if re.search(re.escape(word), entry, re.IGNORECASE):
                # Later matches overwrite earlier ones for the same row
                data.loc[data.index[j], "output"] = word
If there is a more vectorized approach, I would be grateful for comments.
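For the abbreviations and misspellings mentioned in the note above, one option is approximate string matching. What follows is only a rough sketch, not the bag-of-words approach itself: it swaps in difflib.SequenceMatcher from the standard library and slides a word window over each note. The helper name fuzzy_find, the shortened names list and the 0.85 cutoff are all assumptions made for illustration, and the cutoff would need tuning on real data.

```python
from difflib import SequenceMatcher

def fuzzy_find(note, names, cutoff=0.85):
    """Return the names whose best-matching word window in note scores >= cutoff."""
    if not isinstance(note, str):  # missing notes (NaN) yield no matches
        return []
    words = note.lower().split()
    hits = []
    for name in names:
        size = len(name.split())
        # Slide a window with as many words as the name over the note
        starts = range(max(len(words) - size + 1, 1))
        windows = (" ".join(words[i:i + size]) for i in starts)
        best = max(SequenceMatcher(None, w, name).ratio() for w in windows)
        if best >= cutoff:
            hits.append(name)
    return hits

# Hypothetical shortened list; the note is row 993.0 from the question
names = [
    "sibley memorial hospital",
    "shady grove adventist hospital",
    "washington adventist hospital",
]
note = "I'm not sure of Sibley; however, Shady Grove Adventist Hosp. is great hospital"
print(fuzzy_find(note, names))
```

This should pick up a lightly abbreviated name such as "Shady Grove Adventist Hosp.", but a lone "Sibley" or a heavy abbreviation such as "WR Army Medical Center" is too far from the full name to pass this cutoff, so those cases would still need an alias table or manual rules.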
#2
0
I'd do something like:
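import re
# Escape each entry so characters like "." are matched literally; ignore case
reg = re.compile('|'.join(map(re.escape, your_list)), re.IGNORECASE)
# findall scans the whole string; match() would only test the start of it
results = [reg.findall(note) if isinstance(note, str) else [] for note in your_data]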
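A joined pattern like this can also be applied to the whole "Notes" column at once with the pandas string methods. The following is a minimal sketch under some assumptions: the facilities list is a shortened, hypothetical stand-in for the list in the question, the sample frame holds only three of its rows, and the "output" column name is borrowed from the first answer.

```python
import re
import pandas as pd

# Hypothetical sample data: a shortened facility list and three notes from the question
facilities = [
    "washington hospital center",
    "washington adventist hospital",
    "ucla medical center",
]
data = pd.DataFrame({
    "ID": [530.0, 651.0, 692.0],
    "Notes": [
        "Cancer is best diag @Wwashington Adventist Hospital",
        None,  # missing note
        "GMC-009 can be accessed at ST. Mary`s but not in UCLA Med. Center",
    ],
})

# Longest names first, so a longer name wins over a shorter one that shares its start
ordered = sorted(facilities, key=len, reverse=True)
pattern = "|".join(re.escape(name) for name in ordered)

# str.findall scans each whole note; missing notes stay missing
found = data["Notes"].str.findall(pattern, flags=re.IGNORECASE)
# Join the hits of each row into one string; leave missing rows untouched
data["output"] = found.map(lambda hits: ", ".join(hits) if isinstance(hits, list) else hits)
print(data[["ID", "output"]])
```

Note that this only finds names that are written out in full. Abbreviated mentions such as "UCLA Med. Center" leave the row with an empty string, so they would need extra handling on top of the exact pattern.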