Here is the requirement. Suppose we are given data like the following:
0-0-0 0:0:0 #### the 68th annual golden globe awards #### the king s speech earns 7 nominations #### <LOCATION>LOS ANGELES</LOCATION> <ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION> historical drama British king stammer beat competitors Tuesday grab seven nominations Golden Globe Awards nominations included best film drama nod contested award organizers said films competing best picture <ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION> earned nominations best performance actor olin <PERSON>Firth</PERSON> best performance actress <PERSON>Helena Bonham</PERSON> arter best supporting actor <PERSON>Geoffrey Rush</PERSON> best director <PERSON>Tom Hooper</PERSON> best screenplay <PERSON>David Seidler</PERSON> best movie score <ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION> earned nods apiece Black Swan Inception Kids Right tied place movie race nominations best motion picture comedy musical category <ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION> compete Nominated best actor motion picture olin <ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION> best actress motion picture nominees <PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON> Rabbit Hole <PERSON>Jennifer Lawrence</PERSON> <ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION> categories Glee nominee nods followed Rock Boardwalk Empire Dexter Good Wife Mad Men Modern Family Pillars Earth Temple <PERSON>Grandin</PERSON> tied nods apiece awards announced Jan
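The task is to go through the data line by line, replace the spaces inside each <></> tag pair with underscores (_), and convert the form of the data. For example, <X>A B C</X> must become A_B_C/X.
Regular expression quantifiers are greedy by default, that is, they match as far to the right as possible, which makes this quite awkward. A `?` on its own does not truly guarantee that the match stops at the right closing tag either. So the tag name matched earlier has to be given a group name inside the pattern, so that the closing tag can refer back to it.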
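One way to see the difference is a small experiment. The sketch below is only an illustration: the two test strings are made up, and the sample data above contains no nested tags, so the nested case is an assumption used to show what the backreference adds.

```python
import re

# Hypothetical test strings, not taken from the data set.
two_tags = '<A>x y</A> and <B>p q</B>'
nested = '<A>x <B>y</B> z</A>'

# Greedy: .+ runs on to the last closing tag of the line.
greedy = re.search(r'<\S+>.+</\S+>', two_tags).group(0)

# Lazy: .+? stops at the first closing tag, whatever its name is.
lazy_flat = [m.group(0) for m in re.finditer(r'<\S+>.+?</\S+>', two_tags)]
lazy_nested = [m.group(0) for m in re.finditer(r'<\S+>.+?</\S+>', nested)]

# Named group plus backreference: the closing tag must repeat the opening name.
named = re.compile(r'<(?P<label>\S+)>.+?</(?P=label)>')
named_flat = [m.group(0) for m in named.finditer(two_tags)]
named_nested = [m.group(0) for m in named.finditer(nested)]

print(greedy)        # the whole line, including " and "
print(lazy_nested)   # cut off at </B>
print(named_nested)  # runs to the matching </A>
```

On flat tags the lazy pattern and the named pattern agree; they differ only when a closing tag with a different name appears first.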
In Python, the final regular expression looks like this:
    LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
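Next we need to find the matching substrings in the original string and record them. After that we precompute what each one should be replaced with, which takes just one more regular expression:
    LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')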
Map over the whole list of strings, run both patterns on each string, and zip the two sets of matches together. This gives tuples of (original substring, replacement), roughly like this:
    ('<LOCATION>LOS ANGELES</LOCATION>', 'LOS_ANGELES/LOCATION')
    ('<ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION>', 'Dec_Xinhua_Kings_Speech/ORGANIZATION')
    ('<ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION>', 'Social_Network_Black_Swan_Fighter_Inception_Kings_Speech/ORGANIZATION')
    ('<PERSON>Firth</PERSON>', 'Firth/PERSON')
    ('<PERSON>Helena Bonham</PERSON>', 'Helena_Bonham/PERSON')
    ('<PERSON>Geoffrey Rush</PERSON>', 'Geoffrey_Rush/PERSON')
    ('<PERSON>Tom Hooper</PERSON>', 'Tom_Hooper/PERSON')
    ('<PERSON>David Seidler</PERSON>', 'David_Seidler/PERSON')
    ('<ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION>', 'Alexandre_Desplat_Social_Network_Fighter/ORGANIZATION')
    ('<ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION>', 'Alice_Wonderland_Burlesque_Kids_Right_Red_Tourist/ORGANIZATION')
    ('<ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION>', 'Firth_Kings_Speech_James_Franco_Hours_Ryan_Gosling_Blue_Valentine_Mark_Wahlberg_Fighter_Jesse_Eisenberg_Social_Network/ORGANIZATION')
    ('<PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON>', 'Halle_Berry_Frankie_Alice_Nicole_Kidman/PERSON')
    ('<PERSON>Jennifer Lawrence</PERSON>', 'Jennifer_Lawrence/PERSON')
    ('<ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION>', 'Winters_Bone_Natalie_Portman_Black_Swan_Michelle_Williams_Blue_Valentine_TV/ORGANIZATION')
    ('<PERSON>Grandin</PERSON>', 'Grandin/PERSON')
    ('<LOCATION>BEIJING</LOCATION>', 'BEIJING/LOCATION')
    ('<ORGANIZATION>Xinhua Sanlu Group</ORGANIZATION>', 'Xinhua_Sanlu_Group/ORGANIZATION')
    ('<LOCATION>Gansu</LOCATION>', 'Gansu/LOCATION')
    ('<ORGANIZATION>Sanlu</ORGANIZATION>', 'Sanlu/ORGANIZATION')
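To make the zip step concrete, here is a minimal sketch on one short line. The line is a shortened, made-up excerpt of the sample data, and the variable names `whole`, `inner` and `pairs` are only for illustration. Both patterns contain two groups, so `findall` returns 2-tuples, in a different order for each pattern.

```python
import re

LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

# Shortened excerpt of the sample line, for illustration only.
line = '<LOCATION>LOS ANGELES</LOCATION> best director <PERSON>Tom Hooper</PERSON>'

# Each item is (whole tagged substring, label).
whole = LABEL_PATTERN.findall(line)
# Each item is (label, inner text).
inner = LABEL_CONTENT_PATTERN.findall(line)

# Pair the n-th tagged substring with its n-th replacement.
pairs = [(w[0], text.replace(' ', '_') + '/' + label)
         for w, (label, text) in zip(whole, inner)]

for pair in pairs:
    print(pair)
```

The pairing relies on both patterns finding the same tags in the same order, which holds as long as no tag pair is empty (`.+?` needs at least one character, `.*?` does not).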
The processing code is as follows:
    def read_file(path):
        # Return the file's lines, or an empty list if the file is missing.
        if not os.path.exists(path):
            print("path : '" + path + "' not found.")
            return []
        # The with block closes the file, so no explicit close() is needed.
        with open(path, 'r') as fp:
            return fp.read().split('\n')

    def get_label(each):
        # Pair every tagged substring with its replacement,
        # e.g. ('<X>A B C</X>', 'A_B_C/X').
        pair = zip(LABEL_PATTERN.findall(each),
                   map(lambda x: x[1].replace(' ', '_') + '/' + x[0],
                       LABEL_CONTENT_PATTERN.findall(each)))
        return [(x[0][0], x[1]) for x in pair]

    src = read_file(FILE_PATH)
    # A list (not a lazy map object) so that it can be indexed later.
    pattern = [get_label(each) for each in src]
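After that, a simple pass does the replacement:
    for i in range(len(src)):
        for pat in pattern[i]:
            # pat[0] is literal text, so use str.replace; re.sub would
            # interpret it as a regular expression.
            src[i] = src[i].replace(pat[0], pat[1])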
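As an alternative to collecting pairs first and replacing afterwards, the same result can be had in a single pass by giving `re.sub` a function as the replacement. This is a different technique from the one used in this post, shown only as a sketch: `TAG_PATTERN`, `convert_tags` and the stricter `[^<>\s]+` tag-name class are assumptions of mine, not part of the original code.

```python
import re

# Assumption: tag names contain no whitespace and no angle brackets.
TAG_PATTERN = re.compile(r'<(?P<label>[^<>\s]+)>(?P<text>.*?)</(?P=label)>')

def convert_tags(line):
    """Rewrite every <X>A B C</X> in line as A_B_C/X."""
    def repl(match):
        # Called once per match; the return value replaces the whole tag pair.
        return match.group('text').replace(' ', '_') + '/' + match.group('label')
    return TAG_PATTERN.sub(repl, line)

print(convert_tags('<X>A B C</X>'))
print(convert_tags('in <LOCATION>LOS ANGELES</LOCATION> by <PERSON>Tom Hooper</PERSON>'))
```

Because the replacement is computed from the match itself, nothing has to be escaped and the two-pattern zip is not needed.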
The complete code:
    # -*- coding: utf-8 -*-
    import re
    import os

    # FILE_PATH = '/home/kirai/workspace/sina_news_process/disworded_sina_news_attr_handled.txt'
    FILE_PATH = '/home/kirai/workspace/sina_news_process/test.txt'
    # Matches a whole tag pair; the closing tag must repeat the opening name.
    LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
    # Same idea, but captures the label and the inner text separately.
    LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

    def read_file(path):
        # Return the file's lines, or an empty list if the file is missing.
        if not os.path.exists(path):
            print("path : '" + path + "' not found.")
            return []
        # The with block closes the file, so no explicit close() is needed.
        with open(path, 'r') as fp:
            return fp.read().split('\n')

    def get_label(each):
        # Pair every tagged substring with its replacement,
        # e.g. ('<X>A B C</X>', 'A_B_C/X').
        pair = zip(LABEL_PATTERN.findall(each),
                   map(lambda x: x[1].replace(' ', '_') + '/' + x[0],
                       LABEL_CONTENT_PATTERN.findall(each)))
        return [(x[0][0], x[1]) for x in pair]

    src = read_file(FILE_PATH)
    # A list (not a lazy map object) so that it can be indexed below.
    pattern = [get_label(each) for each in src]

    for i in range(len(src)):
        for pat in pattern[i]:
            # pat[0] is literal text, so use str.replace; re.sub would
            # interpret it as a regular expression.
            src[i] = src[i].replace(pat[0], pat[1])