Apologies if this is a simple question; I'm still pretty new to this, but I've spent a while looking for an answer and haven't found anything. I have a list that looks something like this horrifying mess:
['Organization name} ', '> (777) 777-7777} ', ' class="lsn-mB6 adr">1 Address, MA 02114 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4603114\'); ', 'Other organization} ', '> (555) 555-5555} ', ' class="lsn-mB6 adr">301 Address, MA 02121 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO CLAIM YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4715945\'); ', 'Organization} ']
I need to process it so that HTML.py can turn the information in it into a table. For some reason, HTML.py simply can't handle the monster elements (e.g. 'class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4603114\'); ', etc.). Fortunately for me, I don't actually care about the information in the monster elements and want to get rid of them.
I tried writing a regex that would match all more-than-two-letter all-caps words, to identify the monster elements, and got this:
re.compile('[^a-z]*[A-Z][^a-z]*\w{3,}')
But I don't know how to use it to delete the elements that contain a match for that regex from the list. How would I do that, and is it the right way to go about it?
5 Solutions
#1
21
I think your regex is incorrect. To match all entries that contain an all-caps word of three or more characters, you should use something like this with re.search:
regex = re.compile(r'\b[A-Z]{3,}\b')
With that you can filter using a list comprehension or the filter built-in function:
full = ['Organization name} ', '> (777) 777-7777} ', ' class="lsn-mB6 adr">1 Address, MA 02114 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4603114\'); ', 'Other organization} ', '> (555) 555-5555} ', ' class="lsn-mB6 adr">301 Address, MA 02121 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO CLAIM YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4715945\'); ', 'Organization} ']
import re
regex = re.compile(r'\b[A-Z]{3,}\b')
# use only one of the following lines, whichever you prefer
# (on Python 3, filter returns an iterator, so wrap it in list())
filtered = list(filter(lambda i: not regex.search(i), full))
filtered = [i for i in full if not regex.search(i)]
This results in the following list (which I think is what you are looking for):
>>> pprint.pprint(filtered)
['Organization name} ',
'> (777) 777-7777} ',
' class="lsn-mB6 adr">1 Address, MA 02114 } ',
'Other organization} ',
'> (555) 555-5555} ',
' class="lsn-mB6 adr">301 Address, MA 02121 } ',
'Organization} ']
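To see why the pattern from the question misfires, here is a small comparison sketch. It assumes an abridged copy of the question's list (the last item is shortened), and the variable names are illustrative only:

```python
import re

# Abridged sample of the list from the question (the last item is shortened).
items = [
    'Organization name} ',
    '> (777) 777-7777} ',
    ' class="lsn-mB6 adr">1 Address, MA 02114 } ',
    ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP',
]

# Pattern from the question, applied with match() (anchored at the start).
question_pattern = re.compile(r'[^a-z]*[A-Z][^a-z]*\w{3,}')
# Pattern from this answer, applied with search() (anywhere in the string).
allcaps_pattern = re.compile(r'\b[A-Z]{3,}\b')

flagged_by_question = [bool(question_pattern.match(s)) for s in items]
flagged_by_allcaps = [bool(allcaps_pattern.search(s)) for s in items]

print(flagged_by_question)
print(flagged_by_allcaps)
```

On this sample, the question's pattern flags only the organization name, because a capitalized word followed by three word characters satisfies it, while the all-caps pattern flags only the monster element.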
#2
4
First, store your regex, then use a list comprehension:
import re
regex = re.compile(r'[^a-z]*[A-Z][^a-z]*\w{3,}')
# match() only tests at the start of each string; use search() to look anywhere
okay_items = [x for x in all_items if not regex.match(x)]
#3
1
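Without regex:
def isNotMonster(x):
    # A "monster" element contains an all-caps word longer than two characters.
    # str.isupper() ignores tokens with no letters, such as '(777)' or '02114'.
    return not any(len(word) > 2 and word.isupper() for word in x.split())
okay_items = list(filter(isNotMonster, all_items))  # list() for Python 3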
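One subtlety with the all-caps test: comparing a token to its own upper() is also true for tokens that contain no letters at all, such as '(777)', whereas str.isupper() requires at least one cased character. A quick sketch, using a few tokens taken from the question's data:

```python
# Tokens as they would come out of x.split() on the question's list items.
tokens = ['MORE', '(777)', '02114', 'MA', 'Address,']

# Tokens that equal their own uppercase form (includes letterless tokens).
by_compare = [t for t in tokens if t == t.upper()]
# Tokens that str.isupper() accepts (needs at least one cased character).
by_isupper = [t for t in tokens if t.isupper()]

print(by_compare)
print(by_isupper)
```

With the comparison, the phone number and zip code tokens count as "all caps", so the elements holding them would be dropped along with the monster elements.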
#4
0
Or the very same but without compiling regex:
from re import match
ll = ['Organization name} ', '> (777) 777-7777} ', ' class="lsn-mB6 adr">1 Address, MA 02114 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4603114\'); ', 'Other organization} ', '> (555) 555-5555} ', ' class="lsn-mB6 adr">301 Address, MA 02121 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO CLAIM YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4715945\'); ', 'Organization} ']
filteredData = [x for x in ll if not match(r'[^a-z]*[A-Z][^a-z]*\w{3,}', x)]
Edited:
import re  # avoids shadowing the built-in compile()
rex = re.compile(r'[^a-z]*[A-Z][^a-z]*\w{3,}')
filteredData = [x for x in ll if not rex.match(x)]
#5
0
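element = 'string_to_search'
# Rebuild the list in place rather than calling remove() while iterating,
# which can skip the item that follows each removal.
y_list_of_items[:] = [item for item in y_list_of_items if element not in item]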
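Removing items from a list while looping over that same list can skip elements, because the list shrinks under the iterator. A small sketch with made-up data (all names and values below are illustrative, not from the question):

```python
items = ['spam A', 'spam B', 'ham', 'spam C']
element = 'spam'

# Removing during iteration: the item right after each removal is skipped.
buggy = list(items)
for item in buggy:
    if element in item:
        buggy.remove(item)

# Building a new list visits every item exactly once.
safe = [item for item in items if element not in item]

print(buggy)
print(safe)
```

Here 'spam B' survives the loop version because it slides into the position the iterator has already passed, while the list comprehension leaves only 'ham'.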