I have a long string (28 MB) of ordinary sentences. I want to remove all words that are written entirely in capital letters (like TNT, USA, OMG).
So from the sentence:
Jump over TNT in There.
I would like to get:
Jump over in There.
Is there a way to do this without splitting the text into a list and iterating over it? Is it possible to use a regex for this?
3 answers
#1
4
You can use the set of capital letters [A-Z], anchored on both sides with the word boundary \b:
import re
line = 'Jump over TNT in There NOW'
m = re.sub(r'\b[A-Z]+\b', '', line)
# m == 'Jump over  in There ' (the spaces around each removed word are left in place)
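If the leftover spaces matter, one possible variation is to let the pattern also consume the whitespace that follows each removed word. This is only a sketch: the helper name remove_all_caps and its min_len parameter are hypothetical and not part of the answer above, and min_len=2 is an assumption for the case where one-letter words such as "I" should be kept.

```python
import re

def remove_all_caps(text, min_len=1):
    """Remove words made only of A-Z, plus the whitespace right after them.

    min_len is a hypothetical knob: with min_len=2, one-letter words
    such as "I" or "A" are left alone.
    """
    pattern = r'\b[A-Z]{%d,}\b\s*' % min_len
    # strip() drops the space left over when the last word is removed
    return re.sub(pattern, '', text).strip()

print(remove_all_caps('Jump over TNT in There NOW'))
print(remove_all_caps('I saw TNT in There.', min_len=2))
```

This still runs as a single re.sub pass over the whole string, so the text is never split into a list.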
#2
2
Use the re module:
import re
line = 'Jump over TNT in There.'
new_line = re.sub(r'[A-Z]+(?![a-z])', '', line)
print(new_line)
# Output (the space on each side of TNT is left in place):
# Jump over  in There.
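Note that the lookahead pattern and the word-boundary pattern are not equivalent on mixed-case tokens. The short comparison below is a sketch using made-up sample text ('USAid iOS'); it shows that [A-Z]+(?![a-z]) can remove part of a word, while \b[A-Z]+\b only removes whole all-caps words.

```python
import re

sample = 'USAid iOS'  # hypothetical input with mixed-case tokens

# Lookahead version: removes any run of capitals not followed by a lowercase letter,
# even when the run sits inside a longer word.
lookahead_result = re.sub(r'[A-Z]+(?![a-z])', '', sample)

# Word-boundary version: removes a run of capitals only when it is a whole word.
boundary_result = re.sub(r'\b[A-Z]+\b', '', sample)

print(repr(lookahead_result))
print(repr(boundary_result))
```

For text that contains tokens like these, the word-boundary form is probably the safer choice.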
#3
1
I would do something like this:
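import string

def onlyUpper(word):
    # True only if every character in the word is an uppercase letter
    for c in word:
        if not c.isupper():
            return False
    return True

s = "Jump over TNT in There."

# Replace punctuation with spaces so "TNT." is treated the same as "TNT".
# Note that this also drops the punctuation from the final result.
for char in string.punctuation:
    s = s.replace(char, ' ')

words = s.split()

# Keep every word that is not written entirely in capitals
good_words = []
for w in words:
    if not onlyUpper(w):
        good_words.append(w)

result = " ".join(good_words)
print(result)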