I have a list of tokenized text (list_of_words) that looks something like this:
list_of_words =
['08/20/2014',
'10:04:27',
'pm',
'complet',
'vendor',
'per',
'mfg/recommend',
'08/20/2014',
'10:04:27',
'pm',
'complet',
...]
and I'm trying to strip out all the instances of dates and times from this list. I've tried using the .remove() function, to no avail. I've also tried passing wildcard patterns, such as '../../....', to a list of stopwords I was filtering with, but that didn't work. I finally tried writing the following code:
for line in list_of_words:
    if re.search('[0-9]{2}/[09]{2}/[0-9]{4}',line):
        list_of_words.remove(line)
but that doesn't work either. How can I strip out everything formatted like a date or time from my list?
3 Solutions
#1
8
Description
^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$
This regular expression will do the following:
- find strings which look like dates (12/23/2016) and times (12:34:56)
- find strings which are exactly am or pm, which are probably part of the preceding time in the source list
Example
Live Demo
- Regex: https://regex101.com/r/yE8oB9/2
- Python: http://codepad.org/X9D3pd7s
Sample List
08/20/2014
10:04:27
pm
complete
vendor
per
mfg/recommend
08/20/2014
10:04:27
pm
complete
List After Processing
complete
vendor
per
mfg/recommend
complete
Sample Python Script
import re
SourceList = ['08/20/2014',
              '10:04:27',
              'pm',
              'complete',
              'vendor',
              'per',
              'mfg/recommend',
              '08/20/2014',
              '10:04:27',
              'pm',
              'complete']
# Keep only the words that are not a date, a time, or an am/pm marker.
OutputList = filter(
    lambda ThisWord: not re.match(r'^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$', ThisWord),
    SourceList)
for ThisValue in OutputList:
    print(ThisValue)
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
(?: group, but do not capture (2 times):
----------------------------------------------------------------------
[0-9]{2} any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
[:\/,] any character of: ':', '\/', ','
----------------------------------------------------------------------
){2} end of grouping
----------------------------------------------------------------------
[0-9]{2,4} any character of: '0' to '9' (between 2
and 4 times (matching the most amount
possible))
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
am 'am'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
pm 'pm'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
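For readers on Python 3, the same filtering can be sketched with a precompiled pattern and a list comprehension. This is only an illustration built on the regex above: the names DATE_TIME_TOKEN and strip_dates_and_times are made up for this example, and re.fullmatch is used in place of the ^ and $ anchors.

```python
import re

# Same alternatives as the regex above; fullmatch() requires the whole token
# to match, so the ^ and $ anchors are left out here.
# DATE_TIME_TOKEN and strip_dates_and_times are hypothetical names.
DATE_TIME_TOKEN = re.compile(r'(?:(?:[0-9]{2}[:/,]){2}[0-9]{2,4}|am|pm)')

def strip_dates_and_times(words):
    """Return a new list without date, time, or am/pm tokens."""
    return [word for word in words if not DATE_TIME_TOKEN.fullmatch(word)]

source_list = ['08/20/2014', '10:04:27', 'pm', 'complete', 'vendor',
               'per', 'mfg/recommend', '08/20/2014', '10:04:27', 'pm',
               'complete']

cleaned = strip_dates_and_times(source_list)
print(cleaned)
# ['complete', 'vendor', 'per', 'mfg/recommend', 'complete']
```

Building a new list leaves the source list untouched, and because the whole token must match, words such as 'per' or 'mfg/recommend' are kept.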
#2
6
If you want to match the time and date strings in your list, you can try the regex below:
[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}
The Python code:
import re
list_of_words = [
'08/20/2014',
'10:04:27',
'pm',
'complet',
'vendor',
'per',
'mfg/recommend',
'08/20/2014',
'10:04:27',
'pm',
'complet'
]
new_list = [item for item in list_of_words if not re.search(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', item)]
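One thing worth noticing about this pattern is that it only covers the date and time tokens, so the 'pm' entries stay in the list. The sketch below shows that result and one possible way to drop the markers in a second pass; MERIDIEM_MARKERS and without_markers are hypothetical names introduced for this example, not part of the answer.

```python
import re

list_of_words = ['08/20/2014', '10:04:27', 'pm', 'complet', 'vendor',
                 'per', 'mfg/recommend', '08/20/2014', '10:04:27', 'pm',
                 'complet']

# The pattern from the answer above, written without the redundant escape.
date_or_time = re.compile(r'[0-9]{2}[/,:][0-9]{2}[/,:][0-9]{2,4}')

new_list = [item for item in list_of_words if not date_or_time.search(item)]
print(new_list)
# ['pm', 'complet', 'vendor', 'per', 'mfg/recommend', 'pm', 'complet']

# Hypothetical second pass: drop the am/pm markers that belonged to the times.
MERIDIEM_MARKERS = {'am', 'pm'}
without_markers = [item for item in new_list if item not in MERIDIEM_MARKERS]
print(without_markers)
# ['complet', 'vendor', 'per', 'mfg/recommend', 'complet']
```

Because re.search is not anchored, a token that merely contains a date-like substring would also be dropped. If only whole tokens should be removed, re.fullmatch is the stricter choice.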
#3
1
Try this:
import re
list_of_words = ['08/20/2014',
                 '10:04:27',
                 'pm',
                 'complet',
                 'vendor',
                 'per',
                 'mfg/recommend',
                 '08/20/2014',
                 '10:04:27',
                 'pm', 'complet']
# list() is needed on Python 3, where filter() returns a lazy iterator.
list_of_words = list(filter(
    lambda x: not re.match(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', x),
    list_of_words))
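As background on why the loop in the question fails, two things seem to be going on: the character class [09] matches only the digits 0 and 9, so a day such as 20 never matches, and calling .remove() on a list while iterating over it makes the loop skip the element that follows each removal. The sketch below illustrates both points; the three-item list is a made-up input chosen to expose the skipping.

```python
import re

# 1) The typo: [09] matches only '0' or '9', so the day '20' never matches.
typo_pattern = re.compile(r'[0-9]{2}/[09]{2}/[0-9]{4}')
print(typo_pattern.search('08/20/2014'))  # None

# 2) Even with the pattern fixed, removing while iterating skips items.
date_pattern = re.compile(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

# Hypothetical input with two adjacent dates.
words = ['08/20/2014', '08/21/2014', 'vendor']

mutated = list(words)
for token in mutated:
    if date_pattern.search(token):
        mutated.remove(token)  # later items shift left; the next one is skipped

# Building a new list avoids the problem.
rebuilt = [token for token in words if not date_pattern.search(token)]

print(mutated)  # ['08/21/2014', 'vendor']
print(rebuilt)  # ['vendor']
```

This is why the answers here build a filtered copy with filter() or a list comprehension rather than modifying the list in place.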