I have a list of tokenized text (list_of_words) that looks something like this:

list_of_words = 
['08/20/2014',
 '10:04:27',
 'pm',
 'complet',
 'vendor',
 'per',
 'mfg/recommend',
 '08/20/2014',
 '10:04:27',
 'pm',
 'complet',
 ...]

and I'm trying to strip all dates and times out of this list. I've tried the .remove() method, to no avail. I've also tried adding wildcard patterns, such as '../../....', to the stopword list I was using, but that didn't work. Finally, I tried the following code:

for line in list_of_words:
    if re.search('[0-9]{2}/[09]{2}/[0-9]{4}',line):
        list_of_words.remove(line)

but that doesn't work either. How can I strip out everything formatted like a date or time from my list?

3 Solutions

#1

Description

^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$

This regular expression will do the following:

find strings that look like dates (12/23/2016) or times (12:34:56)
find strings that are exactly am or pm, which are probably part of the preceding time in the source list
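Note that the pattern matches by shape only, so a token such as 99/99/9999 also counts as a date. If only real dates and times should be removed, one option is to try parsing each token instead. The following is a minimal sketch, not part of the original answer; the two format strings are assumptions inferred from the sample tokens, and `is_date_or_time` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed formats, inferred from the sample tokens; adjust for real data.
ASSUMED_FORMATS = ('%m/%d/%Y', '%H:%M:%S')


def is_date_or_time(token):
    """Return True if the token parses under one of the assumed formats."""
    for fmt in ASSUMED_FORMATS:
        try:
            datetime.strptime(token, fmt)
            return True
        except ValueError:
            continue
    return False


# Hypothetical tokens; '99/99/9999' has the shape of a date but is not a valid one.
tokens = ['08/20/2014', '10:04:27', 'pm', 'complete', '99/99/9999', 'mfg/recommend']

# Drop parsable dates/times and the bare am/pm markers.
kept = [t for t in tokens if not is_date_or_time(t) and t not in ('am', 'pm')]
print(kept)
```

Whether an invalid-looking token should be kept or dropped depends on the data, so the regex may still be the better fit when the goal is simply to remove anything date-shaped.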

Example

Live Demo

Regex: https://regex101.com/r/yE8oB9/2
Python: http://codepad.org/X9D3pd7s

Sample List

08/20/2014
10:04:27
pm
complete
vendor
per
mfg/recommend
08/20/2014
10:04:27
pm
complete

List After Processing

complete
vendor
per
mfg/recommend
complete

Sample Python Script

import re

SourceList = ['08/20/2014',
                 '10:04:27',
                 'pm',
                 'complete',
                 'vendor',
                 'per',
                 'mfg/recommend',
                 '08/20/2014',
                 '10:04:27',
                 'pm', 
                 'complete']

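OutputList = filter(
    # Raw string so the backslash reaches the regex engine unchanged.
    lambda ThisWord: not re.match(r'^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$', ThisWord),
    SourceList)

# In Python 3, filter() returns a lazy iterator; looping over it once is fine.
for ThisValue in OutputList:
    print(ThisValue)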

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    (?:                      group, but do not capture (2 times):
----------------------------------------------------------------------
      [0-9]{2}                 any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
      [:\/,]                   any character of: ':', '\/', ','
----------------------------------------------------------------------
    ){2}                     end of grouping
----------------------------------------------------------------------
    [0-9]{2,4}               any character of: '0' to '9' (between 2
                             and 4 times (matching the most amount
                             possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    am                       'am'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    pm                       'pm'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
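The breakdown above can also be kept next to the pattern itself by using `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the pattern. This is a sketch of an equivalent spelling, not something from the original answer; the names `compact` and `verbose` are illustrative only.

```python
import re

# The pattern exactly as given in the Description.
compact = re.compile(r'^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$')

# The same pattern with the explanation written inline.
verbose = re.compile(r'''
    ^                      # beginning of the string
    (?:
        (?:
            [0-9]{2}       # two digits
            [:/,]          # one separator: colon, slash, or comma
        ){2}               # the digits-plus-separator pair, twice
        [0-9]{2,4}         # two to four trailing digits
      | am
      | pm
    )
    $                      # end of the string
''', re.VERBOSE)

# Both spellings should agree on every token.
for token in ['08/20/2014', '10:04:27', 'pm', 'complete', 'mfg/recommend']:
    print(token, bool(compact.match(token)), bool(verbose.match(token)))
```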

#2

If you want to match the time and date strings in your list, you can try the regex below:

[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}

The Python code:

import re

list_of_words = [
 '08/20/2014',
 '10:04:27',
 'pm',
 'complet',
 'vendor',
 'per',
 'mfg/recommend',
 '08/20/2014',
 '10:04:27',
 'pm',
 'complet'
]
new_list = [item for item in list_of_words if not re.search(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', item)]
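The list comprehension builds a new list instead of removing items in place, which sidesteps two problems in the question's loop. First, the class `[09]` matches only the characters 0 and 9, so the question's pattern never matches '08/20/2014'. Second, calling `.remove()` on a list while iterating over it shifts the remaining items left, so the item right after each removal is skipped. The sketch below illustrates the second problem with a corrected pattern; the sample list with two adjacent dates is hypothetical, chosen to make the skip visible.

```python
import re

# Corrected version of the question's date pattern ([0-9] in every position).
date_pattern = re.compile(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

# Hypothetical list with two adjacent dates, chosen to expose the skipping problem.
words = ['08/20/2014', '10:04:27', 'pm', 'complet', '08/20/2014', '08/20/2014', 'vendor']

# In-place removal while iterating: the list shifts left, so the next item is skipped.
in_place = list(words)
for word in in_place:
    if date_pattern.search(word):
        in_place.remove(word)

# Building a new list visits every item exactly once.
rebuilt = [word for word in words if not date_pattern.search(word)]

print(in_place)  # one date survives
print(rebuilt)   # no dates left
```

Note that this date-only pattern leaves the time and 'pm' tokens in place; the broader patterns in the answers handle those.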

#3

Try this:

import re

list_of_words = ['08/20/2014',
                 '10:04:27',
                 'pm',
                 'complet',
                 'vendor',
                 'per',
                 'mfg/recommend',
                 '08/20/2014',
                 '10:04:27',
                 'pm', 'complet']

# list() materializes the result, since filter() returns an iterator in Python 3.
list_of_words = list(filter(
    lambda x: not re.match(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', x),
    list_of_words))
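Answers #2 and #3 use the same unanchored pattern but different functions: `re.search` looks for the pattern anywhere in the token, while `re.match` only requires it at the start. Neither insists that the whole token is a date or time; `re.fullmatch` does. The following sketch compares the three on a few hypothetical tokens that are not in the original list.

```python
import re

pattern = re.compile(r'[0-9]{2}[/,:][0-9]{2}[/,:][0-9]{2,4}')

# Hypothetical tokens chosen to show the differences.
tokens = ['08/20/2014', 'rev08/20/2014', '08/20/2014-final']

# Map each token to (match, search, fullmatch) results as booleans.
results = {
    token: (
        bool(pattern.match(token)),      # pattern must start at position 0
        bool(pattern.search(token)),     # pattern may start anywhere
        bool(pattern.fullmatch(token)),  # pattern must cover the whole token
    )
    for token in tokens
}

for token, flags in results.items():
    print(token, flags)
```

Which behavior is appropriate depends on whether tokens that merely contain a date should also be dropped.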
