Library to recognize Dutch dates in text?

Time: 2020-12-19 09:06:07

I have a series of strings (in Dutch) that contain a date in either a DD-MM-YYYY format or in a textual format DD month YYYY. See an example selection here: https://paste.ee/p/XZLha. I'm looking for a Python (2.7) library that is able to recognize the date from these text strings.

  • dateutil is not able to properly parse Dutch
  • dateparser is not able to parse fuzzy strings - it only accepts strings with dates and days. It can handle Dutch dates though.

I'd like to get your input on possible solutions. I'm considering stripping the text around the dates away and working with dateparser.

2 solutions

#1

Below is an example using regular expressions as @Shiva recommended. It will probably need some refinement but the concept is there:

import re

SOURCE_DATA_SAMPLE = """gedaan te Amsterdam, op 13-4-2010, door
gedaan te Amsterdam, op 13 april 2010, door
gedaan te Amsterdam, op 12 juni 2003, door
gedaan te Amsterdam, op 12 juni 2002, door
Aldus gedaan op 24 oktober 2003 door
Aldus gedaan op 5 december 2003 door
Aldus gedaan op 5 december 2003 door
Aldus gedaan op 8 april 2004 door
Aldus gedaan op 16 april 2004 door
Aldus gedaan op 23 april 2004 door
Aldus gedaan op 10 september 2004 door
Aldus gedaan op 30 september 2004 door"""

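# Match either DD-MM-YYYY or "DD month YYYY" (a single word as the month name).
# The raw string keeps the backslash escapes intact.
DATE_REGEX = re.compile(r"(\d{1,2}-\d{1,2}-\d{4})|(\d{1,2} \w+ \d{4})")

def find_date(line):
    # Return the first date-like substring in the line, or None if there is none.
    matched = DATE_REGEX.search(line)
    return matched.group(0) if matched else None

for line in SOURCE_DATA_SAMPLE.split("\n"):
    print(find_date(line))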

There's a great site called Regex101 that helps with writing and testing expressions; the sample I used for the above is here: https://regex101.com/r/wMFfx4/2

#2

The built-in datetime module's datetime.strptime() can parse dates in a number of formats, including locale-specific ones. You still need to extract the date from the text first, either with a regex or by some other means of analysis (e.g. the dates may sit in known, very specific parts of the text).
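As a rough sketch of the extract-then-parse idea: the snippet below pulls a date out of a line with a regex and hands it to datetime.strptime(). Parsing Dutch month names through the locale requires a Dutch locale (such as nl_NL) to be installed on the machine, so this sketch swaps in a hand-written month table instead. The names DUTCH_MONTHS, NUMERIC_DATE, TEXTUAL_DATE and parse_dutch_date are made up for illustration and are not part of any library.

```python
import re
from datetime import datetime

# Hypothetical lookup table (an assumption, not from the answer above):
# Dutch month names mapped to month numbers.
DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "augustus": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}

# DD-MM-YYYY, e.g. "13-4-2010"
NUMERIC_DATE = re.compile(r"\b(\d{1,2})-(\d{1,2})-(\d{4})\b")
# DD month YYYY, e.g. "24 oktober 2003"; only known Dutch month names match
TEXTUAL_DATE = re.compile(
    r"\b(\d{1,2}) (%s) (\d{4})\b" % "|".join(DUTCH_MONTHS), re.IGNORECASE
)

def parse_dutch_date(line):
    """Return a date found in line as a datetime.date, or None.

    The numeric form is tried first, then the textual form.
    """
    match = NUMERIC_DATE.search(line)
    if match:
        day, month, year = match.groups()
    else:
        match = TEXTUAL_DATE.search(line)
        if not match:
            return None
        day, name, year = match.groups()
        month = DUTCH_MONTHS[name.lower()]
    try:
        # strptime also validates the date (e.g. it rejects 31-2-2010)
        return datetime.strptime("%s-%s-%s" % (day, month, year), "%d-%m-%Y").date()
    except ValueError:
        return None

print(parse_dutch_date("gedaan te Amsterdam, op 13-4-2010, door"))
print(parse_dutch_date("Aldus gedaan op 24 oktober 2003 door"))
```

Restricting the textual pattern to known month names keeps it from matching arbitrary "number word number" sequences, at the cost of missing abbreviations or misspellings in the source text.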
