Python regular expression (regex) to match comma-separated numbers: why doesn't it work?

Date: 2021-01-21 22:29:02

I am trying to parse transaction letters from my (German) bank. I'd like to extract all the numbers from the following string which turns out to be harder than I thought. Option 2 does almost what I want. I now want to modify it to capture e.g. 80 as well.

My first try is option 1 which only returns garbage. Why is it returning so many empty strings? It should always have at least a number from the first \d+, no?

Option 3 works (or at least works as expected), so somehow I am answering my own question. I guess I'm mostly banging my head about why option 2 does not work.

# -*- coding: utf-8 -*-
import re


my_str = """
Dividendengutschrift für inländische Wertpapiere

Depotinhaber    : ME

Extag           :  18.04.2013          Bruttodividende
Zahlungstag     :  18.04.2013          pro Stück       :       0,9800 EUR
Valuta          :  18.04.2013

                                       Bruttodividende :        78,40 EUR
                                      *Einbeh. Steuer  :        20,67 EUR
                                       Nettodividende  :        78,40 EUR

                                       Endbetrag       :        57,73 EUR
"""

print(re.findall(r'\d+(,\d+)?', my_str))          # option 1
print(re.findall(r'\d+,\d+', my_str))             # option 2
print(re.findall(r'[-+]?\d*,\d+|\d+', my_str))    # option 3

Output is

['', '', '', '', '', '', ',9800', '', '', '', ',40', ',67', ',40', ',73']
['0,9800', '78,40', '20,67', '78,40', '57,73']
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']

6 answers

#1


8  

Option 1 is the most suitable of your regexes, but it does not work as written: when a pattern contains a capture group (), findall returns what the group matched, not the complete match.

For example, the first three matches in your example will be the 18, 04 and 2013, and in each case the capture group will be unmatched so an empty string will be added to the results list.

The solution is to make the group non-capturing

r'\d+(?:,\d+)?'

Option 2 fails only in that it won't match sequences that don't contain a comma.

Option 3 isn't great because it will match e.g. +,1.
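
To make the difference concrete, here is a minimal sketch on a shortened sample line (the variable names are only for illustration). It compares the capturing and non-capturing versions of the pattern, and shows that finditer with group(0) is another way to get the complete match:

```python
import re

# Shortened sample, taken from the statement in the question.
sample = "Extag : 18.04.2013   pro Stück : 0,9800 EUR   Bruttodividende : 78,40 EUR"

# Capturing group: findall returns only the group's text.
# The group is optional, so plain integers produce an empty string.
captured = re.findall(r'\d+(,\d+)?', sample)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'\d+(?:,\d+)?', sample)

# finditer keeps the capturing group; group(0) is always the full match.
via_finditer = [m.group(0) for m in re.finditer(r'\d+(,\d+)?', sample)]

print(captured)      # ['', '', '', ',9800', ',40']
print(whole)         # ['18', '04', '2013', '0,9800', '78,40']
print(via_finditer)  # ['18', '04', '2013', '0,9800', '78,40']
```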

#2


3  

I'd like to extract all the numbers from the following string ...

If by "numbers" you mean both the currency amounts and the dates, I think this will do what you want:

print(re.findall(r'[0-9][0-9,.]+', my_str))

Output:

['18.04.2013', '18.04.2013', '0,9800', '18.04.2013', '78,40', '20,67', '78,40', '57,73']

If by "numbers" you mean only the currency amounts, then use

print(re.findall(r'[0-9]+,[0-9]+', my_str))

Or perhaps better yet,

print(re.findall(r'[0-9]+,[0-9]+ EUR', my_str))
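
Note that the last pattern returns the " EUR" suffix as part of each match. If only the amount is wanted, one possible variation (a sketch, not part of the original answer) is to put a capture group around the number, so that findall returns just the group:

```python
import re

# A few lines in the layout of the statement from the question.
sample = """
pro Stück       :       0,9800 EUR
Bruttodividende :        78,40 EUR
Valuta          :  18.04.2013
"""

# Whole match: the currency suffix is included in every result.
with_suffix = re.findall(r'[0-9]+,[0-9]+ EUR', sample)

# One capture group: findall returns only the amount,
# while " EUR" still has to be present for a match.
amounts_only = re.findall(r'([0-9]+,[0-9]+) EUR', sample)

print(with_suffix)   # ['0,9800 EUR', '78,40 EUR']
print(amounts_only)  # ['0,9800', '78,40']
```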

#3


2  

Here is a solution that parses the statement and puts the result in a dictionary called bank_statement:

# -*- coding: utf-8 -*-

my_str = """
Dividendengutschrift für inländische Wertpapiere

Depotinhaber    : ME

Extag           :  18.04.2013          Bruttodividende
Zahlungstag     :  18.04.2013          pro Stück       :       0,9800 EUR
Valuta          :  18.04.2013

                                       Bruttodividende :        78,40 EUR
                                      *Einbeh. Steuer  :        20,67 EUR
                                       Nettodividende  :        78,40 EUR

                                       Endbetrag       :        57,73 EUR
"""

bank_statement = {}

for line in my_str.split('\n'):
    tokens = line.split()
    #print tokens


    it = iter(tokens)
    category = ''
    for token in it:
        if token == ':':
            category = category.strip(' *')
            bank_statement[category] = next(it)
            category = ''
        else:
            category += ' ' + token

# bank_statement now holds every "category : value" pair
print('\n'.join('{0:.<18} {1}'.format(k, v)
                for k, v in sorted(bank_statement.items())))

The output of this code:

Bruttodividende... 78,40
Depotinhaber...... ME
Einbeh. Steuer.... 20,67
Endbetrag......... 57,73
Extag............. 18.04.2013
Nettodividende.... 78,40
Valuta............ 18.04.2013
Zahlungstag....... 18.04.2013
pro Stück......... 0,9800

Discussion

  • The code scans the statement string line by line.
  • It then breaks each line into tokens.
  • It scans through the tokens looking for a colon. When it finds one, it uses the part before the colon as the category and the token after it as the value. bank_statement['Extag'], for example, has the value '18.04.2013'.
  • Note that all the values are strings, not numbers, but converting them is trivial.
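
As a sketch of that conversion step: the helper name parse_german_decimal is made up for this example, and it assumes German number formatting, where "," is the decimal mark and "." appears only as a thousands separator. It should therefore be applied to the amount fields only, not to dates such as '18.04.2013'.

```python
from decimal import Decimal


def parse_german_decimal(text):
    """Convert a string such as '78,40' or '1.234,56' to a Decimal.

    Assumption: '.' is a thousands separator and ',' is the decimal mark.
    """
    normalized = text.replace('.', '').replace(',', '.')
    return Decimal(normalized)


# A subset of the dictionary built by the parsing loop.
bank_statement = {
    'Bruttodividende': '78,40',
    'Einbeh. Steuer': '20,67',
    'Endbetrag': '57,73',
}

amounts = {key: parse_german_decimal(value)
           for key, value in bank_statement.items()}

# Gross dividend minus withheld tax should equal the final amount.
net = amounts['Bruttodividende'] - amounts['Einbeh. Steuer']
print(net)                          # 57.73
print(net == amounts['Endbetrag'])  # True
```

Decimal is used rather than float so that monetary values are not affected by binary rounding.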

#4


1  

This question is relevant; the following

print(re.findall(r'\d+(?:,\d+)?', my_str))
                       ^^     

outputs

['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']

Excluding the "dotted" numbers is a little more complicated:

print(re.findall(r'(?<!\d\.)\b\d+(?:,\d+)?\b(?!\.\d)', my_str))
                   ^^^^^^^^^^^            ^^^^^^^^^^

This outputs

['0,9800', '78,40', '20,67', '78,40', '57,73']
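
The lookaround pattern is easier to read when split up with re.VERBOSE. The sketch below is the same regex with one comment per part. The sample line is shortened, and the "Anzahl : 80" field is invented here to show that a plain integer such as 80 is captured while the pieces of a dotted date are not:

```python
import re

AMOUNT = re.compile(r'''
    (?<!\d\.)    # not preceded by digit + dot (rejects 04 and 2013 in a date)
    \b\d+        # integer part, starting at a word boundary
    (?:,\d+)?    # optional decimal part after a comma
    \b           # must end at a word boundary
    (?!\.\d)     # not followed by dot + digit (rejects 18 and 04 in a date)
''', re.VERBOSE)

# "Anzahl : 80" is a hypothetical field, not part of the original statement.
sample = "Zahlungstag : 18.04.2013   pro Stück : 0,9800 EUR   Anzahl : 80"

numbers = AMOUNT.findall(sample)
print(numbers)  # ['0,9800', '80']
```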

#5


0  

Try this one:

re.findall(r'\d+(?:[\d,.]*\d)', my_str)

This regex requires that the match starts with a digit, continues with any mix of digits, commas and periods, and ends with a digit too. As written it needs at least two digits, so a lone single-digit number is not matched.

#6


0  

Option 2 doesn't match numbers like '18.04.2013' because you are matching '\d+,\d+' which means

digit (one or more) comma digit (one or more)

For parsing the numbers in your case I would use

\s(\d+[^\s]+)

which translates to

space (get digit [one or more] get everything != space)

space = \s
get digit = \d
one or more = + (so it becomes \d+)
get everything != space = [^\s]
one or more = + (so it becomes [^\s]+)
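
A quick way to see what this pattern accepts is to run it on a short line. In the sketch below, the "Ref 12abc" and trailing "7" parts are invented to show two edge cases. Because [^\s]+ (equivalent to \S+) accepts any non-space character, a token only has to start with a digit, and because both parts need at least one character, a single-digit token is skipped:

```python
import re

# "Ref 12abc" and the trailing "7" are made-up additions for illustration.
sample = "Extag : 18.04.2013  pro Stück : 0,9800 EUR  Ref 12abc  7"

# The capture group drops the leading whitespace from each result.
tokens = re.findall(r'\s(\d+[^\s]+)', sample)

print(tokens)  # ['18.04.2013', '0,9800', '12abc']
```

A number at the very start of the string would also be missed, since the pattern requires a whitespace character in front of it.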
