I am trying to parse transaction letters from my (German) bank. I'd like to extract all the numbers from the following string, which turns out to be harder than I thought. Option 2 does almost what I want; I now want to modify it to capture e.g. 80 as well.
My first try is option 1, which only returns garbage. Why is it returning so many empty strings? It should always have at least a number from the first \d+, no?
Option 3 works (or at least works as expected), so somehow I am answering my own question. I guess I'm mostly banging my head about why option 2 does not work.
# -*- coding: utf-8 -*-
import re
my_str = """
Dividendengutschrift für inländische Wertpapiere
Depotinhaber : ME
Extag : 18.04.2013 Bruttodividende
Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
Valuta : 18.04.2013
Bruttodividende : 78,40 EUR
*Einbeh. Steuer : 20,67 EUR
Nettodividende : 78,40 EUR
Endbetrag : 57,73 EUR
"""
print(re.findall(r'\d+(,\d+)?', my_str))        # option 1
print(re.findall(r'\d+,\d+', my_str))           # option 2
print(re.findall(r'[-+]?\d*,\d+|\d+', my_str))  # option 3
Output is:
['', '', '', '', '', '', ',9800', '', '', '', ',40', ',67', ',40', ',73']
['0,9800', '78,40', '20,67', '78,40', '57,73']
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']
6 answers
#1
8
Option 1 is the most suitable of the three regexes, but it does not work as intended because findall returns what is matched by the capture group (), not the complete match.
For example, the first three matches in your example are 18, 04 and 2013; in each case the capture group is unmatched, so an empty string is added to the results list.
The solution is to make the group non-capturing:
r'\d+(?:,\d+)?'
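A minimal sketch of the difference between the two group types. The name `sample` and its shortened contents are assumptions made for illustration, not the full letter:

```python
import re

# Hypothetical shortened sample, not the full bank letter.
sample = "Valuta : 18.04.2013 Bruttodividende : 78,40 EUR"

# Capturing group: findall returns only what the group matched, which is
# an empty string whenever the optional group did not take part in the match.
capturing = re.findall(r'\d+(,\d+)?', sample)

# Non-capturing group: findall returns the complete match.
non_capturing = re.findall(r'\d+(?:,\d+)?', sample)

print(capturing)      # ['', '', '', ',40']
print(non_capturing)  # ['18', '04', '2013', '78,40']
```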
Option 2 falls short only in that it won't match sequences that don't contain a comma.
Option 3 isn't great because it will also match things like +,1.
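To make that concrete, here is a small comparison on a made-up string (the `sample` text is purely hypothetical and only there to contain a stray "+,1"):

```python
import re

# Hypothetical input containing a stray "+,1" next to real numbers.
sample = "a +,1 b 80 c 78,40"

# Option 3 accepts a sign followed directly by a comma, so "+,1" is a match.
option_3 = re.findall(r'[-+]?\d*,\d+|\d+', sample)

# The non-capturing variant of option 1 requires a leading digit,
# so it only picks up the "1" from that fragment.
fixed_option_1 = re.findall(r'\d+(?:,\d+)?', sample)

print(option_3)        # ['+,1', '80', '78,40']
print(fixed_option_1)  # ['1', '80', '78,40']
```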
#2
3
I'd like to extract all the numbers from the following string ...
If by "numbers" you mean both the currency amounts and the dates, I think this will do what you want:
print(re.findall(r'[0-9][0-9,.]+', my_str))
Output:
['18.04.2013', '18.04.2013', '0,9800', '18.04.2013', '78,40', '20,67', '78,40', '57,73']
If by "numbers" you mean only the currency amounts, then use
print(re.findall(r'[0-9]+,[0-9]+', my_str))
Or perhaps better yet,
print(re.findall(r'[0-9]+,[0-9]+ EUR', my_str))
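If only the number is wanted and not the trailing "EUR", a capture group can be put to good use here. This is a sketch, not part of the original suggestion: the `sample` string is hypothetical, and the `\s+` assumes the letter may contain more than one space before the currency.

```python
import re

# Hypothetical shortened sample with irregular spacing before "EUR".
sample = "Valuta : 18.04.2013 Bruttodividende : 78,40 EUR Endbetrag : 57,73   EUR"

# The group captures the amount; "EUR" must follow but is not returned,
# because findall returns the group's text when a pattern has one group.
amounts = re.findall(r'([0-9]+,[0-9]+)\s+EUR', sample)

print(amounts)  # ['78,40', '57,73']
```

The date is skipped because it contains no comma and is not followed by EUR.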
#3
2
Here is a solution that parses the statement and puts the result in a dictionary called bank_statement:
# -*- coding: utf-8 -*-
my_str = """
Dividendengutschrift für inländische Wertpapiere
Depotinhaber : ME
Extag : 18.04.2013 Bruttodividende
Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
Valuta : 18.04.2013
Bruttodividende : 78,40 EUR
*Einbeh. Steuer : 20,67 EUR
Nettodividende : 78,40 EUR
Endbetrag : 57,73 EUR
"""
bank_statement = {}
for line in my_str.split('\n'):
    tokens = line.split()
    it = iter(tokens)
    category = ''
    for token in it:
        if token == ':':
            # Everything collected before the colon is the category name
            category = category.strip(' *')
            # The token right after the colon is the value
            bank_statement[category] = next(it)
            category = ''
        else:
            category += ' ' + token
# bank_statement now has all the values
print('\n'.join('{0:.<18} {1}'.format(k, v)
                for k, v in sorted(bank_statement.items())))
The output of this code:
Bruttodividende... 78,40
Depotinhaber...... ME
Einbeh. Steuer.... 20,67
Endbetrag......... 57,73
Extag............. 18.04.2013
Nettodividende.... 78,40
Valuta............ 18.04.2013
Zahlungstag....... 18.04.2013
pro Stück........ 0,9800
Discussion
- The code scans the statement string line by line.
- It then breaks each line into tokens.
- It scans through the tokens looking for a colon. When one is found, the part before the colon becomes the category and the token after it becomes the value. bank_statement['Extag'], for example, has the value '18.04.2013'.
- Please note that all the values are strings, not numbers, but it is trivial to convert them.
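The conversion mentioned in the last point could look roughly like this. The helper names `parse_amount` and `parse_date` are made up for this sketch, and treating a dot inside an amount as a thousands separator is an assumption about the bank's format:

```python
from datetime import datetime
from decimal import Decimal

def parse_amount(text):
    """Convert a German-formatted amount such as '78,40' to a Decimal."""
    # Assumption: dots are thousands separators and the comma is the decimal mark.
    return Decimal(text.replace('.', '').replace(',', '.'))

def parse_date(text):
    """Convert a date such as '18.04.2013' (day.month.year) to a date object."""
    return datetime.strptime(text, '%d.%m.%Y').date()

# A two-entry stand-in for the dictionary built above.
bank_statement = {'Extag': '18.04.2013', 'Endbetrag': '57,73'}

amount = parse_amount(bank_statement['Endbetrag'])
day = parse_date(bank_statement['Extag'])
print(amount)  # 57.73
print(day)     # 2013-04-18
```

Decimal is used instead of float so that currency values are not subject to binary rounding.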
#4
1
This question is relevant; the following
print(re.findall(r'\d+(?:,\d+)?', my_str))
                       ^^
outputs
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']
Excluding the "dotted" numbers is a little more complicated:
print(re.findall(r'(?<!\d\.)\b\d+(?:,\d+)?\b(?!\.\d)', my_str))
                   ^^^^^^^^^^^            ^^^^^^^^^^
This outputs
['0,9800', '78,40', '20,67', '78,40', '57,73']
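Here is a short sketch of how the two lookarounds behave, including a plain integer like the 80 from the question. The `sample` string and its "Anzahl" field are hypothetical additions for illustration:

```python
import re

pattern = re.compile(
    r'(?<!\d\.)'     # not preceded by "digit + dot" (rejects 04 and 2013)
    r'\b\d+(?:,\d+)?\b'
    r'(?!\.\d)'      # not followed by "dot + digit" (rejects 18 and 04)
)

# Hypothetical sample; "Anzahl : 80" is not in the original letter.
sample = "Valuta : 18.04.2013 Endbetrag : 57,73 EUR Anzahl : 80"

numbers = pattern.findall(sample)
print(numbers)  # ['57,73', '80']
```

The \b anchors keep the regex from starting or stopping in the middle of a run of digits, which would otherwise let it sidestep the lookarounds.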
#5
0
Try this one:
re.findall(r'\d+(?:[\d,.]*\d)', my_str)
This regex requires that the match starts with a digit, continues with any mix of digits, commas and periods, and ends with a digit too.
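A quick check of what this pattern does and does not pick up, on a made-up string (the `sample` value is an assumption for illustration). One side effect worth knowing: since both the leading \d+ and the trailing \d must consume a digit, a lone single digit is never matched.

```python
import re

pattern = r'\d+(?:[\d,.]*\d)'

# Hypothetical sample: a date, an amount, a two-digit and a one-digit number.
sample = "18.04.2013 0,9800 80 7"

found = re.findall(pattern, sample)
print(found)  # ['18.04.2013', '0,9800', '80']
```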
#6
0
Option 2 doesn't match numbers like '18.04.2013' because you are matching '\d+,\d+', which means
digit (one or more) comma digit (one or more)
To parse the numbers in your case I would use
\s(\d+[^\s]+)
which translates to
space (get digit [one or more] get everything != space)
space = \s
get digit = \d
one or more = + (so it becomes \d+)
get everything != space = [^\s]
one or more = + (so it becomes [^\s]+)
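A small sketch of how this pattern behaves in practice. The `sample` string and the fields "Anzahl", "Rest" and "Ref" are invented for illustration:

```python
import re

pattern = r'\s(\d+[^\s]+)'

# Hypothetical sample with a date, an amount, an integer, a single digit
# and a token that merely starts with digits.
sample = "Extag : 18.04.2013 Endbetrag : 57,73 EUR Anzahl : 80 Rest : 7 Ref : 12ab"

# findall returns the group, so the leading whitespace is not included.
found = re.findall(pattern, sample)
print(found)  # ['18.04.2013', '57,73', '80', '12ab']
```

Two things to keep in mind: the pattern needs at least two characters after the whitespace, so the lone 7 is skipped, and [^\s]+ accepts any non-space characters, so 12ab is returned as well.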