Python: Extracting numbers from a string

Time: 2022-09-13 09:27:24

I would like to extract all the numbers contained in a string. Which is better suited for the purpose: regular expressions or the isdigit() method?

Example:

line = "hello 12 hi 89"

Result:

[12, 89]

11 answers

#1


284  

If you want to extract only positive integers, try the following:

>>> text = "h3110 23 cat 444.4 rabbit 11 2 dog"
>>> [int(s) for s in text.split() if s.isdigit()]
[23, 11, 2]

I would argue that this is better than the regex example for three reasons. First, you don't need another module; secondly, it's more readable because you don't need to parse the regex mini-language; and third, it is faster (and thus likely more pythonic):

python -m timeit -s "str = 'h3110 23 cat 444.4 rabbit 11 2 dog' * 1000" "[s for s in str.split() if s.isdigit()]"
100 loops, best of 3: 2.84 msec per loop

python -m timeit -s "import re" "str = 'h3110 23 cat 444.4 rabbit 11 2 dog' * 1000" "re.findall('\\b\\d+\\b', str)"
100 loops, best of 3: 5.66 msec per loop

This will not recognize floats, negative integers, or integers in hexadecimal format. If you can't accept these limitations, slim's answer below will do the trick.
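To make those limitations concrete, here is a minimal sketch of the split/isdigit approach wrapped in a function. The name positive_ints and the second sample string are made up for illustration; they are not part of the original answer.

```python
def positive_ints(text):
    """Return whitespace-delimited tokens made up solely of digits, as ints."""
    return [int(tok) for tok in text.split() if tok.isdigit()]


# The example from the answer above.
print(positive_ints("h3110 23 cat 444.4 rabbit 11 2 dog"))  # [23, 11, 2]

# Negative numbers, hex literals, floats and tokens with trailing
# punctuation are all skipped, because isdigit() is False for them.
print(positive_ints("-5 0x1F 3.5 7, 8"))  # [8]
```

Note that "7," is dropped as well: a token has to consist of digits only, so punctuation glued to a number hides it.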

#2


280  

I'd use a regexp:

>>> import re
>>> re.findall(r'\d+', 'hello 42 I\'m a 32 string 30')
['42', '32', '30']

This would also match the 42 in bla42bla. If you only want numbers delimited by word boundaries (space, period, comma), you can use \b:

>>> re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string 30')
['42', '32', '30']

To end up with a list of numbers instead of a list of strings:

>>> [int(s) for s in re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string 30')]
[42, 32, 30]
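If negative integers matter, one possible variation is to allow an optional minus sign and use a lookbehind instead of the leading \b. This is only a sketch: the pattern and the name signed_ints are my own assumptions, not something from the answer above.

```python
import re

# Assumed pattern: an optional minus sign followed by digits, where the
# match must not start right after a word character and must end at a
# word boundary.
SIGNED_INT_RE = re.compile(r"(?<!\w)-?\d+\b")


def signed_ints(text):
    """Return standalone integers in text, keeping a leading minus sign."""
    return [int(m) for m in SIGNED_INT_RE.findall(text)]


print(signed_ints("he33llo 42 I'm a 32 string -30"))  # [42, 32, -30]
```

Like the \b version, this still splits a decimal such as 444.4 into two separate integers, so it is only suitable when the input contains whole numbers.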

#3


68  

This is more than a bit late, but you can extend the regex to account for scientific notation too.

import re

# Format is [(<string>, <expected output>), ...]
ss = [("apple-12.34 ba33na fanc-14.23e-2yapple+45e5+67.56E+3",
       ['-12.34', '33', '-14.23e-2', '+45e5', '+67.56E+3']),
      ('hello X42 I\'m a Y-32.35 string Z30',
       ['42', '-32.35', '30']),
      ('he33llo 42 I\'m a 32 string -30', 
       ['33', '42', '32', '-30']),
      ('h3110 23 cat 444.4 rabbit 11 2 dog', 
       ['3110', '23', '444.4', '11', '2']),
      ('hello 12 hi 89', 
       ['12', '89']),
      ('4', 
       ['4']),
      ('I like 74,600 commas not,500', 
       ['74,600', '500']),
      ('I like bad math 1+2=.001', 
       ['1', '+2', '.001'])]

for s, r in ss:
    rr = re.findall(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?", s)
    if rr == r:
        print('GOOD')
    else:
        print('WRONG', rr, 'should be', r)

This prints GOOD for every test case.

Additionally, you can look at the AWS Glue built-in regex
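The regex above returns strings. A rough sketch of turning them into floats could look like the following; the names NUMBER_RE and to_floats are hypothetical, and stripping the thousands separators before calling float() is my own assumption about what you would want.

```python
import re

# Same pattern as in the answer above, written as a raw string.
NUMBER_RE = re.compile(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?")


def to_floats(text):
    """Find number-like substrings and convert them to floats."""
    values = []
    for match in NUMBER_RE.findall(text):
        try:
            # float() does not accept thousands separators, so drop them.
            values.append(float(match.replace(",", "")))
        except ValueError:
            # The pattern can match odd strings such as '.5.'; skip those.
            continue
    return values


print(to_floats("I like 74,600 commas not,500"))  # [74600.0, 500.0]
print(to_floats("apple-12.34 ba33na"))            # [-12.34, 33.0]
```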

#4


51  

I'm assuming you want floats, not just integers, so I'd do something like this:

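s = "h3110 23 cat 444.4 rabbit 11 2 dog"  # sample input, assumed for illustration
numbers = []
for t in s.split():
    try:
        numbers.append(float(t))
    except ValueError:
        pass  # token is not a valid float literal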

Note that some of the other solutions posted here don't work with negative numbers:

>>> re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string -30')
['42', '32', '30']

>>> '-3'.isdigit()
False
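For comparison, here is the try/float approach from this answer as a small function, run on the same kind of input. The name extract_floats and the second sample string are made up for the example.

```python
def extract_floats(text):
    """Return every whitespace-delimited token that float() can parse."""
    numbers = []
    for token in text.split():
        try:
            numbers.append(float(token))
        except ValueError:
            continue  # not a number, move on to the next token
    return numbers


print(extract_floats("he33llo 42 I'm a 32 string -30"))  # [42.0, 32.0, -30.0]
print(extract_floats("temp -3.5 and 1e3 or 89,"))        # [-3.5, 1000.0]
```

Two things to keep in mind: a token with trailing punctuation such as "89," is rejected, and float() also accepts the words "nan", "inf" and "infinity", so those would end up in the result if they appear in the text.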

#5


27  

If you know there will be only one number in the string, e.g. 'hello 12 hi', you can try filter.

For example:

In [1]: int(''.join(filter(str.isdigit, '200 grams')))
Out[1]: 200
In [2]: int(''.join(filter(str.isdigit, 'Counters: 55')))
Out[2]: 55
In [3]: int(''.join(filter(str.isdigit, 'more than 23 times')))
Out[3]: 23

But be careful:

In [4]: int(''.join(filter(str.isdigit, '200 grams 5')))
Out[4]: 2005
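If what you actually want is the first number in the string, one possible alternative is re.search, which stops at the first run of digits instead of gluing all digits together. This is a sketch with a made-up helper name (first_int), not part of the answer above.

```python
import re


def first_int(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(first_int("200 grams 5"))     # 200
print(first_int("no digits here"))  # None
```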

#6


5  

I am amazed to see that no one has yet mentioned the usage of itertools.groupby as an alternative to achieve this.

You can use itertools.groupby() together with str.isdigit() to extract numbers from a string:

from itertools import groupby
my_str = "hello 12 hi 89"

l = [int(''.join(i)) for is_digit, i in groupby(my_str, str.isdigit) if is_digit]

The value held by l will be:

[12, 89]
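One behaviour worth illustrating: groupby works character by character, so it picks up every run of digits, including digits embedded in words or on either side of a decimal point. A quick sketch (the function name digit_runs is hypothetical):

```python
from itertools import groupby


def digit_runs(text):
    """Return each maximal run of consecutive digit characters as an int."""
    return [
        int("".join(group))
        for is_digit, group in groupby(text, str.isdigit)
        if is_digit
    ]


print(digit_runs("hello 12 hi 89"))      # [12, 89]
print(digit_runs("h3110 23 cat 444.4"))  # [3110, 23, 444, 4]
```

So unlike the split-based answer, "h3110" contributes 3110 here, and 444.4 is split into 444 and 4.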

PS: This is just for illustration, to show that groupby can be used as an alternative. It is not a recommended solution. If you need to do this, you should use fmark's accepted answer, which filters with str.isdigit in a list comprehension.

#7


4  

This answer also covers the case where the number in the string is a float.

def get_first_nbr_from_str(input_str):
    '''
    :param input_str: strings that contains digit and words
    :return: the number extracted from the input_str
    demo:
    'ab324.23.123xyz': 324.23
    '.5abc44': 0.5
    '''
    if not input_str or not isinstance(input_str, str):
        return 0
    out_number = ''
    for ele in input_str:
        if (ele == '.' and '.' not in out_number) or ele.isdigit():
            out_number += ele
        elif out_number:
            break
    try:
        return float(out_number)
    except ValueError:  # no digits found (out_number is '' or '.')
        return 0

#8


2  

Since none of these dealt with the real-world financial numbers in Excel and Word docs that I needed to find, here is my variation. It handles ints, floats, negative numbers, and currency numbers (because it doesn't rely on split), and has the option to drop the decimal part and return just ints, or to return everything.

It also handles the Indian lakh numbering system, where commas appear irregularly rather than every three digits.

It does not handle scientific notation, and negative numbers written inside parentheses (as in budgets) will appear positive.

It also does not extract dates. There are better ways for finding dates in strings.

import re

def find_numbers(string, ints=True):
    # Optional leading minus, digits with embedded commas, optional decimal part.
    numexp = re.compile(r'-?\d[\d,]*\.?\d*')
    numbers = [x.replace(',', '') for x in numexp.findall(string)]
    if ints:
        # Drop the decimal part and convert to int.
        return [int(x.split('.')[0]) for x in numbers]
    else:
        return numbers
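For the parenthesized negatives mentioned above, a possible extension is sketched below. Everything here is an assumption on my part: the names AMOUNT_RE and find_amounts are made up, and treating every parenthesized number as negative is an accounting convention that will misfire on things like "(2021)".

```python
import re

# Hypothetical pattern: either a number wrapped in parentheses, or a plain
# number with an optional leading minus. Commas may appear anywhere inside
# the integer part, which also covers lakh-style grouping.
AMOUNT_RE = re.compile(r"\(\s*(\d[\d,]*\.?\d*)\s*\)|(-?\d[\d,]*\.?\d*)")


def find_amounts(text):
    """Return amounts as floats, treating (1,200) as -1200.0."""
    amounts = []
    for parenthesized, plain in AMOUNT_RE.findall(text):
        if parenthesized:
            amounts.append(-float(parenthesized.replace(",", "")))
        else:
            amounts.append(float(plain.replace(",", "")))
    return amounts


print(find_amounts("Income 1,23,456.50 less costs (4,500) and -20"))
# [123456.5, -4500.0, -20.0]
```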

#9


1  

# extract numbers from garbage string:
s = '12//n,_@#$%3.14kjlw0xdadfackvj1.6e-19&*ghn334'
newstr = ''.join((ch if ch in '0123456789.-e' else ' ') for ch in s)
listOfNumbers = [float(i) for i in newstr.split()]
print(listOfNumbers)
[12.0, 3.14, 0.0, 1.6e-19, 334.0]
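A word of caution on the snippet above: because 'e', '-' and '.' are kept, ordinary text containing those characters leaves fragments like 'e' or '-' behind, and float() raises ValueError on them. A slightly more forgiving sketch (the function name floats_from_garbage is made up) simply skips such fragments:

```python
def floats_from_garbage(s):
    """Keep only number-like characters, then parse whatever tokens survive."""
    cleaned = "".join(ch if ch in "0123456789.-e" else " " for ch in s)
    numbers = []
    for token in cleaned.split():
        try:
            numbers.append(float(token))
        except ValueError:
            pass  # stray 'e', '-' or '.' fragments are not numbers
    return numbers


print(floats_from_garbage("12//n,_@#$%3.14kjlw0xdadfackvj1.6e-19&*ghn334"))
# [12.0, 3.14, 0.0, 1.6e-19, 334.0]
print(floats_from_garbage("hello 12 - see 3.5"))
# [12.0, 3.5]
```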

#10


0  

@jmnas, I liked your answer, but it didn't find floats. I'm working on a script to parse code going to a CNC mill and needed to find both X and Y dimensions, which can be integers or floats, so I adapted your code to the following. It finds ints and floats, both positive and negative. It still doesn't find hex-formatted values; you could add "x" and "A" through "F" to the num_char tuple to collect strings like '0x23AC', though float() cannot parse those, so the conversion would also need int(num, 16).

s = 'hello X42 I\'m a Y-32.35 string Z30'
xy = ("X", "Y")
num_char = (".", "+", "-")

l = []

tokens = s.split()
for token in tokens:

    if token.startswith(xy):
        num = ""
        for char in token:
            # print(char)
            if char.isdigit() or (char in num_char):
                num = num + char

        try:
            l.append(float(num))
        except ValueError:
            pass

print(l)
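The same idea can probably be expressed more compactly with a regex that captures the axis letter together with its value. This is only a sketch under my own assumptions: the pattern, the name axis_values, and the second sample line are not from the code above, and a repeated axis letter simply overwrites the earlier value.

```python
import re

# Assumed pattern: an axis letter X or Y immediately followed by a signed
# integer or decimal number.
AXIS_RE = re.compile(r"([XY])([-+]?\d*\.?\d+)")


def axis_values(line):
    """Map each axis letter found in line to its numeric value."""
    return {axis: float(value) for axis, value in AXIS_RE.findall(line)}


print(axis_values("hello X42 I'm a Y-32.35 string Z30"))
# {'X': 42.0, 'Y': -32.35}
print(axis_values("G1 X10.5 Y-3 F1500"))
# {'X': 10.5, 'Y': -3.0}
```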

#11


0  

The best option I found is below. It extracts a single number by discarding every non-digit character.

def extract_nbr(input_str):
    if input_str is None or input_str == '':
        return 0

    out_number = ''
    for ele in input_str:
        if ele.isdigit():
            out_number += ele
    return float(out_number) if out_number else 0
