Strip all non-numeric characters (except for ".") from a string in Python

I've got a pretty good working snippet of code, but I was wondering if anyone has any better suggestions on how to do this:

val = ''.join([c for c in val if c in '1234567890.'])

What would you do?

6 solutions

#1

107

You can use a regular expression (using the re module) to accomplish the same thing. The example below matches runs of [^\d.] (any character that's not a decimal digit or a period) and replaces them with the empty string. Note that if the pattern is compiled with the UNICODE flag, the resulting string could still include non-ASCII digits. Also, the result after removing "non-numeric" characters is not necessarily a valid number.

>>> import re
>>> non_decimal = re.compile(r'[^\d.]+')
>>> non_decimal.sub('', '12.34fe4e')
'12.344'
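
To illustrate the two caveats above, here is a minimal sketch, assuming Python 3 (where \d in a str pattern matches any Unicode decimal digit unless re.ASCII is given). The helper names strip_non_decimal and to_float_or_none are only illustrative and do not come from the original answer.

```python
import re

# Assumption: Python 3. re.ASCII restricts \d to the characters 0-9.
NON_DECIMAL = re.compile(r'[^\d.]+', re.ASCII)

def strip_non_decimal(text):
    """Remove every character that is not an ASCII digit or a period."""
    return NON_DECIMAL.sub('', text)

def to_float_or_none(text):
    """Strip, then try to parse; the stripped string may not be a number."""
    cleaned = strip_non_decimal(text)
    try:
        return float(cleaned)
    except ValueError:
        return None

print(strip_non_decimal('12.34fe4e'))  # 12.344
print(to_float_or_none('v1.2.3'))      # None, because '1.2.3' is not a float
```

Whether to parse the result at all, and what to return on failure, depends on the caller; returning None here is just one possible choice.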

#2

Another 'pythonic' approach

filter( lambda x: x in '0123456789.', s )

but regex is faster.
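
A minimal sketch of the same idea, assuming Python 3, where filter returns an iterator rather than a string, so the result has to be joined back together. The function name keep_decimal_chars is illustrative only.

```python
def keep_decimal_chars(s):
    """Keep only the characters '0'-'9' and '.' from s."""
    allowed = '0123456789.'
    # filter() yields the matching characters lazily; join rebuilds a str.
    return ''.join(filter(lambda ch: ch in allowed, s))

print(keep_decimal_chars('12.34fe4e'))  # 12.344
```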

#3

Here's some sample code:

$ cat a.py
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join([c for c in a if c in '1234567890.'])

$ cat b.py
import re

non_decimal = re.compile(r'[^\d.]+')

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    non_decimal.sub('', a)

$ cat c.py
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join([c for c in a if c.isdigit() or c == '.'])

$ cat d.py
a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    b = []
    for c in a:
        if not (c.isdigit() or c == '.'): continue
        b.append(c)

    ''.join(b)

And the timing results:

$ time python a.py
real    0m24.735s
user    0m21.049s
sys     0m0.456s

$ time python b.py
real    0m10.775s
user    0m9.817s
sys     0m0.236s

$ time python c.py
real    0m38.255s
user    0m32.718s
sys     0m0.724s

$ time python d.py
real    0m46.040s
user    0m41.515s
sys     0m0.832s

Looks like the regex is the winner so far.

Personally, I find the regex just as readable as the list comprehension. If you're doing it just a few times, then compiling the regex will probably be the bigger cost. Do what jibes with your code and coding style.
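
Rather than timing whole scripts with time, the comparison could also be run in-process with the standard timeit module. The following is a sketch assuming Python 3; the function names and the iteration count are my own choices, and no timings are shown because they vary by machine and interpreter version.

```python
import re
import timeit

SAMPLE = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
NON_DECIMAL = re.compile(r'[^\d.]+')

def with_membership(s):
    # Same approach as a.py: membership test against a literal string.
    return ''.join([c for c in s if c in '1234567890.'])

def with_regex(s):
    # Same approach as b.py: precompiled regex substitution.
    return NON_DECIMAL.sub('', s)

def with_isdigit(s):
    # Same approach as c.py: str.isdigit() plus an explicit '.' check.
    return ''.join([c for c in s if c.isdigit() or c == '.'])

CANDIDATES = [with_membership, with_regex, with_isdigit]

def run_benchmark(number=100_000):
    """Return {function name: seconds taken for `number` calls on SAMPLE}."""
    return {
        func.__name__: timeit.timeit(lambda: func(SAMPLE), number=number)
        for func in CANDIDATES
    }

if __name__ == '__main__':
    for name, seconds in run_benchmark().items():
        print(f'{name}: {seconds:.3f}s')
```

Timing inside one process leaves interpreter start-up out of the measurement, which matters when the work being compared is small.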

#4

My solution is simpler, using a regex:

import re
re.sub(r"[^0-9.]", "", data)

#5

import string
filter(lambda c: c in string.digits + '.', s)

#6

If the set of characters were larger, using sets as below might be faster. As it is, this is a bit slower than a.py.

dec = set('1234567890.')

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join(ch for ch in a if ch in dec)

At least on my system, you can save a tiny bit of time (and memory if your string were long enough to matter) by using a generator expression instead of a list comprehension in a.py:

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    ''.join(c for c in a if c in '1234567890.')

Oh, and here's the fastest way I've found by far on this test string (much faster than regex) if you are doing this many, many times and are willing to put up with the overhead of building a couple of character tables.

chrs = ''.join(chr(i) for i in xrange(256))
deletable = ''.join(ch for ch in chrs if ch not in '1234567890.')

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
for i in xrange(1000000):
    a.translate(chrs, deletable)

On my system, that runs in ~1.0 seconds where the regex b.py runs in ~4.3 seconds.
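
The two-argument translate(table, deletechars) call above is the Python 2 str form. A rough Python 3 counterpart, sketched under the assumption that a mapping returning None for unwanted code points is acceptable, might look like this (the class name KeepOnly is hypothetical, and I have not measured how its speed compares with the regex):

```python
class KeepOnly(dict):
    """Translation table that keeps the given characters and deletes the rest."""

    def __init__(self, keep):
        # Map each kept code point to itself.
        super().__init__((ord(ch), ch) for ch in keep)

    def __missing__(self, code_point):
        # str.translate deletes any character whose lookup returns None.
        return None

DECIMAL_ONLY = KeepOnly('1234567890.')

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
print(a.translate(DECIMAL_ONLY))  # 2789382289348.298222
```

Because unknown code points are handled by __missing__, the table covers all of Unicode without listing every character to delete.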
