Keep only certain characters in a string using Python?

In my program I have a string like this:

ag ct oso gcota

Using Python, my goal is to get rid of the whitespace and keep only the a, t, c, and g characters. I understand how to get rid of the whitespace (I'm just using line = line.replace(" ", "")). But how can I get rid of the characters I don't need when they could be any other letter of the alphabet?

3 solutions

#1

A very elegant and fast way is to use regular expressions:

import re

# Named "line" rather than "str" to avoid shadowing the built-in str type.
line = 'ag ct oso gcota'
line = re.sub('[^atcg]', '', line)

# line is now 'agctgcta'

#2

I might do something like:

chars_i_want = set('atcg')
final_string = ''.join(c for c in start_string if c in chars_i_want)

This is probably the easiest way to do this.
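
As a self-contained illustration of the join approach above, here is a minimal sketch. The helper name keep_chars and the sample input are assumptions made for the example, not part of the original answer.

```python
def keep_chars(text, allowed):
    """Return text with every character not in allowed removed."""
    allowed_set = set(allowed)  # set membership tests are O(1) on average
    return ''.join(c for c in text if c in allowed_set)


start_string = 'ag ct oso gcota'
final_string = keep_chars(start_string, 'atcg')
print(final_string)  # agctgcta
```

The comparison is case-sensitive, so uppercase 'A', 'T', 'C', 'G' would be dropped unless they are added to the allowed characters.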

Another option would be to use str.translate to do the work:

import string
# Python 2 only: str.translate(table, deletechars) with table=None just deletes characters.
chars_to_remove = string.printable.translate(None,'acgt')
final_string = start_string.translate(None,chars_to_remove)
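
The two-argument translate call above only works in Python 2. A rough Python 3 equivalent, assuming str.maketrans is used to build the deletion table, might look like the sketch below; the variable name delete_table and the sample input are assumptions for the example.

```python
import string

# Every printable ASCII character except the four we want to keep.
chars_to_remove = ''.join(c for c in string.printable if c not in 'acgt')

# With two empty strings, the third argument lists the characters to delete.
delete_table = str.maketrans('', '', chars_to_remove)

start_string = 'ag ct oso gcota'
final_string = start_string.translate(delete_table)
print(final_string)  # agctgcta
```

Like the original, this only removes characters found in string.printable; anything outside that set (for example accented letters) passes through unchanged.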

I'm not sure which would perform better. It'd need to be timed via timeit to know definitively.

update: Timings!

import re
import string

def test_re(s,regex=re.compile('[^atgc]')):
    # Caution: the arguments are reversed (see answer #3); regex.sub('', s) was intended.
    return regex.sub(s,'')

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s,chars_to_remove = string.printable.translate(None,'acgt')):
    return s.translate(None,chars_to_remove)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func))

Sadly (for me), regex wins on my machine:

test_re 0.901512145996
test_join1 6.00346088409
test_join2 3.66561293602
translate 1.0741918087

#3

Did people test mgilson's test_re() function before upvoting? The arguments to re.sub() are reversed, so it was doing the substitution in an empty string and always returned an empty string.
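
To make the argument-order problem concrete, here is a small sketch comparing the two calls. The variable names reversed_call and correct_call are assumptions for the example.

```python
import re

regex = re.compile('[^atgc]')
s = 'ag ct oso gcota'

# Pattern.sub(repl, string): the replacement comes first, then the string to search.
reversed_call = regex.sub(s, '')  # searches the empty string, so there is nothing to replace
correct_call = regex.sub('', s)   # deletes every character outside "atgc"

print(repr(reversed_call))  # ''
print(repr(correct_call))   # 'agctgcta'
```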

I work in Python 3.4, where str.translate() takes only one argument, a translation table such as a dict. Because there is overhead in building this dict, I moved it out of the function. To be fair, I also moved the regex compilation out of the function (this didn't make a noticeable difference).

import re
import string

regex=re.compile('[^atgc]')

chars_to_remove = string.printable.translate({ ord('a'): None, ord('c'): None, ord('g'): None, ord('t'): None })
cmap = {}
for c in chars_to_remove:
    cmap[ord(c)] = None

def test_re(s):
    return regex.sub('',s)

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s):
    return s.translate(cmap)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print(func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func)))

Here are the timings:

test_re 3.3141989699797705
test_join1 2.4452173250028864
test_join2 2.081048655003542
translate 1.9390292020107154

It's too bad str.translate() doesn't have an option to control what happens to characters that aren't in the map. The current implementation keeps them, but it would be just as useful to have the option to remove them, for cases where the characters we want to keep are far fewer than the ones we want to remove (oh hello, Unicode).
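
One possible workaround, offered as a sketch rather than a built-in option: str.translate() looks characters up through the table's __getitem__, so a dict subclass whose __missing__ returns None deletes anything it does not list. The class name KeepOnly is hypothetical, and this variant has not been timed against the functions above.

```python
class KeepOnly(dict):
    """Translation table that deletes every character it does not list."""

    def __init__(self, chars_to_keep):
        # Map each kept character's code point to itself.
        super().__init__((ord(c), c) for c in chars_to_keep)

    def __missing__(self, code_point):
        # str.translate() treats None as "delete this character".
        return None


keep_dna = KeepOnly('atcg')
print('ag ct oso gcota'.translate(keep_dna))  # agctgcta
print('gattaca ünïcödé'.translate(keep_dna))  # gattacac
```

Because __missing__ never stores the missing key, the table stays at four entries no matter how many distinct characters pass through it.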
