Removing spaces and newlines from pandas column names using idiomatic Python?

Time: 2021-05-30 22:54:54

I am using the method below to replace all spaces and newline characters in the column headers of a pandas DataFrame.

My question is:

Is there a more efficient way to do this than looping with the list comprehension in the code below?

def headerfiller(df):
    for i in [" ","\n"]:
        df.columns = [c.replace(i,"_") for c in df.columns]

3 Answers

#1


3  

Using str.translate:

>>> tbl = str.maketrans(' \n', '__')
>>> 'a b c\n'.translate(tbl)
'a_b_c_'

try:
    # Python 3.x: both strings passed to maketrans must have the same length
    tbl = str.maketrans(' \n', '__')
except AttributeError:
    # Python 2.x: maketrans lives in the string module
    import string
    tbl = string.maketrans(' \n', '__')

def headerfiller(df):
    df.columns = [c.translate(tbl) for c in df.columns]
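
As a rough illustration of the str.translate approach on plain strings, here is a minimal Python 3 sketch. The names TABLE and clean_headers are made up for this example and are not part of pandas; the returned list could be assigned to df.columns.

```python
# Translation table mapping space and newline to underscore (Python 3 only).
# str.maketrans requires both strings to be the same length.
TABLE = str.maketrans(" \n", "__")


def clean_headers(headers):
    """Return a new list with spaces and newlines replaced by underscores."""
    return [h.translate(TABLE) for h in headers]


headers = ["a\nb", "c d", "e\n f"]
cleaned = clean_headers(headers)
print(cleaned)  # ['a_b', 'c_d', 'e__f']
```

Building the table once and reusing it handles both characters in a single pass over each header, instead of one pass per character.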

Using regular expression substitution:

>>> import re
>>> re.sub(r'[ \n]', '_', 'a b c\n')
'a_b_c_'

import re

def headerfiller(df):
    # The character class [ \n] matches a single space or a single newline
    df.columns = [re.sub(r'[ \n]', '_', c) for c in df.columns]
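
A small sketch of the regex variant with a precompiled pattern, assuming the headers are plain strings. SPACE_OR_NEWLINE and clean_headers_re are hypothetical names chosen for this example. The square brackets matter: without them, the pattern ' \n' only matches a space immediately followed by a newline.

```python
import re

# Character class: matches one space OR one newline.
SPACE_OR_NEWLINE = re.compile(r"[ \n]")


def clean_headers_re(headers):
    """Return a new list with each space or newline replaced by an underscore."""
    return [SPACE_OR_NEWLINE.sub("_", h) for h in headers]


result = clean_headers_re(["a b c\n", "e\n f"])
print(result)  # ['a_b_c_', 'e__f']

# Without the brackets, the pattern is the two-character sequence
# "space then newline", which does not occur in 'a b c\n'.
unchanged = re.sub(r" \n", "_", "a b c\n")
print(repr(unchanged))  # 'a b c\n'
```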

#2


4  

You can use the string methods available on Index objects, in this case df.columns.str.replace(), which lets you do this without looping over the values yourself:

In [23]: df = pd.DataFrame(np.random.randn(3,3), columns=['a\nb', 'c d', 'e\n f'])

In [24]: df.columns
Out[24]: Index([u'a\nb', u'c d', u'e\n f'], dtype='object')

In [25]: df.columns.str.replace(' |\n', '_', regex=True)
Out[25]: Index([u'a_b', u'c_d', u'e__f'], dtype='object')

Because the pattern is a regular expression, spaces and newlines are replaced at the same time. Since pandas 2.0 the regex argument defaults to False, so pass regex=True explicitly. See the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html (written for Series, but the method works the same way on an Index)
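
To actually rename the columns, the result has to be assigned back to df.columns. A minimal sketch, assuming a pandas version that accepts the regex keyword on str.replace (the sample frame is invented for illustration):

```python
import pandas as pd

# Small example frame whose headers contain spaces and newlines.
df = pd.DataFrame([[1, 2, 3]], columns=["a\nb", "c d", "e\n f"])

# str.replace returns a new Index, so assign it back to rename the columns.
# regex=True is passed explicitly because the pattern is a character class.
df.columns = df.columns.str.replace(r"[ \n]", "_", regex=True)

new_names = list(df.columns)
print(new_names)  # ['a_b', 'c_d', 'e__f']
```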

#3


0  

You could split() and '_'.join():

def headerfiller(df):
    df.columns = ['_'.join(c.split()) for c in df.columns]

Note that this drops leading and trailing whitespace (if that matters) and collapses each run of spaces, newlines, or other whitespace into a single "_":

In [26]: "_".join("a  b    c\n\n\n".split())
Out[26]: 'a_b_c'
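
To make the difference from character-by-character replacement concrete, here is a short comparison on the same input. The helper names underscore_join and underscore_replace are invented for this sketch.

```python
def underscore_join(name):
    """split() with no arguments splits on any run of whitespace and drops empty pieces."""
    return "_".join(name.split())


def underscore_replace(name):
    """Replace every single space or newline with its own underscore."""
    return name.replace(" ", "_").replace("\n", "_")


name = " a  b\n"
joined = underscore_join(name)
replaced = underscore_replace(name)
print(repr(joined))    # 'a_b'
print(repr(replaced))  # '_a__b_'
```

Which behaviour is preferable depends on whether two headers that differ only in whitespace need to stay distinguishable after cleaning.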
