Removing spaces from a txt file with Python

Time: 2021-10-08 03:10:12

I have a .txt file (scraped as pre-formatted text from a website) where the data looks like this:

B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON  
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        

I'd like to remove all the extra spaces between the columns (they are runs of spaces of varying length, not tabs). I'd then like to replace each run with a delimiter (tab or pipe, since there are commas within the data), like so:

ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

Looked around and found that the best options are using regex or shlex to split. Two similar scenarios:

6 solutions

#1


5  

s = """B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON  
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS
"""

import re

# Update: replace each run of two or more spaces between columns with "|"
print(re.sub(r"(\S) {2,}(\S)(\n?)", r"\1|\2\3", s))

Output:

B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON  
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

#2


7  

You can apply the regex '\s{2,}' (two or more whitespace characters) to each line and substitute the matches with a single '|' character.

>>> import re
>>> line = 'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        '
>>> re.sub(r'\s{2,}', '|', line.strip())
'ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS'

Stripping any leading and trailing whitespace from the line before applying re.sub ensures that you won't get '|' characters at the start and end of the line.

Your actual code should look similar to this:

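import re

with open(filename) as f:  # filename: path to your .txt file
    for line in f:
        # raw string so that \s is passed to the regex engine unchanged
        subbed = re.sub(r'\s{2,}', '|', line.strip())
        # do something here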
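
As a rough sketch of what the "do something here" step could be, the loop can be wrapped in a small generator that yields one pipe-delimited string per input line. The name `to_pipe_delimited` and the file names in the comment are made up for illustration, and blank lines are skipped as an assumption about what you want.

```python
import re


def to_pipe_delimited(lines):
    """Yield each non-blank line with runs of 2+ whitespace replaced by '|'."""
    for line in lines:
        stripped = line.strip()
        if stripped:  # assumption: blank lines carry no data and are skipped
            yield re.sub(r"\s{2,}", "|", stripped)


sample = [
    "B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON  \n",
    "ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        \n",
]
converted = list(to_pipe_delimited(sample))
print("\n".join(converted))

# With real files (hypothetical names), the same generator can be used as:
# with open("input.txt") as src, open("output.txt", "w") as dst:
#     dst.writelines(row + "\n" for row in to_pipe_delimited(src))
```

Because an open file is itself an iterable of lines, the generator works the same way on a file object as on the list above.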

#3


6  

What about this?

import re

your_string = 'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS'
print(re.sub(r'\s{2,}', '|', your_string.strip()))

Output:

ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

Explanation:

I've used re.sub(), which takes three parameters: a pattern, the replacement string, and the string you want to work on.

The pattern matches any run of at least two whitespace characters; each such run in your string is replaced with a single |.

#4


3  

Considering there are at least two spaces separating the columns, you can use this:

lines = [
'B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON  ',
'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        '
]

for line in lines:
    parts = []
    for part in line.split('  '):
        part = part.strip()
        if part:  # checking if stripped part is a non-empty string
            parts.append(part)
    print('|'.join(parts))

Output for your input:

B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

#5


3  

It looks like your data is in a "text-table" format.

I recommend using the first row to figure out the start point and length of each column (either by hand or write a script with regex to determine the likely columns), then writing a script to iterate the rows of the file, slice the row into column segments, and apply strip to each segment.
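
One way to script the "use the first row" step, as a minimal sketch only: it assumes the columns are left-aligned, that every field in the sample row is non-empty, and that no value in that row contains two consecutive spaces. The helper name `infer_columns` is made up for illustration, and the rows are rebuilt with `ljust` using widths 34, 16 and 9, which correspond to the column positions this answer uses.

```python
import re


def infer_columns(first_row):
    """Guess (start, end) slice bounds for each column from one sample row."""
    # A field is a run of words separated by single spaces.
    starts = [m.start() for m in re.finditer(r"\S+(?: \S+)*", first_row)]
    # Each column ends where the next one begins; the last runs to end of line.
    return list(zip(starts, starts[1:] + [None]))


# Sample rows rebuilt with explicit column widths.
row1 = ("B, NICKOLAS".ljust(34) + "CT144531X".ljust(16)
        + "D1026".ljust(9) + "JUDGE ANNIE WHITE JOHNSON  ")
row2 = ("ANDREWS VS BALL".ljust(34) + "JA-15-0050".ljust(16)
        + "D0015".ljust(9) + "JUDGE EDWARD A ROBERTS")

cols = infer_columns(row1)
fields = [row2[start:end].strip() for start, end in cols]
print(cols)
print(fields)
```

If the first row happens to have an empty cell, the guess will be wrong, so it is worth eyeballing the inferred positions before trusting them on the whole file.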

If you use a regex, you must keep track of the number of columns and raise an error if any given row has more than the expected number of columns (or a different number than the rest). Splitting on two-or-more spaces will break if a column's value has two-or-more spaces, which is not just entirely possible, but also likely. Text-tables like this aren't designed to be split on a regex, they're designed to be split on the column index positions.
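
If you do go the regex route, the column-count check could look roughly like this. It is a sketch, not a complete parser: `split_row` and `expected_columns` are hypothetical names, and the second sample line has an extra double space deliberately added to trigger the error.

```python
import re


def split_row(line, expected_columns):
    """Split on runs of 2+ spaces and fail loudly on an unexpected column count."""
    fields = re.split(r" {2,}", line.strip())
    if len(fields) != expected_columns:
        raise ValueError(
            f"expected {expected_columns} columns, got {len(fields)}: {line!r}"
        )
    return fields


good = "ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        "
print(split_row(good, 4))

# A value containing two spaces ("JUDGE  EDWARD") is split into an extra column.
bad = "ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE  EDWARD A ROBERTS"
try:
    split_row(bad, 4)
except ValueError as exc:
    print("rejected:", exc)
```

The check catches rows that split into too many or too few pieces, but it cannot detect a row where one column is empty and another contains a double space, which is part of why slicing on fixed positions is safer.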

In terms of saving the data, you can use the csv module to write/read into a csv file. That will let you handle quoting and escaping characters better than specifying a delimiter. If one of your columns has a | character as a value, unless you're encoding the data with a strategy that handles escapes or quoted literals, your output will break on read.
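
To illustrate the quoting point, here is a small round trip through the csv module using an in-memory buffer instead of a real file. The second row is invented purely to show a value that contains the delimiter; it is not from the original data.

```python
import csv
import io

rows = [
    ["B, NICKOLAS", "CT144531X", "D1026", "JUDGE ANNIE WHITE JOHNSON"],
    ["SMITH | JONES", "XX-00-0000", "D0000", "JUDGE EXAMPLE"],  # made-up row
]

# newline="" lets the csv module control line endings itself.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter="|", quotechar='"',
                    quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
written = buffer.getvalue()
print(written)

# Reading back with the same dialect settings recovers the original fields.
buffer.seek(0)
round_tripped = list(csv.reader(buffer, delimiter="|", quotechar='"'))
print(round_tripped == rows)
```

With QUOTE_MINIMAL only the field containing `|` gets wrapped in quotes, so the reader can tell that pipe apart from a column separator.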

Parsing the text above would look something like this (I wrote the nested list comprehension with explicit brackets so it's easier to follow; lines is the list of raw rows):

cols = ((0,34),
        (34, 50),
        (50, 59),
        (59, None),
        )
for line in lines:
    cleaned = [i.strip() for i in [line[s:e] for (s, e) in cols]]
    print(cleaned)

Then you can write it with something like:

import csv
with open('output.csv', 'w', newline='') as csvfile:  # text mode with newline='' in Python 3
    spamwriter = csv.writer(csvfile, delimiter='|',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for line in lines:
        spamwriter.writerow([line[col_start:col_end].strip()
                             for (col_start, col_end) in cols
                             ])

#6


0  

Looks like this library can solve this quite nicely: http://docs.astropy.org/en/stable/io/ascii/fixed_width_gallery.html#fixed-width-gallery

Impressive...
