How to split a Python string on newline characters

Time: 2022-08-22 12:54:43

In Python 3 on Windows 7, I read a web page into a string.

I then want to split the string into a list at newline characters.

I can't type a literal newline into my code as the argument to split(), because I get the syntax error 'EOL while scanning string literal'.

If I type in the characters \ and n, I get a Unicode error.

Is there any way to do it?

2 answers

#1


24  

✨ Splitting lines in Python:

Have you tried using the str.splitlines() method?

From the docs:

str.splitlines([keepends])

Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true.

For example:

>>> 'Line 1\n\nLine 3\rLine 4\r\n'.splitlines()
['Line 1', '', 'Line 3', 'Line 4']

>>> 'Line 1\n\nLine 3\rLine 4\r\n'.splitlines(True)
['Line 1\n', '\n', 'Line 3\r', 'Line 4\r\n']

Which delimiters are considered?

This method uses the universal newlines approach to splitting lines.

The main difference between Python 2.X and Python 3.X is this: in Python 2.X, splitlines() on 8-bit strings uses the universal newlines approach, so "\r", "\n", and "\r\n" are treated as line boundaries. In Python 3.X, str.splitlines() uses a superset of those boundaries that also includes:

  • \v or \x0b: Line Tabulation (added in Python 3.2).
  • \f or \x0c: Form Feed (added in Python 3.2).
  • \x1c: File Separator.
  • \x1d: Group Separator.
  • \x1e: Record Separator.
  • \x85: Next Line (C1 Control Code).
  • \u2028: Line Separator.
  • \u2029: Paragraph Separator.
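
To make the list above concrete, here is a minimal sketch of how the extra boundaries affect the result. The sample string and the helper name split_classic_newlines are made up for illustration; the helper is just one possible way to split only on "\r\n", "\r", and "\n" when the wider Python 3 set is not wanted.

```python
import re

# Hypothetical sample containing a form feed, a U+2028 line separator and \r\n.
text = "a\x0cb\u2028c\r\nd"

# str.splitlines() breaks at every boundary in the list above.
print(text.splitlines())

# bytes.splitlines() only breaks at \n, \r and \r\n.
print(text.encode("utf-8").splitlines())


def split_classic_newlines(s):
    """Hypothetical helper: split only on \\r\\n, \\r or \\n."""
    return re.split(r"\r\n|\r|\n", s)


print(split_classic_newlines(text))
```

Note that, like str.split(), the re.split() based helper keeps a trailing empty string when the input ends with a newline.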

splitlines vs. split:

Unlike str.split() when a delimiter string sep is given, this method returns an empty list for the empty string, and a terminal line break does not result in an extra line:

>>> ''.splitlines()
[]

>>> 'Line 1\n'.splitlines()
['Line 1']

While str.split('\n') returns:

>>> ''.split('\n')
['']

>>> 'Line 1\n'.split('\n')
['Line 1', '']

✂️ Removing additional whitespace:

If you also need to remove additional leading or trailing whitespace, like spaces, that are ignored by str.splitlines(), you could use str.splitlines() together with str.strip():

>>> [line.strip() for line in 'Line 1  \n  \nLine 3 \rLine 4 \r\n'.splitlines()]
['Line 1', '', 'Line 3', 'Line 4']

Removing empty strings (''):

Lastly, if you want to filter out the empty strings from the resulting list, you could use filter():

>>> # Python 2.X:
>>> filter(bool, 'Line 1\n\nLine 3\rLine 4\r\n'.splitlines())
['Line 1', 'Line 3', 'Line 4']

>>> # Python 3.X:
>>> list(filter(bool, 'Line 1\n\nLine 3\rLine 4\r\n'.splitlines()))
['Line 1', 'Line 3', 'Line 4']
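
As an alternative sketch, stripping and filtering can be combined in a single list comprehension, which behaves the same on Python 2.X and 3.X and also drops lines that contain only whitespace. The variable names text and non_blank are illustrative only.

```python
# Hypothetical sample with trailing spaces and a whitespace-only line.
text = 'Line 1  \n  \nLine 3 \rLine 4 \r\n'

# Strip each line and keep it only if something is left after stripping.
non_blank = [line.strip() for line in text.splitlines() if line.strip()]

print(non_blank)  # ['Line 1', 'Line 3', 'Line 4']
```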

Additional comment regarding the original question:

As the error you posted indicates, and as Burhan suggested, the problem comes from print. A related question may be useful to you: UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function
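
One possible workaround, sketched here under the assumption that the error is raised because the console encoding cannot represent some characters of the page, is to replace those characters before printing. The function names console_safe and safe_print are hypothetical, not part of any library.

```python
import sys


def console_safe(text, encoding):
    """Hypothetical helper: replace characters that `encoding` cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


def safe_print(text):
    """Print text using the console encoding, falling back to UTF-8 if it is unknown."""
    encoding = sys.stdout.encoding or "utf-8"
    print(console_safe(text, encoding))


# With a cp1252 console, the arrow cannot be encoded and becomes '?'.
print(console_safe("caf\u00e9 \u2192 ok", "cp1252"))
safe_print("Line 1\nLine 2")
```

This only changes what is displayed; the original string, and the list produced by splitting it, are left untouched.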

#2


1  

a.txt

this is line 1
this is line 2

code:

Python 3.4.0 (default, Mar 20 2014, 22:43:40) 
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> file = open('a.txt').read()
>>> file
'this is line 1\nthis is line 2\n'
>>> file.split('\n')
['this is line 1', 'this is line 2', '']

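I'm on Linux, but this should also work on Windows: a file opened in text mode has its \r\n line endings translated to \n when it is read. If the string still contains \r\n (for example, text decoded from a web response), split on '\r\n' instead, or use splitlines().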
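
A small sketch of the difference, assuming a string that still carries Windows-style line endings. The variable page is a made-up stand-in for the downloaded web page.

```python
# Hypothetical page content with \r\n line endings left in place.
page = "first line\r\nsecond line\r\n"

# Splitting on "\n" leaves a stray "\r" at the end of each line.
print(page.split("\n"))    # ['first line\r', 'second line\r', '']

# Splitting on "\r\n" removes it but keeps a trailing empty string.
print(page.split("\r\n"))  # ['first line', 'second line', '']

# splitlines() handles both line-ending styles and drops the trailing empty string.
print(page.splitlines())   # ['first line', 'second line']
```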
