Regex to match horizontal whitespace

Time: 2022-01-21 16:51:33

I need a regex in Python 2 that matches only horizontal whitespace, not newlines.

\s matches all whitespace, including newlines.

>>> re.sub(r"\s", "", "line 1.\nline 2\n")
'line1.line2'

\h does not work at all.

>>> re.sub(r"\h", "", "line 1.\nline 2\n")
'line 1.\nline 2\n'

[\t ] works, but I am not sure whether I am missing other possible whitespace characters, especially in Unicode, such as \u00A0 (non-breaking space) or \u200A (hair space). Many more whitespace characters are listed at the following link: https://www.cs.tut.fi/~jkorpela/chars/spaces.html

>>> re.sub(r"[\t ]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\xa0\u200a\n'

Do you have any suggestions?

2 solutions

#1


3  

I ended up using [^\S\n] instead of specifying all Unicode white spaces.

>>> re.sub(r"[^\S\n]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\n'

>>> re.sub(r"[\t ]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\xa0\u200a\n'

It works as expected.
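
A note on why this works: [^\S\n] is a double negative. It matches any character that is neither non-whitespace nor a newline, which leaves whitespace other than \n. One caveat is that it still matches \r and the other line-break characters that \s covers. The sketch below excludes those as well. It assumes Python 3 (where str patterns are Unicode-aware without flags) rather than the Python 2 used above, and the name HORIZONTAL_WS is only illustrative.

```python
import re

# Whitespace that is not a line break: \S is excluded by the negation,
# and the listed line-break characters are excluded explicitly.
# HORIZONTAL_WS is a hypothetical name used for illustration.
HORIZONTAL_WS = re.compile(r"[^\S\n\r\v\f\x85\u2028\u2029]")

text = "line 1.\r\nline 2\n\u00A0\u200A\n"
stripped = HORIZONTAL_WS.sub("", text)
print(repr(stripped))
```

Whether the extra exclusions are needed depends on the input; for text that only uses \n as a line break, the shorter [^\S\n] behaves the same.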

#2


0  

If you only want to match actual spaces, try a plain ( )+ (brackets for readability only*). If you want to match spaces and tabs, try [ \t]+ (the + is there so that you also match a run of, e.g., three space characters).

There are in fact other whitespace characters in Unicode, that's true. However, you are highly unlikely to encounter any of them in source code, and fairly unlikely to encounter the less common ones in other text.

If you want to, you can include \u00A0 (non-breaking space, fairly common in scientific papers and on some websites; this is the HTML &nbsp;), en space \u2002 (&ensp;), em space \u2003 (&emsp;) or thin space \u2009 (&thinsp;).
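
If you would rather see exactly which characters you are including, one option is to enumerate the Unicode "space separator" category (Zs) and build the character class from it. This is a sketch assuming Python 3; the names space_separators and explicit_ws are made up for illustration, and the exact set depends on the Unicode version shipped with your interpreter.

```python
import re
import sys
import unicodedata

# Collect every code point in Unicode category Zs (space separators).
space_separators = [
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
]

# Build a character class from those characters plus tab
# (tab is a control character, so it is not part of Zs).
explicit_ws = re.compile(
    "[\t" + "".join(re.escape(c) for c in space_separators) + "]"
)

print(["U+%04X" % ord(c) for c in space_separators])
cleaned = explicit_ws.sub("", "a\u00A0b\u2002c\u2003d\u2009e\tf\ng")
print(repr(cleaned))
```

Newlines are not in Zs, so they are left alone.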

You can find a variety of other Unicode whitespace characters on Wikipedia, but I highly doubt it's necessary to include them. I'd just stick to space, tab and maybe non-breaking space (i.e. [ \t\u00A0]+).
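
To illustrate what the + changes, here is a small sketch (assuming Python 3; the sample string and the variable names are made up) that replaces matches with an underscore so the difference is visible:

```python
import re

text = "a \t\u00A0 b\nc"

# Without +, every matched character is replaced individually.
per_char = re.sub(r"[ \t\u00A0]", "_", text)
# With +, a whole run of matched characters is replaced once.
per_run = re.sub(r"[ \t\u00A0]+", "_", text)

print(repr(per_char))
print(repr(per_run))
```

When the replacement is the empty string, both patterns give the same result; the + only matters when you collapse runs into something.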

What do you intend to match with \h, anyway? It's not a valid "symbol" in regex, as far as I know.
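
For context, \h is the horizontal-whitespace class in Perl and PCRE; Python's re module does not implement it. As far as I can tell, Python 2 treats the unknown escape as a literal h, which would explain why the substitution in the question changed nothing, while Python 3.6 and later reject the pattern. A small check, assuming Python 3.6+ (the variable name supported is only illustrative):

```python
import re

try:
    re.compile(r"\h")
    supported = True
except re.error as exc:
    # Unknown escapes made of a backslash and an ASCII letter are errors.
    supported = False
    print("re rejected the pattern:", exc)
```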

* Spaces at the edge of inline code are not displayed.
