
Time: 2022-01-21 16:51:33

I need a regex in Python 2 that matches only horizontal whitespace, not newlines.


\s matches all whitespace, including newlines.


>>> re.sub(r"\s", "", "line 1.\nline 2\n")
'line1.line2'

\h does not work at all.


>>> re.sub(r"\h", "", "line 1.\nline 2\n")
'line 1.\nline 2\n'

[\t ] works, but I am not sure whether I am missing other possible whitespace characters, especially in Unicode, such as \u00A0 (non-breaking space) or \u200A (hair space). There are many more whitespace characters listed at the following link: https://www.cs.tut.fi/~jkorpela/chars/spaces.html

>>> re.sub(r"[\t ]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\xa0\u200a\n'

Do you have any suggestions?


2 solutions



I ended up using [^\S\n] instead of specifying all Unicode white spaces.

>>> re.sub(r"[^\S\n]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\n'

>>> re.sub(r"[\t ]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\xa0\u200a\n'

It works as expected.
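
A minimal sketch of the same idea wrapped in a helper, written so that it should run on both Python 2.7 and Python 3. The name strip_horizontal_ws is made up for illustration. Note that [^\S\n] still matches \r, \v and \f, so a class such as [^\S\r\n] may be a better fit if carriage returns must survive.

```python
import re

# Double negative: "not (non-whitespace or newline)", i.e. any whitespace
# character except "\n". re.UNICODE makes \S Unicode-aware on Python 2;
# on Python 3 that is already the default for str patterns.
HORIZONTAL_WS = re.compile(r"[^\S\n]", re.UNICODE)


def strip_horizontal_ws(text):
    """Remove every whitespace character except newlines (hypothetical helper)."""
    return HORIZONTAL_WS.sub(u"", text)


sample = u"line 1.\nline 2\n\u00A0\u200A\n"
cleaned = strip_horizontal_ws(sample)
print(repr(cleaned))
```

The appeal of the negated class is that it inherits whatever the regex engine considers whitespace, so no explicit list of Unicode spaces has to be maintained.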




If you only want to match actual spaces, try a plain ( )+ (brackets for readability only*). If you want to match spaces and tabs, try [ \t]+ (the + is there so that you also match a sequence of, e.g., 3 space characters).


Now, there are in fact other whitespace characters in Unicode, that's true. You are, however, highly unlikely to encounter any of those in written code, and also pretty unlikely to encounter any of the less common whitespace chars in other texts.


If you want to, you can include \u00A0 (non-breaking space, fairly common in scientific papers and on some websites; this is the HTML &nbsp;), en space \u2002 (&ensp;), em space \u2003 (&emsp;) or thin space \u2009 (&thinsp;).
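
The characters named above all belong to the Unicode category Zs (space separator). As a rough sketch, assuming the standard unicodedata module, the following lists every Zs code point in the Basic Multilingual Plane. The helper name space_separators is hypothetical, and the exact list depends on the Unicode database bundled with the interpreter. Tab is a control character (category Cc), so it does not appear here.

```python
import unicodedata

try:
    to_char = unichr  # Python 2
except NameError:
    to_char = chr     # Python 3


def space_separators(limit=0x10000):
    """Return code points below `limit` whose Unicode category is Zs."""
    return [cp for cp in range(limit)
            if unicodedata.category(to_char(cp)) == "Zs"]


for cp in space_separators():
    # Print each code point with its official Unicode name.
    print("U+%04X %s" % (cp, unicodedata.name(to_char(cp))))
```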

You can find a variety of other Unicode whitespace characters on Wikipedia, but I highly doubt it's necessary to include them. I'd just stick to space, tab and maybe non-breaking space (i.e. [ \t\u00A0]+).
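
A short sketch of that explicit character class in use, here collapsing each run into a single plain space. The helper name collapse_spaces is an assumption for illustration. The pattern is a non-raw unicode literal on purpose: Python 2's re does not understand \u escapes inside a pattern, so the literal has to contain the real characters.

```python
import re

# Space, tab and non-breaking space only. The string literal (not the regex
# engine) turns \t and \u00A0 into real characters, which works on Python 2
# as well as Python 3.
COMMON_HORIZONTAL_WS = re.compile(u"[ \t\u00A0]+")


def collapse_spaces(text):
    """Replace each run of space/tab/NBSP with one plain space (hypothetical helper)."""
    return COMMON_HORIZONTAL_WS.sub(u" ", text)


result = collapse_spaces(u"a \t\u00A0b\nc\u200Ad")
print(repr(result))
```

The hair space \u200A in the sample is left alone, which is the trade-off of an explicit list: anything not named in the class passes through untouched.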


What do you intend to match with \h, anyway? It's not a valid "symbol" in regex, as far as I know.
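
For context, \h is the horizontal-whitespace shorthand in Perl and PCRE, but Python's re module does not implement it. On Python 2 an unknown letter escape such as \h is treated as the literal letter h, which is why the call in the question returned its input unchanged; current Python 3 releases reject the pattern instead. A small sketch that shows whichever behaviour the running interpreter has:

```python
import re

try:
    # Python 2 accepts this and treats \h as a literal "h".
    pattern = re.compile(r"\h")
except re.error:
    # Newer Python 3 releases refuse unknown letter escapes.
    pattern = None

if pattern is None:
    print("this interpreter rejects \\h as a bad escape")
else:
    # Only the letter "h" is removed; spaces and tabs are untouched.
    print(repr(pattern.sub("", "hh a\tb")))
```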




* The site doesn't display spaces at the edge of inline code




