Can a regular expression find repetitions of characters?

My users insert sequences like

________________________
************************
------------------------
♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥

to format documents (don't ask me about my users!), and it looks bad when displaying snippets. How can I remove repetitions of any character? I could add individual filters, but that would be a constant cat-and-mouse game.

Can a regular expression filter these?

3 solutions

#1

Try something like:

(.)\1{5,}

This matches any character followed by 5 or more repetitions of that same character, so a run of at least 6. Remember to escape the \ if your language uses strings for regex patterns!
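A minimal Python sketch of how this pattern could be applied. The helper name strip_runs and the sample string are made up for illustration; the raw-string prefix r is one way to avoid escaping the backslash by hand.

```python
import re

# The raw string keeps the backslash in \1 intact; without the r prefix
# the pattern would have to be written as '(.)\\1{5,}'.
RUN = re.compile(r'(.)\1{5,}')

def strip_runs(text):
    """Delete every run of 6 or more identical characters."""
    return RUN.sub('', text)

sample = "Title\n________________________\nBody text... ok\n"
print(repr(strip_runs(sample)))  # 'Title\n\nBody text... ok\n'
```

Note that . does not match a newline by default, so runs of blank lines are left alone by this version.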

#2

You can remove repetitions of any character with a simple regex like (.)\1+

However, this will catch legitimate uses as well, such as words that have doubled letters in their spelling (balloon, spelling, well, etc).

So you'd probably want to restrict the expression to a set of disallowed characters after all, while keeping it generic enough that you don't have to modify it every time your users find new characters to use.
One possible solution would be to disallow repeated characters that are neither letters nor digits:

([^A-Za-z0-9])\1+
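To see the difference between the two patterns, here is a small Python comparison. The variable names and the sample line are illustrative only.

```python
import re

ANY_DOUBLED = re.compile(r'(.)\1+')
NON_ALNUM_DOUBLED = re.compile(r'([^A-Za-z0-9])\1+')

line = "balloon spelling ****** well"

# Every doubled character is removed, including letters inside words.
print(ANY_DOUBLED.sub('', line))        # ban speing  we

# Only the run of asterisks is removed; the words survive.
print(NON_ALNUM_DOUBLED.sub('', line))  # balloon spelling  well
```

One caveat with this sketch: the negated class also matches whitespace, so doubled spaces and consecutive newlines would be removed as well. Adding \s inside the class, as in [^A-Za-z0-9\s], is one way to exclude them if that is not wanted.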

But even this is not a definitive solution to all the cases, as some of your users may actually decide to use actual letter sequences as delimiters:

ZZZZZZZZZZZZZZZZZZZZZZ
BBBBBBBBBBBBBBBBBBBBBB
ZZZZZZZZZZZZZZZZZZZZZZ

To catch these as well, while still allowing legitimate short runs of non-letter characters (such as the three dots of an ellipsis: ...), you could allow at most 3 repetitions of a character. The syntax is (<pattern>)\1{min,max}, with no space after the comma. Because the group itself consumes the first character, (.)\1{3,} matches offending character sequences with a minimum total length of 4 and an unspecified maximum.
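The off-by-one between the quantifier and the total run length is easy to get wrong, so here is a hedged Python sketch that builds the pattern from the longest run you are willing to tolerate. The function name run_pattern and its max_allowed parameter are hypothetical.

```python
import re

def run_pattern(max_allowed=3):
    """Compile a pattern matching runs longer than max_allowed characters."""
    # The group (.) consumes the first character of the run, so the
    # backreference has to repeat at least max_allowed more times to
    # reach a total length of max_allowed + 1.
    return re.compile(r'(.)\1{%d,}' % max_allowed)

pattern = run_pattern(3)
sample = "Wait... ZZZZZZZZ done...."
print(pattern.sub('', sample))  # Wait...  done
```

The three-dot ellipsis survives, while the run of letters and the four-dot run are both removed.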

#3

In Python (but the logic is the same regardless of the language):

>>> import re
>>> text = '''
... This is some text
... ________________________
... This some more
... ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
... Truly the last line
... '''
>>> print(re.sub(r'[_♥]{2,}', '', text))  # this is the core (regexp)

This is some text

This some more

Truly the last line

This has the advantage that you have some control over what to substitute and what not (for example, you might wish not to substitute . as it could be part of a comment like This is still to do....).

EDIT:

If your repetitions are always "lines" you could add the newline characters to your expression:

>>> text = '''
... This is some text
... ________________________
... This some more
... ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
... Truly the last line
... But this is not to be changed: ♥♥♥
... '''
>>> print(re.sub(r'\n[_♥]{2,}\n', '\n', text))
This is some text
This some more
Truly the last line
But this is not to be changed: ♥♥♥
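If listing the characters by hand turns into the cat-and-mouse game the question mentions, the line-based idea can be combined with a backreference. This is only a sketch under assumptions: the function name drop_separator_lines and the min_len threshold are made up, and it treats any line consisting of a single repeated non-space character as a separator.

```python
import re

def drop_separator_lines(text, min_len=4):
    """Remove whole lines made of one repeated non-space character."""
    # ^ and $ anchor to line boundaries because of re.MULTILINE.
    # (\S) takes the first character, \1{n,} requires min_len - 1 more,
    # [ \t]* tolerates trailing blanks, \n? also removes the line break.
    pattern = re.compile(
        r'^(\S)\1{%d,}[ \t]*$\n?' % (min_len - 1), re.MULTILINE
    )
    return pattern.sub('', text)

text = (
    "This is some text\n"
    "________________________\n"
    "This some more\n"
    "♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥\n"
    "Truly the last line\n"
    "But this is not to be changed: ♥♥♥\n"
)
print(drop_separator_lines(text))
```

Because the match must cover the entire line, the trailing ♥♥♥ in the last line and doubled letters inside words are left untouched.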

HTH
