How to remove repeated characters from a string

Time: 2022-05-10 02:47:11

I have a website which allows users to comment on photos. Of course, users leave comments like:

'OMGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG!!!!!!!!!!!!!!!'

or

'YOU SUCCCCCCCCCCCCCCCCCKKKKKKKKKKKKKKKKKK'

You get it.

Basically, I want to shorten those comments by removing at least most of the excess repeated characters. I'm sure there's a way to do it with Regex; I just can't figure it out.

Any ideas?

7 solutions

#1


9  

Keep in mind that English uses double letters often, so you probably don't want to eliminate them blindly. Here is a regex that removes anything beyond a double.

Regex r = new Regex("(.)(?<=\\1\\1\\1)", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);

var x = r.Replace("YOU SUCCCCCCCCCCCCCCCCCKKKKKKKKKKKKKKKKKK", String.Empty);
// x = "YOU SUCCKK"

var y = r.Replace("OMGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG!!!!!!!!!!!!!!!", String.Empty);
// y = "OMGG!!"

#2


8  

Do you specifically want to shorten the strings in the code, or would it be enough to simply fail validation and present the form to the user again with a validation error? Something like "Too many repeated characters."

If the latter is acceptable, @"(\w)\1{2}" should match characters of 3 or more (interpreted as "repeated" two or more times).

Edit: As @Piskvor pointed out, this will match on exactly 3 characters. It works fine for matching, but not for replacing. His version, @"(\w)\1{2,}", would work better for replacing. However, I'd like to point out that I think replacing wouldn't be the best practice here. Better to just have the form fail validation than to try to scrub the text being submitted, because there likely will be edge cases where you turn otherwise readable (even if unreasonable) text into nonsense.
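The two options discussed in this answer could be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name RepeatCheck, the method names HasExcessRepeats and Squeeze, and the decision to keep two characters of each run are assumptions made for the example. Note that \w does not match punctuation, so a run of "!" is left alone.

```csharp
using System.Text.RegularExpressions;

public static class RepeatCheck
{
    // Pattern from the answer: a word character followed by two or
    // more copies of itself, i.e. a run of three or more.
    private static readonly Regex Runs = new Regex(@"(\w)\1{2,}");

    // Validation route: report whether the text contains such a run.
    public static bool HasExcessRepeats(string text)
    {
        return Runs.IsMatch(text);
    }

    // Replacement route: shrink every such run to two characters
    // (the limit of two is an assumption for this sketch).
    public static string Squeeze(string text)
    {
        return Runs.Replace(text, "$1$1");
    }
}
```

With this sketch, RepeatCheck.HasExcessRepeats("OMGGGG!!!!") is true and RepeatCheck.Squeeze("YOU SUCCCCKKKK") gives "YOU SUCCKK", while ordinary double letters such as those in "book keeper" are not flagged.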

#3


2  

Regex would be overkill. Try this:

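// Requires: using System.Text;
public static string RemoveRepeatedChars(string input, int maxRepeat)
    {
        if (string.IsNullOrEmpty(input)) return input;

        StringBuilder b = new StringBuilder();
        char lastChar = input[0];
        int repeat = 1;        // length of the current run
        b.Append(lastChar);    // the first character is always kept
        for (int i = 1; i < input.Length; i++)
        {
            if (input[i] == lastChar)
            {
                // Same run: keep the character only while under the limit.
                if (++repeat <= maxRepeat) b.Append(input[i]);
            }
            else
            {
                // New run: always keep its first character.
                b.Append(input[i]);
                repeat = 1;
                lastChar = input[i];
            }
        }
        return b.ToString();
    }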

#4


1  

// Distinct() drops every later duplicate (not only consecutive ones); whitespace is filtered out as well.
var nonRepeatedChars = new string(myString.Distinct().Where(c => !char.IsWhiteSpace(c)).ToArray());

#5


0  

Distinct() will remove all duplicates, however it will not see "A" and "a" as the same, obviously.

Console.WriteLine(new string("Asdfasdf".Distinct().ToArray()));

Outputs "Asdfa"
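If "A" and "a" should count as the same character, one possible variation is to group by an upper-cased key and keep the first member of each group. This is a sketch under assumptions: the names DistinctDemo and DistinctIgnoreCase are made up for the example, and like Distinct() it removes duplicates anywhere in the string, not only consecutive runs.

```csharp
using System.Linq;

public static class DistinctDemo
{
    // Keeps the first occurrence of each character, treating upper
    // and lower case as the same character.
    public static string DistinctIgnoreCase(string text)
    {
        return new string(text
            .GroupBy(c => char.ToUpperInvariant(c))
            .Select(g => g.First())
            .ToArray());
    }
}
```

For the input above, DistinctDemo.DistinctIgnoreCase("Asdfasdf") gives "Asdf".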

#6


0  

var test = "OMMMMMGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGMMM";

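// For each distinct character, collapse doubled occurrences until none remain.
test.Distinct().Select(c => c.ToString()).ToList()
        .ForEach(c =>
            {
                while (test.Contains(c + c))
                    test = test.Replace(c + c, c);
            }
        );
// test is now "OMGM"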

#7


0  

Edit : awful suggestion, please don't read, I truly deserve my -1 :)

I found here on technical nuggets something like what you're looking for.

There's nothing to do except a very long regex, because I've never heard about a regex sign for repetition ...

It's a total example, I won't paste it here but I think this will totally answer your question.
