How do I remove control characters from a Java string?

Time: 2020-12-01 20:49:37

I have a string coming from the UI that may contain control characters, and I want to remove all control characters except carriage returns, line feeds, and tabs.

Right now I can find two ways to remove all control characters:

1. Using Guava:

return CharMatcher.JAVA_ISO_CONTROL.removeFrom(string);

2. Using regex:

return string.replaceAll("\\p{Cntrl}", "");

6 answers

#1


19  

You can do something like this if you want to delete all characters in the Unicode control category (Cc, a subcategory of "Other"):

System.out.println(
    "a\u0000b\u0007c\u008fd".replaceAll("\\p{Cc}", "")
); // abcd

Note: this removes (among others) the actual Unicode character '\u008f' from the string, not the escaped text "%8F".

Courtesy: polygenelubricants (Replace Unicode Control Characters)
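
The difference between the question's \p{Cntrl} and this answer's \p{Cc} can be seen on the same sample string. The sketch below is only an illustration: the class and method names are made up for this example, and it assumes the default regex flags, under which \p{Cntrl} is the ASCII-only POSIX class.

```java
public class ControlClassDemo {
    // \p{Cntrl} is the POSIX class: by default it matches only U+0000 to U+001F and U+007F.
    static String stripAsciiControls(String s) {
        return s.replaceAll("\\p{Cntrl}", "");
    }

    // \p{Cc} is the Unicode "Control" category: it also matches the C1 range U+0080 to U+009F.
    static String stripUnicodeControls(String s) {
        return s.replaceAll("\\p{Cc}", "");
    }

    public static void main(String[] args) {
        String input = "a\u0000b\u0007c\u008fd";
        // Print lengths rather than raw strings so the invisible character is easy to spot.
        System.out.println(stripAsciiControls(input).length());   // 5: U+008F is still present
        System.out.println(stripUnicodeControls(input).length()); // 4: "abcd"
    }
}
```

Note that both variants also remove carriage returns, line feeds, and tabs, so neither meets the question's requirement without an exclusion.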

#2


13  

One option is to use a combination of CharMatchers:

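import com.google.common.base.CharMatcher;
// Note: newer Guava releases expose this matcher as CharMatcher.javaIsoControl() instead of the JAVA_ISO_CONTROL constant.
CharMatcher charsToPreserve = CharMatcher.anyOf("\r\n\t");
CharMatcher allButPreserved = charsToPreserve.negate();
CharMatcher controlCharactersToRemove = CharMatcher.JAVA_ISO_CONTROL.and(allButPreserved);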

Then use removeFrom as before. I don't know how efficient it is, but it's at least simple.
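
If pulling in Guava is not an option, roughly the same filtering can be sketched with the JDK alone. This is not the Guava approach itself but a swapped-in single-pass loop; it relies on Character.isISOControl, the same definition Guava's ISO control matcher is documented to follow. The class and method names are hypothetical.

```java
public class ControlCharFilter {
    // Hypothetical helper, not part of Guava or the JDK.
    static String removeControlCharsExceptWhitespace(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean preserved = c == '\r' || c == '\n' || c == '\t';
            // Character.isISOControl is true for U+0000 to U+001F and U+007F to U+009F.
            if (preserved || !Character.isISOControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = removeControlCharsExceptWhitespace("a\u0001b\tc\r\nd\u008fe");
        System.out.println(cleaned.equals("ab\tc\r\nde")); // true
    }
}
```

The loop visits each character once and allocates a single StringBuilder, which should be cheap enough for typical UI input.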

#3


5  

This seems to be an option:

    String s = "\u0001\t\r\n".replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
    for (char c : s.toCharArray()) {
        System.out.print((int) c + " ");
    }

This prints 9 13 10: the \u0001 is removed and only the tab, carriage return, and line feed remain, just as you asked ("except carriage returns, line feeds, and tabs").

#4


1  

I'm using Selenium to test web screens. I use Hamcrest asserts and matchers to search the page source for different strings based on various conditions.

String pageSource = browser.getPageSource();
assertThat("Text not found!", pageSource, containsString(text));

This works just fine with an IE or Firefox driver, but it fails when using the HtmlUnitDriver. The HtmlUnitDriver formats the page source with tabs, carriage returns, and other control characters. I am using a riff on Nidhish Krishnan's ingenious answer above. If I use Nidhish's solution "out of the box," I am left with extra spaces, so I added a private method named filterTextForComparison:

String pageSource = filterTextForComparison(browser.getPageSource());
assertThat("Text not found!", pageSource, 
        containsString(filterTextForComparison(text)));

And the method:

/**
 * Filter out any characters embedded in the text that will interfere with
 * comparing Strings.
 * 
 * @param text
 *            the text to filter.
 * @return the text with any extraneous character removed.
 */
private String filterTextForComparison(String text) {

    String filteredText = text;

    if (filteredText != null) {
        filteredText = filteredText.replaceAll("\\p{Cc}", " ").replaceAll("\\s{2,}", " ");
    }

    return filteredText;
}

First, the method replaces each control character with a space; then it collapses runs of two or more whitespace characters into a single space. I tried doing everything at once with "\p{Cc}+?", but that didn't handle cases like "\t " (a tab followed by a space), which should become a single " ".
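
The single-regex attempt presumably fell short because a plain space is not in \p{Cc}, so a tab next to a space still leaves two spaces behind. One possible one-pass variant, sketched below, adds the space to the character class. The class and method names are made up for this illustration, and the equivalence with the two-step version is only checked on the sample input shown.

```java
public class WhitespaceCollapseDemo {
    // The two-step approach from the answer above.
    static String twoStep(String text) {
        return text.replaceAll("\\p{Cc}", " ").replaceAll("\\s{2,}", " ");
    }

    // Single pass: any run of control characters and/or spaces becomes one space.
    static String onePass(String text) {
        return text.replaceAll("[\\p{Cc} ]+", " ");
    }

    public static void main(String[] args) {
        String sample = "foo\t bar\r\n\tbaz  qux";
        System.out.println(onePass(sample));                         // foo bar baz qux
        System.out.println(onePass(sample).equals(twoStep(sample))); // true
    }
}
```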

#5


0  

In Java regular expressions, it is possible to exclude some characters from a character class. Here's a sample program that demonstrates something similar:

class test {
    public static void main(String[] args) {
        String testStr = "abcdefABCDEF";
        System.out.println(testStr);
        // Remove every lowercase letter except 'c' and 'd' (class intersection with a negated set).
        System.out.println(testStr.replaceAll("[\\p{Lower}&&[^cd]]", ""));
    }
}

It will produce this output:

abcdefABCDEF
cdABCDEF

#6


0  

Use these:

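// Removes every character outside the ASCII range (U+0000 to U+007F).
public static String removeNonAscii(String str)
{
    return str.replaceAll("[^\\x00-\\x7F]", "");
}

// Removes the whole Unicode "Other" category (\p{C}): control, format,
// private-use, surrogate, and unassigned characters.
public static String removeNonPrintable(String str)
{
    return str.replaceAll("[\\p{C}]", "");
}

// Removes control (Cc), format (Cf), private-use (Co), and unassigned (Cn) characters.
public static String removeSomeControlChar(String str)
{
    return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}

// Same result as removeNonPrintable: \r, \n, and \t are already in \p{C},
// so the second replaceAll finds nothing left to remove.
public static String removeControlCharFull(String str)
{
    return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
}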
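
None of these helpers keeps carriage returns, line feeds, and tabs, which the question asks for. A possible adaptation, sketched here with a hypothetical class and method name, combines \p{C} with the character class intersection shown in the earlier answers. Keep in mind that \p{C} also covers unassigned code points, so what it strips can depend on the Unicode version your JDK supports.

```java
public class OtherCategoryDemo {
    // Hypothetical helper: strips the whole Unicode "Other" category (\p{C}),
    // except carriage return, line feed, and tab.
    static String removeInvisibleExceptWhitespace(String str) {
        return str.replaceAll("[\\p{C}&&[^\\r\\n\\t]]", "");
    }

    public static void main(String[] args) {
        // U+200B (zero width space) is a format character (Cf); U+0001 is a control character (Cc).
        String input = "a\u200Bb\u0001c\r\n\td";
        System.out.println(removeInvisibleExceptWhitespace(input).equals("abc\r\n\td")); // true
    }
}
```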
