How do I remove the copyright symbol and other non-ASCII characters from a Java string?

Time: 2022-08-25 16:47:45

I’m using Java 6 (upgrading is not an option at this time). I have a Java string that contains the following value:

My Product Edition 2014©

The last symbol is a copyright symbol (©). When this string is printed to my terminal (bash on Mac OS X 10.9.5), the copyright symbol renders as a question mark.

I’d like to know how to remove all characters from my string that will render as question marks on my terminal.

4 answers

#1


If you want to remove special characters, you could do something like this:

String s = "My Product Edition 2014©";
s = s.replaceAll("[^\\w\\s]", "");
System.out.println(s);

Output:

My Product Edition 2014
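
One caveat worth checking before using this pattern: \w and \s keep only word characters and whitespace, so ASCII punctuation is stripped along with the copyright symbol. The sketch below compares it with the \p{ASCII} class from java.util.regex, which keeps the whole ASCII range. The class and method names are made up for illustration.

```java
class StripPatterns {
    // Keeps only word characters ([a-zA-Z_0-9]) and whitespace,
    // which is what the pattern in the answer above does.
    static String keepWordAndSpace(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    // Keeps every ASCII character (U+0000 to U+007F), punctuation included.
    static String keepAscii(String s) {
        return s.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // \u00A9 is the copyright symbol, escaped so the source file encoding does not matter.
        String s = "My Product Edition 2014\u00A9 (v1.2)";
        System.out.println(keepWordAndSpace(s)); // My Product Edition 2014 v12
        System.out.println(keepAscii(s));        // My Product Edition 2014 (v1.2)
    }
}
```

If your strings never contain punctuation, the two behave the same; otherwise the second is probably closer to what the question asks for.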

#2


The "right" thing to do here is to fix your terminal so it doesn't print squares. See "How do you echo a 4-digit Unicode character in Bash?" and try echoing Unicode characters directly in your terminal. It may be as simple as ensuring your LANG environment variable specifies a UTF-8 locale (on my Mac, $LANG is en_US.UTF-8). You might also consider using a more full-featured terminal, like iTerm2.

If you really want to strip non-ASCII characters in Java instead, there are a number of equally reasonable ways to do so, but my preference is Guava's CharMatcher, e.g.:

String stripped = CharMatcher.ASCII.retainFrom(original);

You could use a Pattern to strip undesirable characters, but (as demonstrated by the confusion here) it's more hassle than using Guava's out-of-the-box solution.
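
If adding Guava is not an option, roughly the same effect can be had with a plain loop. This is only a sketch that assumes "ASCII" means char values 0 to 127; AsciiFilter and retainAscii are hypothetical names, not part of any library.

```java
class AsciiFilter {
    // Returns a copy of the input that contains only chars in the ASCII range (0 to 127).
    // Surrogate halves of supplementary characters are above 127, so they are dropped too.
    static String retainAscii(String original) {
        StringBuilder sb = new StringBuilder(original.length());
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c < 128) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00A9 is the copyright symbol.
        System.out.println(retainAscii("My Product Edition 2014\u00A9")); // My Product Edition 2014
    }
}
```

It uses nothing beyond java.lang, so it should compile on Java 6.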

#3


You had better adopt the notion that there is no such thing as a "special character". However, there are a couple of reasons why some characters are not shown correctly.

Java keeps all strings in UTF-16 internally. When you print a string, its characters are converted to the encoding of the corresponding output stream or writer. Unfortunately, the Java runtime tries to be smart and uses what is called the "default" encoding unless you explicitly request a specific one.
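
To see where the question mark comes from, it may help to encode the string by hand. The sketch below assumes the output encoding is US-ASCII (the real default on your machine may differ) and shows that an unmappable character is replaced by '?'. It also shows one way to drop every character a given charset cannot encode, using CharsetEncoder.canEncode. EncodableFilter and retainEncodable are names invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncodableFilter {
    // Drops every char the given charset cannot encode.
    // Limitation: canEncode(char) returns false for surrogate halves,
    // so supplementary characters are dropped even for UTF-8.
    static String retainEncodable(String s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (encoder.canEncode(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "My Product Edition 2014\u00A9"; // \u00A9 is the copyright symbol
        Charset ascii = Charset.forName("US-ASCII");

        // getBytes(Charset) substitutes the charset's replacement byte for unmappable chars.
        byte[] bytes = s.getBytes(ascii);
        System.out.println((char) bytes[bytes.length - 1]); // ?

        // The encoding the JVM falls back to; the value depends on the platform.
        System.out.println(Charset.defaultCharset());

        System.out.println(retainEncodable(s, ascii)); // My Product Edition 2014
    }
}
```

Passing Charset.defaultCharset() instead of US-ASCII would filter against whatever the JVM uses by default, though that is not guaranteed to match what your terminal can actually display.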

This especially hurts Windows users, where the default encoding often turns out to be some archaic Microsoft "code page". I have yet to find out where I can tell Windows that I don't want its CP 850 (which is the default whenever you have a German keyboard).

In the long run, you'll fare best when you make the following a habit:

  1. Open all your output streams (or writers) with UTF-8 encoding. Don't rely on System.out/System.err, which use the default encoding.

  2. Make sure you use a terminal that can handle UTF-8. If you're on Windows, enter chcp 65001 to set the encoding of the cmd window to UTF-8, and use a font that can render the Unicode characters you need.
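
The first habit could look something like the following. It is a minimal sketch that wraps an existing stream in a PrintStream with an explicit UTF-8 encoding; Utf8Out and its utf8 helper are hypothetical names, and the terminal still has to be configured for UTF-8 for the output to display correctly.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Out {
    // Wraps an output stream in an auto-flushing PrintStream that always encodes text as UTF-8,
    // regardless of the platform default encoding.
    static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        // \u00A9 is the copyright symbol; it is written as the two bytes 0xC2 0xA9.
        out.println("My Product Edition 2014\u00A9");
    }
}
```

The three-argument PrintStream constructor has been around since long before Java 6, so this should work on old runtimes as well.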

#4


You can remove every character other than printable ASCII characters using a regex and replaceAll():

public static String asciiOnly(String unicodeString)
{
    String asciiString = unicodeString.replaceAll("[^\\x20-\\x7E]", "");
    return asciiString;
}

Here is an explanation of the regular expression "[^\\x20-\\x7E]":

  • ^ - Negation: inside the brackets, it matches any character that is not listed.

  • \\x20 - Hex value for the space character, which is the first printable ASCII character.

  • - - Denotes a range, i.e. x20 to x7E.

  • \\x7E - Hex value for ~, which is the last printable ASCII character.
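
A quick check of the pattern explained above. Note that, because the range starts at the space character, tabs and newlines are removed as well, which may or may not be what you want. The wrapper class name AsciiOnlyCheck is made up for this example.

```java
class AsciiOnlyCheck {
    // Same pattern as above: remove everything outside the printable ASCII range 0x20 to 0x7E.
    static String asciiOnly(String unicodeString) {
        return unicodeString.replaceAll("[^\\x20-\\x7E]", "");
    }

    public static void main(String[] args) {
        // \u00A9 is the copyright symbol.
        System.out.println(asciiOnly("My Product Edition 2014\u00A9")); // My Product Edition 2014

        // Control characters such as tab (0x09) and newline (0x0A) fall below 0x20 and are stripped too.
        System.out.println(asciiOnly("a\tb\nc")); // abc
    }
}
```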

