Java, regular expressions: needing to escape the backslash in a regex

Time: 2021-07-21 00:18:32

With reference to the question below: String.replaceAll single backslashes with double backslashes

I wrote a test program and found that the result is true in both cases, whether I escape the backslash or not. This may be because (1) \t is a recognized Java string escape sequence (try \s and the compiler complains), and (2) the resulting literal tab is accepted as-is in the regex. I am not sure which of these is the real reason.

Is there a general guideline for escaping regular expressions in Java? I think using two backslashes is the correct approach.

I would still like to know your opinions.


public class TestDeleteMe {

  public static void main(String[] args) {
    System.out.println(System.currentTimeMillis());

    String str1 = "a\tb"; // tab between a and b

    // pattern: a and b with any number of spaces or tabs between them
    // "\\t": the regex engine receives backslash + t and reads it as a tab
    System.out.println("matches = " + str1.matches("^a[ \\t]*b$"));
    // "\t": the compiler has already put a real tab character into the pattern
    System.out.println("matches = " + str1.matches("^a[ \t]*b$"));
  }
}

4 answers

#1


9  

There are two interpretations of escape sequences going on: first by the Java compiler, and then by the regex engine. When the Java compiler sees two backslashes in a string literal, it replaces them with a single backslash. When a t follows a single backslash, Java replaces the pair with a tab; when a t follows a double backslash, Java leaves the t alone. However, because the two backslashes have been replaced by a single one, the regex engine sees \t and interprets it as a tab.
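The two stages can be made visible by looking at the strings the regex engine actually receives. The following is a minimal sketch; the class and member names are made up for illustration.

```java
import java.util.regex.Pattern;

class TwoStageEscapeDemo {
    // Source literal "\\t": after compilation the string holds backslash + 't' (two characters).
    static final String ESCAPED = "a[ \\t]*b";
    // Source literal "\t": after compilation the string holds one real tab character.
    static final String RAW_TAB = "a[ \t]*b";

    // Checks whether the given regex matches "a<TAB>b" in full.
    static boolean matchesTab(String regex) {
        return Pattern.matches(regex, "a\tb");
    }

    public static void main(String[] args) {
        System.out.println(ESCAPED.length());    // 8
        System.out.println(RAW_TAB.length());    // 7
        System.out.println(matchesTab(ESCAPED)); // true: the regex engine expands \t
        System.out.println(matchesTab(RAW_TAB)); // true: the compiler already expanded it
    }
}
```

The two strings differ by one character, yet both patterns match the same input, which is why the test program in the question prints true twice.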

I think it is cleaner to let the regex engine interpret \t as a tab (i.e. write "\\t" in Java), because you then see the expression in its intended form during debugging, logging, etc. If you convert a Pattern built with "\t" to a string, you will see a tab character in the middle of your regular expression and may mistake it for other whitespace. Patterns built with "\\t" do not have this problem: they show \t with a single backslash, telling you exactly which kind of whitespace they match.
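The logging argument can be checked with Pattern.pattern(), which returns the string the pattern was compiled from. This is only a sketch, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class PatternLoggingDemo {
    // Returns the regex source text as it would appear in a log line.
    static String logged(String regex) {
        return Pattern.compile(regex).pattern();
    }

    public static void main(String[] args) {
        // Shows a visible backslash followed by t inside the brackets.
        System.out.println("[" + logged("a[ \\t]*b") + "]");
        // Shows an invisible tab character inside the brackets instead.
        System.out.println("[" + logged("a[ \t]*b") + "]");
    }
}
```

In the first line of output the tab is spelled out as backslash-t; in the second it is rendered as blank space that is hard to tell apart from the space next to it.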

#2


6  

Yes, there is a general guideline about escaping. Escape sequences in a Java string literal are replaced by the Java compiler. The compiler rejects any escape sequence it does not know, e.g. \s on Java versions before 15 (Java 15 added \s as a string escape for a plain space, which is still not the regex character class). When you write a string literal for a regex pattern, the compiler processes the literal as usual and replaces every escape sequence with the corresponding character. Then, when the program runs, the Pattern class compiles the resulting string, that is, it evaluates escape sequences a second time. The Pattern class knows \s as a character class and can therefore compile a pattern containing it. However, you need to protect \s from the Java compiler, which does not treat it as that class. To do so, you escape the backslash, which gives \\s.
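As an illustration of the rule for character classes, here is a small sketch that uses \d and \s through doubled backslashes. The helper names and sample inputs are invented for this example.

```java
class CharacterClassEscapeDemo {
    // "\\d+" reaches the regex engine as \d+ (one or more digits).
    static String maskDigits(String input) {
        return input.replaceAll("\\d+", "#");
    }

    // "\\s+" reaches the regex engine as \s+ (one or more whitespace characters).
    static String[] splitOnWhitespace(String input) {
        return input.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(maskDigits("order 42, item 7")); // order #, item #
        for (String part : splitOnWhitespace("a  b\tc")) {
            System.out.println(part); // a, then b, then c
        }
    }
}
```

Writing "\d+" with a single backslash does not compile, because \d is not a Java string escape.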

In short, every backslash that is meant for the regex engine has to be written twice in the Java string literal. If you want to match a literal backslash, the correct literal is "\\\\": the Java compiler turns it into \\, which the Pattern compiler recognizes as an escaped backslash.
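Applied to the task from the linked question (replacing single backslashes with double ones), the rule might look like the sketch below. The method names are hypothetical; Pattern.quote and Matcher.quoteReplacement are standard java.util.regex helpers that avoid counting backslashes by hand.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashReplaceDemo {
    // Regex literal "\\\\" becomes \\ at runtime and matches one backslash.
    // The replacement string has its own escaping, so two output backslashes
    // need four at runtime, i.e. eight in the source literal.
    static String doubleBackslashes(String input) {
        return input.replaceAll("\\\\", "\\\\\\\\");
    }

    // Same result, with both the regex and the replacement quoted by the library.
    static String doubleBackslashesQuoted(String input) {
        return input.replaceAll(Pattern.quote("\\"), Matcher.quoteReplacement("\\\\"));
    }

    public static void main(String[] args) {
        String path = "C:\\temp\\file.txt"; // runtime value: C:\temp\file.txt
        System.out.println(doubleBackslashes(path));       // C:\\temp\\file.txt
        System.out.println(doubleBackslashesQuoted(path)); // C:\\temp\\file.txt
    }
}
```

When no regex features are needed at all, String.replace("\\", "\\\\") does the same job with plain string matching and only one level of escaping.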

#3


6  

The first form, \\t, is expanded to a tab character by the Pattern class.

The second form, \t, is expanded to a tab character by the Java compiler before the pattern is built.

In the end, you get a tab character either way.

#4


0  

With org.apache.commons.lang3.StringEscapeUtils.unescapeJava(...), you can unescape most of the common special characters as well as Unicode escapes (it converts \uXXXX sequences into the readable characters they stand for).
