Java Regular Expressions: Hands-On Code Examples

Date: 2023-01-16 18:45:57

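The classes Java uses for regular expressions live in the java.util.regex package, which contains three classes: Pattern, Matcher, and PatternSyntaxException. A Pattern object is the compiled representation of a regular expression. The class has no public constructor, so it cannot be instantiated directly; instead it provides two public static factory methods, compile(String regex) and compile(String regex, int flags), where regex is the regular expression and flags selects the match mode. Matcher is the engine that interprets a Pattern and performs match operations against an input string. It has no public constructor either; a Matcher is obtained by calling matcher(CharSequence input) on a Pattern object. PatternSyntaxException is an unchecked exception that indicates a syntax error in a regular expression.

Regular expressions in Java are represented and executed mainly through these three classes. The simple code below shows how to use them (mostly the first two) and doubles as a test harness for learning regex syntax: it runs in a loop, prompts for a regular expression and a string to search, and reports whether a match was found and, if so, at which position.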

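    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.regex.PatternSyntaxException;

    public class RegexStudy {
        public static void main(String[] args) {
            Scanner scanner = new Scanner(System.in);
            // Loop until the input stream is closed.
            while (true) {
                System.out.print("Enter your regex:");
                if (!scanner.hasNextLine()) {
                    break;
                }
                Pattern pattern;
                try {
                    pattern = Pattern.compile(scanner.nextLine());
                } catch (PatternSyntaxException e) {
                    // Report an invalid regex and prompt again instead of crashing.
                    System.out.println("Invalid regex: " + e.getMessage());
                    continue;
                }
                System.out.print("Enter input string to search: ");
                if (!scanner.hasNextLine()) {
                    break;
                }
                Matcher matcher = pattern.matcher(scanner.nextLine());
                boolean found = false;
                // Each call to find() advances to the next match in the input.
                while (matcher.find()) {
                    System.out.printf("I found the text \"%s\" starting at "
                            + "index %d and ending at index %d.%n",
                            matcher.group(), matcher.start(), matcher.end());
                    found = true;
                }
                if (!found) {
                    System.out.println("No match found.");
                }
            }
        }
    }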
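
The two compile overloads and PatternSyntaxException described above can also be exercised without console input. The following is a minimal sketch, not part of the original program: the class name CompileSketch and its two helper methods are made up for illustration, and Pattern.CASE_INSENSITIVE is just one of the flags that the two-argument overload accepts.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, used only for illustration.
class CompileSketch {
    // Returns true if the regex compiles, false if it has a syntax error.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() and getIndex() say what went wrong and where.
            System.out.println("Bad regex: " + e.getDescription()
                    + " near index " + e.getIndex());
            return false;
        }
    }

    // Uses compile(String regex, int flags) to search for a literal word,
    // ignoring case. Pattern.quote() escapes any metacharacters in the word.
    static boolean containsIgnoreCase(String word, String input) {
        Pattern pattern = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE);
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("[bcr]at"));                    // true
        System.out.println(isValidRegex("[bcr"));                       // false (unclosed class)
        System.out.println(containsIgnoreCase("dog", "The DOG plays")); // true
    }
}
```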

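Let us start by testing character classes. In a regular expression, a character class is a set of characters enclosed in square brackets. It specifies which characters can match a single character of the input string: however many characters the brackets contain, the class as a whole matches exactly one character. The program above is used to test each kind of character class.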

Simple class ([abc]). In the examples below, the match succeeds only when the first character is one of the characters inside the brackets.

    Enter your regex:[bcr]at
    Enter input string to search: bat
    I found the text "bat" starting at index 0 and ending at index 3.
    Enter your regex:[bcr]at
    Enter input string to search: cat
    I found the text "cat" starting at index 0 and ending at index 3.
    Enter your regex:[bcr]at
    Enter input string to search: rat
    I found the text "rat" starting at index 0 and ending at index 3.
    Enter your regex:[bcr]at
    Enter input string to search: hat
    No match found.

Negation ([^abc]). The match succeeds only when the first character is not any of the characters inside the brackets.

    Enter your regex:[^bcr]at
    Enter input string to search: bat
    No match found.
    Enter your regex:[^bcr]at
    Enter input string to search: cat
    No match found.
    Enter your regex:[^bcr]at
    Enter input string to search: rat
    No match found.
    Enter your regex:[^bcr]at
    Enter input string to search: hat
    I found the text "hat" starting at index 0 and ending at index 3.

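Range ([a-zA-Z]). Placing - between a first and a last character specifies any character between the two, inclusive, for example a-z or 1-5. Ranges can also be written side by side to widen the set of possible matches: [a-zA-Z1-5] matches a through z, or A through Z, or 1 through 5. Ranges placed side by side are combined with OR.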

    Enter your regex:[a-c]
    Enter input string to search: a
    I found the text "a" starting at index 0 and ending at index 1.
    Enter your regex:[a-c]b
    Enter input string to search: bb
    I found the text "bb" starting at index 0 and ending at index 2.
    Enter your regex:[a-c1-5]
    Enter input string to search: 1
    I found the text "1" starting at index 0 and ending at index 1.

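Union ([a-d[m-p]]). Two or more character classes can be combined to match a single character. To express the OR relationship, nest one character class inside another, for example [1-4[6-8]]. This gives the same result as writing the ranges side by side, as described above.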

    Enter your regex:[a-c[d-e]]
    Enter input string to search: d
    I found the text "d" starting at index 0 and ending at index 1.
    Enter your regex:[a-c[d-e]]
    Enter input string to search: f
    No match found.
    Enter your regex:[a-c[d-e]]
    Enter input string to search: b
    I found the text "b" starting at index 0 and ending at index 1.
    Enter your regex:[1-4[6-8]]
    Enter input string to search: 4
    I found the text "4" starting at index 0 and ending at index 1.
    Enter your regex:[1-4[6-8]]
    Enter input string to search: 7
    I found the text "7" starting at index 0 and ending at index 1.
    Enter your regex:[1-46-8]
    Enter input string to search: 4
    I found the text "4" starting at index 0 and ending at index 1.
    Enter your regex:[1-46-8]
    Enter input string to search: 6
    I found the text "6" starting at index 0 and ending at index 1.

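Intersection ([a-z&&[def]]). The && operator matches only the characters that belong to both the outer class and the nested class. For example, [0-9&&[345]] matches only 3, 4, and 5.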

    Enter your regex:[0-9&&[3-5]]
    Enter input string to search: 0
    No match found.
    Enter your regex:[0-9&&[3-5]]
    Enter input string to search: 5
    I found the text "5" starting at index 0 and ending at index 1.
    Enter your regex:[0-9&&[3-5]]
    Enter input string to search: 4
    I found the text "4" starting at index 0 and ending at index 1.
    Enter your regex:[0-9&&[246]]
    Enter input string to search: 5
    No match found.
    Enter your regex:[0-9&&[246]]
    Enter input string to search: 6
    I found the text "6" starting at index 0 and ending at index 1.

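Subtraction ([a-z&&[^bc]], [a-z&&[^m-p]]). Combining ^ and &&, both introduced above, builds a set difference. For example, [0-9&&[^345]] matches the digits 0 through 9 except 3, 4, and 5, that is, 0, 1, 2, 6, 7, 8, and 9.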

    Enter your regex:[0-9&&[^345]]
    Enter input string to search: 4
    No match found.
    Enter your regex:[0-9&&[^345]]
    Enter input string to search: 2
    I found the text "2" starting at index 0 and ending at index 1.
    Enter your regex:[0-9&&[^345]]
    Enter input string to search: 5
    No match found.
    Enter your regex:[0-9&&[^345]]
    Enter input string to search: 6
    I found the text "6" starting at index 0 and ending at index 1.
    Enter your regex:[0-9&&[^2-7]]
    Enter input string to search: 4
    No match found.
    Enter your regex:[0-9&&[^2-7]]
    Enter input string to search: 1
    I found the text "1" starting at index 0 and ending at index 1.
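
The set operations above (union, intersection, subtraction) can be checked in code by running each candidate character through a character class. This is a small sketch under the assumption that a helper like accepted() is convenient for experimenting; the class and method names are illustrative, not from the original article.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only for illustration.
class CharClassSketch {
    // Returns the characters of candidates that the character class accepts,
    // in their original order.
    static String accepted(String charClass, String candidates) {
        Pattern pattern = Pattern.compile(charClass);
        StringBuilder kept = new StringBuilder();
        for (char c : candidates.toCharArray()) {
            // matches() requires the whole (one-character) input to match.
            if (pattern.matcher(String.valueOf(c)).matches()) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String digits = "0123456789";
        System.out.println(accepted("[1-4[6-8]]", digits));    // union: 1234678
        System.out.println(accepted("[1-46-8]", digits));      // same set: 1234678
        System.out.println(accepted("[0-9&&[345]]", digits));  // intersection: 345
        System.out.println(accepted("[0-9&&[^345]]", digits)); // subtraction: 0126789
        System.out.println(accepted("[^bcr]", "bcrh"));        // negation: h
    }
}
```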

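Java provides several convenient predefined character classes, such as the dot (.), \d, \D, \s, \S, \w, and \W. Prefer them wherever possible: they make the code easier to read and avoid the errors that come with long hand-written character lists. When a predefined class that starts with a backslash appears in a Java string literal, the backslash must itself be escaped with a second backslash, for example private final String REGEX = "\\d";. The program above reads the regular expression directly from the console, so no extra backslash is needed there, but in source code it is required.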

    Enter your regex:.
    Enter input string to search: @
    I found the text "@" starting at index 0 and ending at index 1.
    Enter your regex:.
    Enter input string to search: a
    I found the text "a" starting at index 0 and ending at index 1.
    Enter your regex:.
    Enter input string to search: 1
    I found the text "1" starting at index 0 and ending at index 1.
    Enter your regex:\d
    Enter input string to search: 1
    I found the text "1" starting at index 0 and ending at index 1.
    Enter your regex:\d
    Enter input string to search: e
    No match found.
    Enter your regex:\D
    Enter input string to search: 1
    No match found.
    Enter your regex:\D
    Enter input string to search: w
    I found the text "w" starting at index 0 and ending at index 1.
    Enter your regex:\S
    Enter input string to search: s
    I found the text "s" starting at index 0 and ending at index 1.
    Enter your regex:\w
    Enter input string to search: 1
    I found the text "1" starting at index 0 and ending at index 1.
    Enter your regex:\w
    Enter input string to search: !
    No match found.
    Enter your regex:\W
    Enter input string to search: !
    I found the text "!" starting at index 0 and ending at index 1.
    Enter your regex:\W
    Enter input string to search: d
    No match found.
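
The console transcript uses single backslashes because the regex is read as raw text. In Java source the same patterns need doubled backslashes. The sketch below illustrates this; the class name and helper methods are assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only for illustration.
class PredefinedSketch {
    // The string literal "\\d+" holds the three characters \d+ at run time.
    static final String DIGITS = "\\d+";

    // String.matches() tests the whole string against the regex.
    static boolean isAllDigits(String s) {
        return s.matches(DIGITS);
    }

    // \w matches a letter, a digit, or an underscore.
    static boolean isWord(String s) {
        return Pattern.matches("\\w+", s);
    }

    // \s+ matches one or more whitespace characters.
    static String[] splitOnWhitespace(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("2023"));               // true
        System.out.println(isAllDigits("20a3"));               // false
        System.out.println(isWord("regex_1"));                 // true
        System.out.println(splitOnWhitespace("a  b\tc").length); // 3
    }
}
```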

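Now let us look at quantifiers. The syntax rules describe three kinds of quantifiers (greedy, reluctant, and possessive). Going by their descriptions the three seem to behave identically, but there are subtle differences between them that lead to different match results. Before turning to those differences, here is a simple example with the quantifiers ?, *, and +: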

    Enter your regex:a?
    Enter input string to search:
    I found the text "" starting at index 0 and ending at index 0.
    Enter your regex:a*
    Enter input string to search:
    I found the text "" starting at index 0 and ending at index 0.
    Enter your regex:a+
    Enter input string to search:
    No match found.

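In this example the regular expressions are a?, a*, and a+, and the input is the empty string "". The first two match the empty string, but the third fails. In the two successful matches the start and end indices are both 0, because the empty string has length 0; such a match is called a zero-length match. Zero-length matches are easy to recognize because the start and end indices are always equal. They can occur in an empty input string, at the beginning of the input string, after the last character of the input string, or between two characters. In an empty string, the start and end indices are both 0. Because ? and * allow zero occurrences while + requires at least one, only the first two can produce zero-length matches; + never does. The following runs demonstrate these cases.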

    Enter your regex:a?
    Enter input string to search: ababaaab
    I found the text "a" starting at index 0 and ending at index 1.
    I found the text "" starting at index 1 and ending at index 1.
    I found the text "a" starting at index 2 and ending at index 3.
    I found the text "" starting at index 3 and ending at index 3.
    I found the text "a" starting at index 4 and ending at index 5.
    I found the text "a" starting at index 5 and ending at index 6.
    I found the text "a" starting at index 6 and ending at index 7.
    I found the text "" starting at index 7 and ending at index 7.
    I found the text "" starting at index 8 and ending at index 8.
    Enter your regex:a*
    Enter input string to search: ababaaab
    I found the text "a" starting at index 0 and ending at index 1.
    I found the text "" starting at index 1 and ending at index 1.
    I found the text "a" starting at index 2 and ending at index 3.
    I found the text "" starting at index 3 and ending at index 3.
    I found the text "aaa" starting at index 4 and ending at index 7.
    I found the text "" starting at index 7 and ending at index 7.
    I found the text "" starting at index 8 and ending at index 8.
    Enter your regex:a+
    Enter input string to search: ababaaab
    I found the text "a" starting at index 0 and ending at index 1.
    I found the text "a" starting at index 2 and ending at index 3.
    I found the text "aaa" starting at index 4 and ending at index 7.

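In this example, + never produces a zero-length match; it matches only one or more a characters. ? and * do produce zero-length matches: one at each position where the next character is not a (for example, at the b in ab), and one after the last character of the input string.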
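
Zero-length matches can be detected in code by comparing start() and end(). The following sketch collects the positions of all zero-length matches for the runs shown above; the class and method names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only for illustration.
class ZeroLengthSketch {
    // Returns the index of every match whose start and end are equal.
    static List<Integer> zeroLengthPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            if (matcher.start() == matcher.end()) {
                positions.add(matcher.start());
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(zeroLengthPositions("a?", "ababaaab")); // [1, 3, 7, 8]
        System.out.println(zeroLengthPositions("a*", "ababaaab")); // [1, 3, 7, 8]
        System.out.println(zeroLengthPositions("a+", "ababaaab")); // []
        System.out.println(zeroLengthPositions("a?", ""));         // [0]
    }
}
```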

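To match a pattern exactly n times, follow it with {n} in curly braces. To match at least n times, add a comma after n: {n,}. To also set an upper limit on the number of repetitions, add the limit m inside the braces: {n,m}.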

    Enter your regex:a{3}
    Enter input string to search: aaaaaaaaa
    I found the text "aaa" starting at index 0 and ending at index 3.
    I found the text "aaa" starting at index 3 and ending at index 6.
    I found the text "aaa" starting at index 6 and ending at index 9.
    Enter your regex:a{3,}
    Enter input string to search: aaaaaaaaa
    I found the text "aaaaaaaaa" starting at index 0 and ending at index 9.
    Enter your regex:a{3,6}
    Enter input string to search: aaaaaaaaa
    I found the text "aaaaaa" starting at index 0 and ending at index 6.
    I found the text "aaa" starting at index 6 and ending at index 9.
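
The same three runs can be reproduced in code. This is a minimal sketch; the class name and the matches() helper are assumptions introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only for illustration.
class CountedQuantifierSketch {
    // Returns the text of every successive match of regex in input.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "aaaaaaaaa"; // nine a characters
        System.out.println(matches("a{3}", input));   // [aaa, aaa, aaa]
        System.out.println(matches("a{3,}", input));  // [aaaaaaaaa]
        System.out.println(matches("a{3,6}", input)); // [aaaaaa, aaa]
    }
}
```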

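Besides single characters, quantifiers can also be applied to character classes and capturing groups. For example, abc+ means ab followed by at least one c, [abc]+ means a or b or c appearing at least once, and (abc)+ means the string abc appearing at least once. Put another way, [] and () bind more tightly than a quantifier: resolve them first, then apply the quantifier.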

    Enter your regex:(dog){3}
    Enter input string to search: dogdogdogdogdogdog
    I found the text "dogdogdog" starting at index 0 and ending at index 9.
    I found the text "dogdogdog" starting at index 9 and ending at index 18.
    Enter your regex:dog{3}
    Enter input string to search: dogdogdogdogdogdog
    No match found.
    Enter your regex:[abc]{3}
    Enter input string to search: abccabaaaccbbbc
    I found the text "abc" starting at index 0 and ending at index 3.
    I found the text "cab" starting at index 3 and ending at index 6.
    I found the text "aaa" starting at index 6 and ending at index 9.
    I found the text "ccb" starting at index 9 and ending at index 12.
    I found the text "bbc" starting at index 12 and ending at index 15.
    Enter your regex:abc{3}
    Enter input string to search: abccabaaaccbbbc
    No match found.
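
The scope of a quantifier is easiest to see by testing what each pattern actually accepts. In the sketch below, dog{3} turns out to mean "do" followed by three g characters. The class name and the found() helper are illustrative assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only for illustration.
class QuantifierScopeSketch {
    // Returns true if regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Without parentheses, {3} applies only to the preceding g.
        System.out.println(found("dog{3}", "doggg"));        // true
        System.out.println(found("dog{3}", "dogdogdog"));    // false
        // With parentheses, {3} applies to the whole group.
        System.out.println(found("(dog){3}", "dogdogdog"));  // true
        // With a character class, {3} means any three of a, b, c in a row.
        System.out.println(found("[abc]{3}", "cab"));        // true
        System.out.println(found("abc{3}", "abccc"));        // true
    }
}
```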

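Greedy, reluctant, and possessive quantifiers specify the same repetition counts, but the subtle differences between them often lead to different match results.

Greedy quantifiers are called greedy because they force the matcher to read in the entire input string before attempting the first match. If that first attempt (against the whole input string) fails, the matcher backs off by one character and tries again, repeating the process until a match is found or there are no more characters to back off from.

Reluctant quantifiers take the opposite approach. They start at the beginning of the input string and read in one character at a time, trying a match after each one and reading another character whenever the attempt fails. The last thing they try is the entire input string.

Possessive quantifiers always read in the entire input string and try for a match once and only once. Unlike greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.

The input string xfooxxxxxxfoo illustrates the difference between the three kinds of quantifier.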

    Enter your regex:.*foo
    Enter input string to search: xfooxxxxxxfoo
    I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

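This example uses the greedy quantifier .* to find any character zero or more times, followed by the letters "f" "o" "o". The .* part first consumes the entire input string, xfooxxxxxxfoo, which leaves nothing for the trailing foo to match, so the attempt fails. The matcher then backs off one character at a time until the rightmost foo has been given back. At that point .* holds xfooxxxxxx, foo matches the remaining three characters, and the search ends.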

    Enter your regex:.*?foo
    Enter input string to search: xfooxxxxxxfoo
    I found the text "xfoo" starting at index 0 and ending at index 4.
    I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

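This example uses the reluctant quantifier .*?. It begins by consuming nothing. Because "foo" does not appear at the beginning of the input string, the matcher is forced to read in the first letter x, which triggers the first match, xfoo. After the first match the loop continues until the input string is exhausted, and in this example it finds a second match, xxxxxxfoo.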

    Enter your regex:.*+foo
    Enter input string to search: xfooxxxxxxfoo
    No match found.

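This example uses a possessive quantifier, and the match fails. .*+ consumes the entire input string, xfooxxxxxxfoo, leaving nothing for the trailing foo to match. Unlike a greedy quantifier, it does not give characters back and retry; it attempts the match only once. In cases where a match is not found immediately, a possessive quantifier performs better than a greedy one because it does no backtracking.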
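
The three runs above can be compared side by side in code. This is a small sketch; the class name and the matches() helper are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only for illustration.
class QuantifierKindsSketch {
    // Returns the text of every successive match of regex in input.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(matches(".*foo", input));  // greedy:     [xfooxxxxxxfoo]
        System.out.println(matches(".*?foo", input)); // reluctant:  [xfoo, xxxxxxfoo]
        System.out.println(matches(".*+foo", input)); // possessive: []
    }
}
```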

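A capturing group treats multiple characters as a single unit by placing them inside parentheses. The portion of the input string that matches a capturing group is saved in memory so that it can be recalled later through a backreference. In a regular expression, a backreference is written as a backslash \ followed by the number of the group to recall. The groupCount() method of a Matcher object returns the number of capturing groups in the matcher's pattern.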

    Enter your regex:(\d\d)
    Enter input string to search: 1212
    I found the text "12" starting at index 0 and ending at index 2.
    I found the text "12" starting at index 2 and ending at index 4.
    Enter your regex:(\d\d)
    Enter input string to search: 1234
    I found the text "12" starting at index 0 and ending at index 2.
    I found the text "34" starting at index 2 and ending at index 4.
    Enter your regex:(\d\d)\1
    Enter input string to search: 1212
    I found the text "1212" starting at index 0 and ending at index 4.
    Enter your regex:(\d\d)\1
    Enter input string to search: 1234
    No match found.

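The regular expression (\d\d) matches any two digits, as the output of the first two runs confirms. The last two runs add a backreference, and the result changes: the matched text must now consist of whatever (\d\d) captured, repeated twice, such as 1212 or 3434.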
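
In source code the backreference is written "\\1", and the captured text is available through group(1). The sketch below shows both, along with groupCount(); the class and method names are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only for illustration.
class BackreferenceSketch {
    // Two digits, captured as group 1, followed by the same two digits again.
    static final Pattern REPEATED_PAIR = Pattern.compile("(\\d\\d)\\1");

    // Returns true if the whole string is a two-digit pair repeated twice.
    static boolean isRepeatedPair(String s) {
        return REPEATED_PAIR.matcher(s).matches();
    }

    // Returns the captured pair of the first match, or null if there is none.
    static String firstRepeatedPair(String input) {
        Matcher matcher = REPEATED_PAIR.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Returns the number of capturing groups in the pattern.
    static int groupCount() {
        return REPEATED_PAIR.matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedPair("1212"));        // true
        System.out.println(isRepeatedPair("1234"));        // false
        System.out.println(firstRepeatedPair("xx3434yy")); // 34
        System.out.println(groupCount());                  // 1
    }
}
```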

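Boundary matchers make a pattern more precise, for example by matching a particular word only at the beginning or end of a line, by checking whether the match falls on a word boundary, or by requiring it to begin where the previous match ended. Here are some concrete examples. In the second and third runs the input is dog preceded by whitespace.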

    Enter your regex:^dog$
    Enter input string to search: dog
    I found the text "dog" starting at index 0 and ending at index 3.
    Enter your regex:^dog$
    Enter input string to search:  dog
    No match found.
    Enter your regex:\s*dog$
    Enter input string to search:  dog
    I found the text " dog" starting at index 0 and ending at index 4.
    Enter your regex:^dog\w*
    Enter input string to search: dogblahblah
    I found the text "dogblahblah" starting at index 0 and ending at index 11.

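\b checks whether the pattern occurs at a word boundary. The result depends on whether \b is placed before the pattern, after it, or both. With \b on both sides, the pattern matches only a whole word; with \b only in front, it matches at the start of a word; with \b only at the end, it matches at the end of a word.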

    Enter your regex:\bdog\b
    Enter input string to search: The dog plays in the yard
    I found the text "dog" starting at index 4 and ending at index 7.
    Enter your regex:\bdog\b
    Enter input string to search: The doggie plays in the yard.
    No match found.
    Enter your regex:\bdog
    Enter input string to search: The doggie plays in the yard.
    I found the text "dog" starting at index 4 and ending at index 7.

\B matches at a position that is not a word boundary.

    Enter your regex:\bdog\B
    Enter input string to search: The doggie plays in the yard
    I found the text "dog" starting at index 4 and ending at index 7.
    Enter your regex:\bdog\B
    Enter input string to search: The dog plays in the yard.
    No match found.
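
The boundary matchers ^, $, \b, and \B from the runs above can be checked in code as follows. This is a minimal sketch; the class name and the found() helper are assumptions introduced for the example, and the backslashes are doubled because the patterns appear in Java string literals.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only for illustration.
class BoundarySketch {
    // Returns true if regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^dog$", "dog"));                  // true
        System.out.println(found("^dog$", " dog"));                 // false: leading space
        System.out.println(found("\\s*dog$", " dog"));              // true
        System.out.println(found("\\bdog\\b", "The dog plays"));    // true: whole word
        System.out.println(found("\\bdog\\b", "The doggie plays")); // false
        System.out.println(found("\\bdog\\B", "The doggie plays")); // true: word continues
        System.out.println(found("\\bdog\\B", "The dog plays"));    // false
    }
}
```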

\G requires the match to begin where the previous match ended.

    Enter your regex:dog
    Enter input string to search: dog dog
    I found the text "dog" starting at index 0 and ending at index 3.
    I found the text "dog" starting at index 4 and ending at index 7.
    Enter your regex:\Gdog
    Enter input string to search: dog dog
    I found the text "dog" starting at index 0 and ending at index 3.
    Enter your regex:\Gdog
    Enter input string to search: dogdog
    I found the text "dog" starting at index 0 and ending at index 3.
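    I found the text "dog" starting at index 3 and ending at index 6.

The second run prints only one dog: the second dog starts at index 4, but the previous match ended at index 3, so the match fails. The first dog is found because on the first attempt \G matches at the beginning of the input, index 0. In the third run both occurrences of dog match. This shows that with \G, a match must start exactly where the previous match ended.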
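
The effect of \G is visible when the start index of every match is recorded. The following sketch does that for the three runs above; the class name and the matchStarts() helper are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only for illustration.
class LastMatchAnchorSketch {
    // Returns the start index of every successive match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(matchStarts("dog", "dog dog"));    // [0, 4]
        // The space at index 3 breaks the chain of adjacent matches.
        System.out.println(matchStarts("\\Gdog", "dog dog")); // [0]
        System.out.println(matchStarts("\\Gdog", "dogdog"));  // [0, 3]
    }
}
```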