I wish to generate a regular expression from a string containing numbers, and then use this as a Pattern to search for similar strings. Example:
String s = "Page 3 of 23";
If I substitute all digits by \d
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    if (Character.isDigit(c)) {
        sb.append("\\d"); // appends a backslash followed by d
    } else {
        sb.append(c);     // copied as-is, so metacharacters are NOT escaped
    }
}
Pattern numberPattern = Pattern.compile(sb.toString());
// equivalent to: Pattern.compile("Page \\d of \\d\\d");
I can use this to match similar strings (e.g. "Page 7 of 47"). My problem is that if I do this naively, metacharacters such as (){}- will not be escaped. Is there a library that does this, or an exhaustive list of the characters that I must and must not escape in a regular expression? (I could try to extract them from the Javadocs, but I am worried about missing something.)
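To make the problem concrete, here is a minimal sketch of the naive approach applied to an input that contains parentheses. The class name NaivePatternDemo and the helper naiveRegex are made up for illustration; only the digit-substitution loop comes from the question.

```java
import java.util.regex.Pattern;

public class NaivePatternDemo {
    // Replaces each digit with \d and copies every other character unchanged,
    // so regex metacharacters in the input keep their special meaning.
    static String naiveRegex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(Character.isDigit(c) ? "\\d" : String.valueOf(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String regex = naiveRegex("Page (3) of 23");
        System.out.println(regex); // Page (\d) of \d\d

        // The parentheses are read as a capturing group, not as literal characters.
        System.out.println(Pattern.matches(regex, "Page (7) of 47")); // false
        System.out.println(Pattern.matches(regex, "Page 7 of 47"));   // true
    }
}
```

The generated pattern rejects the very shape of string it was built from and accepts one without parentheses, which is why the literal parts need escaping before the digits are substituted.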
Alternatively, is there a library which already does this? (At this stage I don't want to use a full Natural Language Processing solution.)
NOTE: @dasblinkenlight's edited answer now works for me!
1 Solution
#1
10
Java's regexp library provides this functionality:
String s = Pattern.quote(orig);
The "quoted" string will have all of its metacharacters treated as literals. First escape your string, then go through it and replace the digits with \d to make a regular expression. Since the regex library uses \Q and \E for quoting, you need to enclose your own portion of the regex in the inverse quotes, \E and \Q, so that it sits outside the quoted region.
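As a small illustration of what quoting does, the sketch below quotes a string full of metacharacters and matches it literally. The class name QuoteDemo, the helper matchesLiterally, and the sample text are assumptions made up for this example; Pattern.quote and Pattern.matches are the standard java.util.regex calls.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Returns true only if candidate is character-for-character equal to literal,
    // going through the regex engine with the literal fully quoted.
    static boolean matchesLiterally(String literal, String candidate) {
        return Pattern.matches(Pattern.quote(literal), candidate);
    }

    public static void main(String[] args) {
        String orig = "Total (USD): 5.00 {net}";

        // For input without a \E sequence, quote() wraps the text in \Q ... \E.
        System.out.println(Pattern.quote(orig)); // \QTotal (USD): 5.00 {net}\E

        System.out.println(matchesLiterally(orig, orig));                      // true
        // The dot is literal inside the quoted region, so 'x' does not match it.
        System.out.println(matchesLiterally(orig, "Total (USD): 5x00 {net}")); // false
    }
}
```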
One thing I would change in your implementation is the replacement algorithm: rather than replacing character by character, I would replace digits in groups. This would let an expression produced from "Page 3 of 23" match strings like "Page 13 of 23" and "Page 6 of 8".
String p = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");
This would produce "\QPage \E\d+\Q of \E\d+\Q\E" no matter what page numbers and counts were there originally. The output needs only one backslash in \d, not two, because the result is fed directly to the regex engine, bypassing the Java compiler.
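Putting the pieces together, a minimal runnable sketch might look like the following. The class name DigitTemplate and the helper fromExample are hypothetical names chosen for this example. It also uses Matcher.quoteReplacement as one way to avoid writing four backslashes per escape in the replacement string; that is a substitution on my part and should produce the same result as the one-liner above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitTemplate {
    // Hypothetical helper: quote the example text, then replace each run of
    // digits with \d+ placed outside the quoted region (\E ... \Q).
    static Pattern fromExample(String orig) {
        // quoteReplacement escapes the backslashes so that replaceAll
        // inserts \E\d+\Q verbatim.
        String replacement = Matcher.quoteReplacement("\\E\\d+\\Q");
        String regex = Pattern.quote(orig).replaceAll("\\d+", replacement);
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = fromExample("Page (3) of 23");
        System.out.println(p.pattern()); // \QPage (\E\d+\Q) of \E\d+\Q\E

        // Parentheses are now literal, and each number may have any digit count.
        System.out.println(p.matcher("Page (13) of 23").matches()); // true
        System.out.println(p.matcher("Page 13 of 23").matches());   // false
    }
}
```

Note that \d+ also lets the generated pattern match numbers with a different number of digits than the example, which may or may not be what you want for a given input.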