Generating a regular expression from a string

Time: 2022-07-08 21:14:03

I wish to generate a regular expression from a string containing numbers, and then use this as a Pattern to search for similar strings. Example:

String s = "Page 3 of 23";

If I replace every digit with \d:

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isDigit(c)) {
            sb.append("\\d"); // backslash d
        } else {
            sb.append(c);
        }
    }

    Pattern numberPattern = Pattern.compile(sb.toString());

    // equivalent to: Pattern numberPattern = Pattern.compile("Page \\d of \\d\\d");

I can use this to match similar strings (e.g. "Page 7 of 47"). My problem is that if I do this naively some of the metacharacters such as (){}-, etc. will not be escaped. Is there a library to do this or an exhaustive set of characters for regular expressions which I must and must not escape? (I can try to extract them from the Javadocs but am worried about missing something).

Alternatively, is there a library which already does this? (At this stage I don't want to use a full Natural Language Processing solution.)

NOTE: @dasblinkenlight's edited answer now works for me!

1 solution

#1


10  

Java's regexp library provides this functionality:

String s = Pattern.quote(orig);

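The "quoted" string is wrapped in \Q...\E, so every metacharacter in it is treated as a literal. First quote your string, then go through it and replace the digits with \d to make a regular expression. Because the regex library uses \Q and \E for quoting, each piece of real regex you insert must be enclosed in the inverse pair, \E before it and \Q after it, so that the quoted section is closed before the \d and reopened afterwards.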
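
As a rough illustration of what quoting does, the sketch below prints the quoted form of a string full of metacharacters and checks that it only matches the same text literally. The class name QuoteDemo, the helper matchesLiterally and the sample string are made up for this example; only Pattern.quote and Pattern.matches come from java.util.regex.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Hypothetical helper: true if `candidate` is exactly the literal text `literal`.
    static boolean matchesLiterally(String literal, String candidate) {
        return Pattern.matches(Pattern.quote(literal), candidate);
    }

    public static void main(String[] args) {
        String orig = "Total (a+b) {x}";

        // Pattern.quote wraps the text in \Q ... \E instead of escaping each character.
        System.out.println(Pattern.quote(orig)); // prints \QTotal (a+b) {x}\E

        // The parentheses, plus sign and braces are now plain characters.
        System.out.println(matchesLiterally(orig, "Total (a+b) {x}")); // prints true
        System.out.println(matchesLiterally(orig, "Total ab x"));      // prints false
    }
}
```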

One thing I would change in your implementation is the replacement algorithm: rather than replacing digits one character at a time, I would replace each run of digits as a group. This would let an expression produced from "Page 3 of 23" match strings like "Page 13 of 23" and "Page 6 of 8".

String p = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");

This would produce "\QPage \E\d+\Q of \E\d+\Q\E" no matter what page numbers and counts were there originally. The output needs only one backslash in \d, not two, because the resulting string is fed directly to the regex engine at runtime rather than being written as a Java string literal that the compiler would unescape.
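
Putting the pieces together, a minimal sketch might look like the following. The class name DigitPattern, the method fromSample and the second sample string are assumptions made for illustration; the quoting and replacement line is the one from the answer.

```java
import java.util.regex.Pattern;

public class DigitPattern {
    // Hypothetical helper: quote the sample, then swap each run of digits for \d+.
    static Pattern fromSample(String sample) {
        String regex = Pattern.quote(sample)
                // In the replacement string, \\\\ yields one literal backslash,
                // so each digit run becomes \E\d+\Q in the final regex.
                .replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = fromSample("Page 3 of 23");
        System.out.println(p.pattern());                          // prints \QPage \E\d+\Q of \E\d+\Q\E
        System.out.println(p.matcher("Page 13 of 23").matches()); // prints true
        System.out.println(p.matcher("Page 6 of 8").matches());   // prints true
        System.out.println(p.matcher("Page x of 8").matches());   // prints false

        // Metacharacters in the sample stay literal.
        Pattern q = fromSample("Fig. 2 (of 10)");
        System.out.println(q.matcher("Fig. 7 (of 12)").matches()); // prints true
        System.out.println(q.matcher("Figx 7 (of 12)").matches()); // prints false
    }
}
```

Use matches() when the whole input must have the same shape as the sample, or find() to search for it inside longer text.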
