How can I add features that are missing from Java's regex implementation?

Time: 2023-02-05 12:51:22

I'm new to Java. As a .Net developer, I'm very much used to the Regex class in .Net. The Java implementation of Regex (Regular Expressions) is not bad but it's missing some key features.

I wanted to create my own helper class for Java but I thought maybe there is already one available. So is there any free and easy-to-use product available for Regex in Java or should I create one myself?

If I write my own class, where do you think I should share it so that others can use it?


[Edit]

There were complaints that I wasn't addressing the problem with the current Regex class. I'll try to clarify my question.

In .Net the usage of a regular expression is easier than in Java. Since both languages are object oriented and very similar in many aspects, I expect to have a similar experience with using regex in both languages. Unfortunately that's not the case.


Here's a little code compared in Java and C#. The first is C# and the second is Java:

In C#:

string source = "The colour of my bag matches the color of my shirt!";
string pattern = "colou?r";

foreach(Match match in Regex.Matches(source, pattern))
{
    Console.WriteLine(match.Value);
}

In Java:

String source = "The colour of my bag matches the color of my shirt!";
String pattern = "colou?r";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(source);

while(m.find())
{
    System.out.println(source.substring(m.start(), m.end()));
}

I tried to be fair to both languages in the sample code above.

The first thing you notice here is the .Value member of the Match class (compared to using .start() and .end() in Java).
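
For comparison, here is a minimal sketch of the same loop using `Matcher.group()`, which returns the text of the current match directly, so the `substring(m.start(), m.end())` call is not strictly needed. The class and method names (`MatchValueSketch`, `allMatches`) are made up for illustration and are not part of the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class MatchValueSketch {
    // Collects the text of every match, using group() instead of substring(start, end).
    static List<String> allMatches(String source, String regex) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {
            found.add(m.group()); // group() with no arguments is the whole current match
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "The colour of my bag matches the color of my shirt!";
        for (String value : allMatches(source, "colou?r")) {
            System.out.println(value);
        }
    }
}
```

This still needs a `Pattern` and a `Matcher`, so it only addresses the `.Value` part of the comparison, not the two-object complaint.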

Why should I create two objects when I can call a static function like Regex.Matches or Regex.Match, etc.?

In more advanced usages, the difference shows itself much more. Look at the method Groups, dictionary length, Capture, Index, Length, Success, etc. These are all very necessary features that in my opinion should be available for Java too.
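
As a rough sketch of how some of those members map onto `java.util.regex.Matcher`: `find()` plays the role of `Success`, `start()` of `Index`, `end() - start()` of `Length`, and `group(i)` with `groupCount()` of `Groups`. The mapping is approximate and my own reading, and the names `MatchDetailsSketch` and `describeFirst` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class MatchDetailsSketch {
    // Describes the first match of regex in source, or reports that there is none.
    static String describeFirst(String source, String regex) {
        Matcher m = Pattern.compile(regex).matcher(source);
        if (!m.find()) {                        // roughly Match.Success
            return "no match";
        }
        int index = m.start();                  // roughly Match.Index
        int length = m.end() - m.start();       // roughly Match.Length
        StringBuilder sb = new StringBuilder();
        sb.append("value=").append(m.group())   // roughly Match.Value
          .append(" index=").append(index)
          .append(" length=").append(length);
        for (int i = 1; i <= m.groupCount(); i++) {   // roughly Match.Groups
            sb.append(" g").append(i).append('=').append(m.group(i));
        }
        return sb.toString();
    }
}
```

For example, `describeFirst("The colour of my bag", "col(ou?)r")` returns `value=colour index=4 length=6 g1=ou`.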

Of course all of these features can be manually added by a custom proxy (helper) class. This is the main reason why I asked this question. We don't have the breeziness of regexes in Perl, but at least we can use the .Net approach to Regex, which I think is very cleverly designed.

4 answers

#1


102  

From your edited example, I can now see what you would like. And you have my sympathies in this, too. Java’s regexes are a long, long, long ways from the convenience you find in higher level programming languages like Ruby or Perl. And they pretty much always will be; this cannot be fixed, so we’re stuck with this mess forever — at least in Java. Other JVM languages do a better job at this, especially Groovy. But they still suffer some of the inherent flaws, and can only go so far.

Where to begin? There are the so-called convenience methods of the String class: matches, replaceAll, replaceFirst, and split. These can sometimes be ok in small programs, depending how you use them. However, they do indeed have several problems, which it appears you have discovered. Here’s a partial list of those problems, and what can and cannot be done about them.

  1. The inconvenience method is very bizarrely named “matches” but it requires you to pad your regex on both sides to match the entire string. This counter-intuitive sense is contrary to any sense of the word match as used in any previous language, and constantly bites people. Patterns passed into the other 3 inconvenience methods work very unlike this one, because in the other 3, they work like normal patterns work everywhere else; just not in matches. This means you can’t just copy your patterns around, even within methods in the same darned class for goodness’ sake! And there is no find convenience method to do what every other matcher in the world does. The matches method should have been called something like FullMatch, and there should have been a PartialMatch or find method added to the String class.

  2. There is no API that allows you to pass in Pattern.compile flags along with the strings you use for the 4 pattern-related convenience methods of the String class. That means you have to rely on string versions like (?i) and (?x), but those do not exist for all possible Pattern compilation flags. This is highly inconvenient to say the least.

  3. The split method does not return the same result in edge cases as split returns in the languages that Java borrowed split from. This is a sneaky little gotcha. How many elements do you think you should get back in the return list if you split the empty string, eh? Java manufactures a fake return element where there should be none, which means you can’t distinguish between legit results and bogus ones. It is a serious design flaw that, when splitting on a ":", you cannot tell the difference between inputs of "" vs of ":". Aw, gee! Don’t people ever test this stuff? And again, the broken and fundamentally unreliable behavior is unfixable: you must never change things, even broken things. It’s not ok to break broken things in Java the way it is anywhere else. Broken is forever here.

  4. The backslash notation of regexes conflicts with the backslash notation used in strings. This makes it superduper awkward, and error-prone, too, because you have to constantly add lots of backslashes to everything, and it’s too easy to forget one and get neither warning nor success. Simple patterns like \b\w+\b become nightmares in typographical excess: "\\b\\w+\\b". Good luck with reading that. Some people use a slash-inverter function on their patterns so that they can write that as "/b/w+/b" instead. Other than reading in your patterns from a string, there is no way to construct your pattern in a WYSIWYG literal fashion; it’s always heavy-laden with backslashes. Did you get them all, and enough, and in the right places? If so, it makes it really really hard to read. If it isn’t, you probably haven’t gotten them all. At least JVM languages like Groovy have figured out the right answer here: give people 1st-class regexes so you don’t go nuts. Here’s a fair collection of Groovy regex examples showing how simple it can and should be.

  5. The (?x) mode is deeply flawed. It doesn’t take comments in the Java style of // COMMENT but rather in the shell style of # COMMENT. It doesn’t work with multiline strings. It doesn’t accept literals as literals, forcing the backslash problems listed above, which fundamentally compromises any attempt at lining things up, like having all comments begin on the same column. Because of the backslashes, you either make them begin on the same column in the source code string and screw them up if you print them out, or vice versa. So much for legibility!

  6. It is incredibly difficult (and indeed, fundamentally unfixably broken) to enter Unicode characters in a regex. There is no support for symbolically named characters like \N{QUOTATION MARK}, \N{LATIN SMALL LETTER E WITH GRAVE}, or \N{MATHEMATICAL BOLD CAPITAL C}. That means you’re stuck with unmaintainable magic numbers. And you cannot even enter them by code point, either. You cannot use \u0022 for the first one because the Java preprocessor makes that a syntax error. So then you move to \\u0022 instead, which works until you get to the next one, \\u00E8, which cannot be entered that way or it will break the CANON_EQ flag. And the last one is a pure nightmare: its code point is U+1D402, but Java does not support the full Unicode set using their code point numbers in regexes, forcing you to get out your calculator to figure out that that is \uD835\uDC02 or \\uD835\\uDC02 (but not \\uD835\uDC02), madly enough. But you cannot use those in character classes due to a design bug, making it impossible to match say, [\N{MATHEMATICAL BOLD CAPITAL A}-\N{MATHEMATICAL BOLD CAPITAL Z}] because the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java’s Unicode-in-source-code troubles by compiling with javac -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!

  7. Many of the regex things we’ve come to rely on in other languages are missing from Java. There are no named groups, for example, nor even relatively-numbered ones. This makes constructing larger patterns out of smaller ones fundamentally error prone. There is a front-end library that allows you to have simple named groups, and indeed this will finally arrive in production JDK7. But even so there is no mechanism for what to do with more than one group by the same name. And you still don’t have relatively numbered buffers, either. We’re back to the Bad Old Days again, stuff that was solved aeons ago.

  8. There is no support for a linebreak sequence, which is one of the only two “Strongly Recommended” parts of the standard, which suggests that \R be used for such. This is awkward to emulate because of its variable-length nature and Java’s lack of support for graphemes.

  9. The character class escapes do not work on Java’s native character set! Yes, that’s right: routine stuff like \w and \s (or rather, "\\w" and "\\s") does not work on Unicode in Java! This is not the cool sort of retro. To make matters worse, Java’s \b (make that "\\b", which isn’t the same as "\b") does have some Unicode sensibility, although not what the standard says it must have. So for example a string like "élève" will never in Java match the pattern \b\w+\b, and not merely in entirety per Pattern.matches, but indeed at no point whatsoever as you might get from Matcher.find. This is just so screwed up as to beggar belief. They’ve broken the inherent connection between \w and \b, then misdefined them to boot!! It doesn’t even know what Unicode Alphabetic code points are. This is supremely broken, and they can never fix it because that would change the behavior of existing code, which is strictly forbidden in the Java Universe. The best you can do is create a rewrite library that acts as a front end before it gets to the compile phase; that way you can forcibly migrate your patterns from the 1960s into the 21st century of text processing.

  10. The only two Unicode properties supported are the General Categories and the Block properties. The general category properties only support the abbreviations like \p{Sk}, contrary to the standard’s Strong Recommendation to also allow \p{Modifier Symbol}, \p{Modifier_Symbol}, etc. You don’t even get the required aliases the standard says you should. That makes your code even more unreadable and unmaintainable. You will finally get support for the Script property in production JDK7, but that is still seriously short of the minimum set of 11 essential properties that the Standard says you must provide for even the minimal level of Unicode support.

  11. Some of the meagre properties that Java does provide are faux amis: they have the same names as official Unicode property names, but they do something altogether different. For example, Unicode requires that \p{alpha} be the same as \p{Alphabetic}, but Java makes it the archaic and no-longer-quaint 7-bit alphabetics only, which is more than 4 orders of magnitude too few. Whitespace is another flaw: if you use the Java version that masquerades as Unicode whitespace, your UTF-8 parsers will break because of their NO-BREAK SPACE code points, which Unicode normatively requires be deemed whitespace; Java ignores that requirement, and so breaks your parser.

  12. There is no support for graphemes, the way \X normally provides. That renders impossible innumerably many common tasks that you need and want to do with regexes. Not only are extended grapheme clusters out of your reach, because Java supports almost none of the Unicode properties, you cannot even approximate the old legacy grapheme clusters using the standard (?:\p{Grapheme_Base}\p{Grapheme_Extend}*). Not being able to work with graphemes makes even the simplest sorts of Unicode text processing impossible. For example, you cannot match a vowel irrespective of diacritic in Java. The way you do this in a language with grapheme support varies, but at the very least you should be able to throw the thing into NFD and match (?:(?=[aeiou])\X). In Java, you cannot do even that much: graphemes are beyond your reach. And that means Java cannot even handle its own native character set. It gives you Unicode and then makes it impossible to work with it.

  13. The convenience methods in the String class do not cache the compiled regex. In fact, there is no such thing as a compile-time pattern that gets syntax-checked at compile time, which is when syntax checking is supposed to occur. That means your program, which uses nothing but constant regexes fully understood at compile time, will bomb out with an exception in the middle of its run if you forget a little backslash here or there as one is wont to do due to the flaws previously discussed. Even Groovy gets this part right. Regexes are far too high-level a construct to be dealt with by Java’s unpleasant after-the-fact, bolted-on-the-side model, and they are far too important to routine text processing to be ignored. Java is much too low-level a language for this stuff, and it fails to provide the simple mechanics out of which you might yourself build what you need: you can’t get there from here.

  14. The String and Pattern classes are marked final in Java. That completely kills any possibility of using proper OO design to extend those classes. You can’t create a better version of a matches method by subclassing and replacement. Heck, you can’t even subclass! Final is not a solution; final is a death sentence from which there is no appeal.
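
To make items 1 and 3 of the list above concrete, here is a small sketch. The class and method names (`ConvenienceMethodSketch` and friends) are invented for illustration; only `String.matches`, `String.split`, `Pattern.compile` and `Matcher.find` are real JDK calls. As far as I can tell from the documented behaviour of `String.split`, the default call drops trailing empty strings, and a negative limit keeps them.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ConvenienceMethodSketch {
    // String.matches succeeds only if the regex covers the whole string (item 1).
    static boolean wholeMatch(String s, String regex) {
        return s.matches(regex);
    }

    // A "find"-style partial match needs an explicit Pattern and Matcher.
    static boolean partialMatch(String s, String regex) {
        return Pattern.compile(regex).matcher(s).find();
    }

    // Number of elements returned by split with the default limit (item 3).
    static int splitCount(String s, String regex) {
        return s.split(regex).length;
    }

    // A negative limit keeps trailing empty strings in the result.
    static int splitCountKeepingEmpties(String s, String regex) {
        return s.split(regex, -1).length;
    }
}
```

With these, `wholeMatch("my colour", "colou?r")` is false while `partialMatch("my colour", "colou?r")` is true, and the pattern has to be padded to `.*colou?r.*` before `matches` accepts it. For split, `splitCount("", ":")` is 1 and `splitCount(":", ":")` is 0, whereas `splitCountKeepingEmpties(":", ":")` is 2.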

Finally, to show you just how brain-damaged Java’s regexes truly are, consider this multiline pattern, which shows many of the flaws already described:

   String rx =
          "(?= ^ \\p{Lu} [_\\pL\\pM\\d\\-] + $)\n"
        + "   # next is a big can't-have set    \n"
        + "(?! ^ .*                             \n"
        + "    (?: ^     \\d+              $    \n"
        + "      | ^ \\p{Lu} - \\p{Lu}     $    \n"
        + "      | Invitrogen                   \n"
        + "      | Clontech                     \n"
        + "      | L-L-X-X    # dashes ok       \n"
        + "      | Sarstedt                     \n"
        + "      | Roche                        \n"
        + "      | Beckman                      \n"
        + "      | Bayer                        \n"
        + "    )      # end alternatives        \n"
        + "    \\b    # only on a word boundary \n"
        + ")          # end negated lookahead   \n"
        ;

Do you see how unnatural that is? You have to put literal newlines in your strings; you have to use non-Java comments; you cannot make anything line up because of the extra backslashes; you have to use definitions of things that don’t work right on Unicode. There are many more problems beyond that.
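
For readers who have not used this mode, a much smaller sketch of the same style follows. It assumes the pattern is compiled with `Pattern.COMMENTS` (the flag form of `(?x)`); the names `CommentedPatternSketch`, `PART_NUMBER` and `looksLikePartNumber` are invented for illustration and the pattern is not the one from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical example, for illustration only.
final class CommentedPatternSketch {
    // In comments mode whitespace is ignored and # starts a comment that runs to the
    // end of the line, so every fragment has to end in a literal \n.
    static final Pattern PART_NUMBER = Pattern.compile(
          "^ \\p{Lu}            # starts with an uppercase letter \n"
        + "  [\\p{L}\\d-]+      # then letters, digits or dashes  \n"
        + "$                    # and nothing else                \n",
        Pattern.COMMENTS);

    static boolean looksLikePartNumber(String s) {
        return PART_NUMBER.matcher(s).find();
    }
}
```

Note how the doubled backslashes push the comments out of line with each other in the source, which is the alignment problem described above.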

Not only are there no plans to fix almost any of these grievous flaws, it is indeed impossible to fix almost any of them at all, because doing so would change old programs. Even the normal tools of OO design are forbidden to you because it’s all locked down with the finality of a death sentence, and it cannot be fixed.

So Alireza Noori, if you feel Java’s clumsy regexes are too hosed for reliable and convenient regex processing ever to be possible in Java, I cannot gainsay you. Sorry, but that’s just the way it is.

“Fixed in the Next Release!”

Just because some things can never be fixed does not mean that nothing can ever be fixed. It just has to be done very carefully. Here are the things I know of which are already fixed in current JDK7 or proposed JDK8 builds:

  1. The Unicode Script property is now supported. You may use any of the equivalent forms \p{Script=Greek}, \p{sc=Greek}, \p{IsGreek}, or \p{Greek}. This is inherently superior to the old clunky block properties. It means you can do things like [\p{Latin}\p{Common}\p{Inherited}], which is quite important.

  2. The UTF-16 bug has a workaround. You may now specify any Unicode code point by its number using the \x{⋯} notation, such as \x{1D402}. This works even inside character classes, finally allowing [\x{1D400}-\x{1D419}] to work properly. You still must double backslash it though, and it only works in regexes, not strings in general as it really ought to.

  3. Named groups are now supported via the standard notation (?<NAME>⋯) to create it and \k<NAME> to backreference it. These still contribute to numeric group numbers, too. However, you cannot get at more than one of them in the same pattern, nor can you use them for recursion.

  4. A new Pattern compile flag, Pattern.UNICODE_CHARACTER_CLASS, and its associated embeddable switch, (?U), will now swap around all the definitions of things like \w, \b, \p{alpha}, and \p{punct}, so that they now conform to the definitions of those things required by The Unicode Standard.

  5. The missing or misdefined binary properties \p{IsLowercase}, \p{IsUppercase}, and \p{IsAlphabetic} will now be supported, and these correspond to methods in the Character class. This is important because Unicode makes a significant and pervasive distinction between mere letters and cased or alphabetic code points. These key properties are among those 11 essential properties that are absolutely required for Level 1 compliance with UTS#18, “Unicode Regular Expressions”, without which you really cannot work with Unicode.
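
A short sketch exercising items 1 to 4 of this list on a JDK 7 or later runtime. The class and method names are hypothetical; note that in the released JDK the flag constant is spelled `Pattern.UNICODE_CHARACTER_CLASS`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class Jdk7RegexSketch {
    // Item 1: the Script property, here in its \p{IsGreek} spelling.
    static boolean isGreek(String s) {
        return Pattern.matches("\\p{IsGreek}+", s);
    }

    // Item 2: a supplementary code point range inside a character class.
    static boolean isMathBoldCapital(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("[\\x{1D400}-\\x{1D419}]", s);
    }

    // Item 3: a named group, read back by name; returns null when nothing matches.
    static String yearOf(String s) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher(s);
        return m.find() ? m.group("year") : null;
    }

    // Item 4: whether \b\w+\b covers the whole input, with or without the Unicode flag.
    static boolean isSingleWord(String s, boolean unicodeClasses) {
        int flags = unicodeClasses ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\b\\w+\\b", flags).matcher(s).matches();
    }
}
```

With this, `isSingleWord("élève", false)` is false and `isSingleWord("élève", true)` is true, which is the behaviour change the flag is meant to provide.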

These enhancements and fixes are very important to finally have, and so I am glad, even excited, to have them.

But for industrial-strength, state-of-the-art regex and/or Unicode work, I will not be using Java. There’s just too much missing from Java’s still-patchy-after-20-years Unicode model to get real work done if you dare to use the character set that Java gives. And the bolted-on-the-side model never works, which is all Java regexes are. You have to start over from first principles, the way Groovy did.

Sure, it might work for very limited applications whose small customer base is limited to English-language monoglots in rural Iowa with no external interactions or any need for characters beyond what an old-style telegraph could send. But for how many projects is that really true? Fewer even than you think, it turns out.

It is for this reason that a certain (and obvious) multi-billion-dollar company just recently cancelled international deployment of an important application. Java’s Unicode support (not just in regexes, but throughout) proved to be too weak for the needed internationalization to be done reliably in Java. Because of this, they have been forced to scale back from their originally planned worldwide deployment to a merely U.S. deployment. It’s positively parochial. And no, they are Nᴏᴛ Hᴀᴘᴘʏ; would you be?

Java has had 20 years to get it right, and they demonstrably have not done so thus far, so I wouldn’t hold my breath. Or throw good money after bad; the lesson here is to ignore the hype and instead apply due diligence to make very sure that all the necessary infrastructure support is there before you invest too much. Otherwise you too may get stuck without any real options once you’re too far into it to rescue your project.

Caveat Emptor

#2


7  

One can rant, or one can simply write:

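import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {

    /**
     * @param source
     *        the string to scan
     * @param pattern
     *        the regular expression to scan for
     * @return the matched substrings, in order of appearance
     */
    public static Iterable<String> matches(final String source, final String pattern) {
        final Pattern p = Pattern.compile(pattern);
        return new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                // A fresh Matcher per iterator, so the Iterable can be traversed more than once.
                final Matcher m = p.matcher(source);
                return new Iterator<String>() {
                    private boolean fetched = false; // find() already called for the upcoming element
                    private boolean found = false;   // result of that find() call
                    @Override
                    public boolean hasNext() {
                        // Call find() at most once per element, so repeated hasNext() calls skip nothing.
                        if (!fetched) {
                            found = m.find();
                            fetched = true;
                        }
                        return found;
                    }
                    @Override
                    public String next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        fetched = false;
                        return m.group();
                    }
                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }

}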

Used as you wish:

import org.junit.Test; // assumes JUnit 4 on the classpath

public class RegexTest {

    @Test
    public void test() {
       String source = "The colour of my bag matches the color of my shirt!";
       String pattern = "colou?r";
       for (String match : Regex.matches(source, pattern)) {
           System.out.println(match);
       }
    }
}

#3


1  

Boy, do I hear you on that one, Alireza! Regexes are confusing enough without there being so many syntax variations among them. I too do a lot more C# than Java programming and had the same issue.

I found this to be very helpful: http://www.tusker.org/regex/regex_benchmark.html - it's a list of alternative regular expression implementations for Java, benchmarked.

#4


0  

Some of the API flaws mentioned in @tchrist's answer were fixed in Kotlin.

#1


102  

From your edited example, I can now see what you would like. And you have my sympathies in this, too. Java’s regexes are a long, long, long ways from the convenience you find in higher level programming languages like Ruby or Perl. And they pretty much always will be; this cannot be fixed, so we’re stuck with this mess forever — at least in Java. Other JVM languages do a better job at this, especially Groovy. But they still suffer some of the inherent flaws, and can only go so far.

从您编辑的示例中,我可以看到您想要什么。我也很同情你。Java的regex是一种很长的、很长的方法,从您在高级编程语言(如Ruby或Perl)中找到的方便。它们几乎总是;这是无法修复的,因此我们将永远无法摆脱这种混乱——至少在Java中是这样。其他JVM语言在这方面做得更好,特别是Groovy。但它们仍有一些固有的缺陷,只能走这么远。

Where to begin? There are the so-called convenience methods of the String class: matches, replaceAll, replaceFirst, and split. These can sometimes be ok in small programs, depending how you use them. However, they do indeed have several problems, which it appears you have discovered. Here’s a partial list of those problems, and what can and cannot be done about them.

从哪里开始呢?字符串类有所谓的方便方法:match、replaceAll、replaceFirst和split。这些在小程序中有时是可以的,这取决于您如何使用它们。然而,他们确实有几个问题,似乎你已经发现了。以下是这些问题的部分列表,以及可以做什么和不能做什么。

  1. The inconvenience method is very bizarrely named “matches” but it requires you to pad your regex on both sides to match the entire string. This counter-intuitive sense is contrary to any sense of the word match as used in any previous language, and constantly bites people. Patterns passed into the other 3 inconvenience methods work very unlike this one, because in the other 3, they work like normal patterns work everywhere else; just not in matches. This means you can’t just copy your patterns around, even within methods in the same darned class for goodness’ sake! And there is no find convenience method to do what every other matcher in the world does. The matches method should have been called something like FullMatch, and there should have been a PartialMatch or find method added to the String class.

    不方便的方法非常奇怪地命名为“matches”,但它要求您在两边填充regex以匹配整个字符串。这种反直觉的感觉与以往任何一种语言中使用的match(匹配)一词都是相反的,而且会不断地咬人。传递给其他3个不便方法的模式与此方法非常不同,因为在另外3个方法中,它们的工作方式与其他任何地方的正常模式一样;不匹配。这就意味着你不能只是在同一个该死的类里复制你的模式。世界上所有其他的matcher都没有找到方便的方法去做。match方法应该被称为FullMatch,并且应该有一个PartialMatch或find方法添加到String类中。

  2. There is no API that allows you to pass in Pattern.compile flags along with the strings you use for the 4 pattern-related convenience methods of the String class. That means you havce to rely on string versions like (?i) and (?x), but those do not exist for all possible Pattern compilation flags. This is highly inconvenient to say the least.

    没有API允许您通过Pattern.compile标志,以及您使用的4个与模式相关的字符串类的便利方法。这意味着您必须依赖(?i)和(?x)这样的字符串版本,但这些并不存在于所有可能的模式编译标志。这至少可以说是极不方便的。

  3. The split method does not return the same result in edge cases as split returns in the languages that Java borrowed split from. This is a sneaky little gotcha. How many elements do you think you should get back in the return list if you split the empty string, eh? Java manufacturers a fake return element where there should be one, which means you can’t distinguish between legit results and bogus ones. It is a serious design flaw that splitting on a ":", you cannot tell the difference between inputs of "" vs of ":". Aw, gee! Don’t people ever test this stuff? And again, the broken and fundamentally unreliable behavior is unfixable: you must never change things, even broken things. It’s not ok to break broken things in Java the wayt it is anywhere else. Broken is forever here.

    分割方法不返回与Java所借用的语言中分离的语言相同的结果。这是一个狡猾的小陷阱。如果你把空字符串分开,你认为你应该回到返回列表中有多少元素?Java生成了一个虚假的返回元素,这意味着您无法区分合法的结果和虚假的结果。分割成“:”是一个严重的设计缺陷,您无法区分“”与“:”的输入之间的区别。噢,天啊!人们从来没有测试过这个东西吗?再一次,这个坏的,根本不可靠的行为是不可修正的:你永远不能改变事情,甚至是坏的事情。在Java中不可以像在其他地方那样破坏东西。这里永远破碎。

  4. The backslash notation of regexes conflicts with the backslash notation used in strings. This makes it superduper awkward, and error-prone, too, because you have to constantly add lots of backslashes to everything, and it’s too easy to forget one and get neither warning nor success. Simple patterns like \b\w+\b become nightmares in typographical excess: "\\b\\w+\\b". Good luck with reading that. Some people use a slash-inverter function on their patterns so that they can write that as "/b/w+/b" instead. Other than reading in your patterns from a string, there is no way to construct your pattern in a WYSIWYG literal fashion; it’s always heavy-laden with backslashes. Did you get them all, and enough, and in the right places? If so, it makes it really really hard to read. If it isn’t, you probably haven’t gotten them all. At least JVM languages like Groovy have figured out the right answer here: give people 1st-class regexes so you don’t go nuts. Here’s a fair collection of Groovy regex examples showing how simple it can and should be.

  5. The (?x) mode is deeply flawed. It doesn’t take comments in the Java style of // COMMENT but rather in the shell style of # COMMENT. It doesn’t work with multiline strings. It doesn’t accept literals as literals, forcing the backslash problems listed above, which fundamentally compromises any attempt at lining things up, like having all comments begin on the same column. Because of the backslashes, you either make them begin on the same column in the source code string and screw them up if you print them out, or vice versa. So much for legibility!

  6. It is incredibly difficult — and indeed, fundamentally unfixably broken — to enter Unicode characters in a regex. There is no support for symbolically named characters like \N{QUOTATION MARK}, \N{LATIN SMALL LETTER E WITH GRAVE}, or \N{MATHEMATICAL BOLD CAPITAL C}. That means you’re stuck with unmaintainable magic numbers. And you cannot even enter them by code point, either. You cannot use \u0022 for the first one because the Java preprocessor makes that a syntax error. So then you move to \\u0022 instead, which works until you get to the next one, \\u00E8, which cannot be entered that way or it will break the CANON_EQ flag. And the last one is a pure nightmare: its code point is U+1D402, but Java does not support the full Unicode set using their code point numbers in regexes, forcing you to get out your calculator to figure out that that is \uD835\uDC02 or \\uD835\\uDC02 (but not \\uD835\uDC02), madly enough. But you cannot use those in character classes due to a design bug, making it impossible to match say, [\N{MATHEMATICAL BOLD CAPITAL A}-\N{MATHEMATICAL BOLD CAPITAL Z}] because the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java’s Unicode-in-source-code troubles by compiling with java -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!

  7. Many of the regex things we’ve come to rely on in other languages are missing from Java. There are no named groups, for example, nor even relatively-numbered ones. This makes constructing larger patterns out of smaller ones fundamentally error prone. There is a front-end library that allows you to have simple named groups, and indeed this will finally arrive in production JDK7. But even so there is no mechanism for what to do with more than one group by the same name. And you still don’t have relatively numbered buffers, either. We’re back to the Bad Old Days again, stuff that was solved aeons ago.

  8. There is no support for a linebreak sequence, which is one of the only two “Strongly Recommended” parts of the standard, which suggests that \R be used for such. This is awkward to emulate because of its variable-length nature and Java’s lack of support for graphemes.

  9. The character class escapes do not work on Java’s native character set! Yes, that’s right: routine stuff like \w and \s (or rather, "\\w" and "\\b") does not work on Unicode in Java! This is not the cool sort of retro. To make matters worse, Java’s \b (make that "\\b", which isn’t the same as "\b") does have some Unicode sensibility, although not what the standard says it must have. So for example a string like "élève" will never in Java match the pattern \b\w+\b, and not merely in entirety per Pattern.matches, but indeed at no point whatsoever as you might get from Matcher.find. This is just so screwed up as to beggar belief. They’ve broken the inherent connection between \w and \b, then misdefined them to boot!! It doesn’t even know what Unicode Alphabetic code points are. This is supremely broken, and they can never fix it because that would change the behavior of existing code, which is strictly forbidden in the Java Universe. The best you can do is create a rewrite library that acts as a front end before it gets to the compile phase; that way you can forcibly migrate your patterns from the 1960s into the 21st century of text processing.

  10. The only two Unicode properties supported are the General Categories and the Block properties. The general category properties only support the abbreviations like \p{Sk}, contrary to the standard’s Strong Recommendation to also allow \p{Modifier Symbol}, \p{Modifier_Symbol}, etc. You don’t even get the required aliases the standard says you should. That makes your code even more unreadable and unmaintainable. You will finally get support for the Script property in production JDK7, but that is still seriously short of the minimum set of 11 essential properties that the Standard says you must provide for even the minimal level of Unicode support.

  11. Some of the meagre properties that Java does provide are faux amis: they have the same names as official Unicode property names, but they do something altogether different. For example, Unicode requires that \p{alpha} be the same as \p{Alphabetic}, but Java makes it the archaic and no-longer-quaint 7-bit alphabetics only, which is more than 4 orders of magnitude too few. Whitespace is another flaw: if you use the Java version that masquerades as Unicode whitespace, your UTF-8 parsers will break because of their NO-BREAK SPACE code points, which Unicode normatively requires be deemed whitespace, but Java ignores that requirement, so it breaks your parser.

  12. There is no support for graphemes, the way \X normally provides. That renders impossible innumerably many common tasks that you need and want to do with regexes. Not only are extended grapheme clusters out of your reach, because Java supports almost none of the Unicode properties, you cannot even approximate the old legacy grapheme clusters using the standard (?:\p{Grapheme_Base}\p{Grapheme_Extend}*). Not being able to work with graphemes makes even the simplest sorts of Unicode text processing impossible. For example, you cannot match a vowel irrespective of diacritic in Java. The way you do this in a language with grapheme support varies, but at the very least you should be able to throw the thing into NFD and match (?:(?=[aeiou])\X). In Java, you cannot do even that much: graphemes are beyond your reach. And that means Java cannot even handle its own native character set. It gives you Unicode and then makes it impossible to work with it.

  13. The convenience methods in the String class do not cache the compiled regex. In fact, there is no such thing as a compile-time pattern that gets syntax-checked at compile time, which is when syntax checking is supposed to occur. That means your program, which uses nothing but constant regexes fully understood at compile time, will bomb out with an exception in the middle of its run if you forget a little backslash here or there as one is wont to do due to the flaws previously discussed. Even Groovy gets this part right. Regexes are far too high-level a construct to be dealt with by Java’s unpleasant after-the-fact, bolted-on-the-side model, and they are far too important to routine text processing to be ignored. Java is much too low-level a language for this stuff, and it fails to provide the simple mechanics out of which you might yourself build what you need: you can’t get there from here.

  14. The String and Pattern classes are marked final in Java. That completely kills any possibility of using proper OO design to extend those classes. You can’t create a better version of a matches method by subclassing and replacement. Heck, you can’t even subclass! Final is not a solution; final is a death sentence from which there is no appeal.

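Two of the points in this list (the doubled backslashes of item 4 and the uncached convenience methods of item 13) can be made concrete with a minimal sketch. The class name `WordPatterns` and its members are hypothetical, invented here for illustration; only `Pattern.compile`, `Pattern.quote`, and `Matcher.find` come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and methods are illustrative only.
final class WordPatterns {

    // The regex \b\w+\b: every backslash is doubled inside a Java string literal.
    static final String WORD_REGEX = "\\b\\w+\\b";

    // Compiled once at class initialization. String.matches and String.replaceAll
    // compile their regex argument again on every call.
    static final Pattern WORD = Pattern.compile(WORD_REGEX);

    // Pattern.quote makes arbitrary text match literally, with no manual escaping.
    static Pattern literal(String text) {
        return Pattern.compile(Pattern.quote(text));
    }

    // Counts the matches of WORD in the input by repeated Matcher.find calls.
    static int countWords(String input) {
        Matcher m = WORD.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

A malformed pattern in the static field still fails only at run time, with a `PatternSyntaxException` when the class is initialized, so this softens the complaint rather than answering it.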

Finally, to show you just how brain-damaged Java’s regexes truly are, consider this multiline pattern, which shows many of the flaws already described:

   // Compile with Pattern.COMMENTS (or prefix "(?x)") so whitespace and # comments are ignored.
   String rx =
          "(?= ^ \\p{Lu} [_\\pL\\pM\\d\\-] + $)\n"
        + "   # next is a big can't-have set    \n"
        + "(?! ^ .*                             \n"
        + "    (?: ^     \\d+              $    \n"
        + "      | ^ \\p{Lu} - \\p{Lu}     $    \n"
        + "      | Invitrogen                   \n"
        + "      | Clontech                     \n"
        + "      | L-L-X-X    # dashes ok       \n"
        + "      | Sarstedt                     \n"
        + "      | Roche                        \n"
        + "      | Beckman                      \n"
        + "      | Bayer                        \n"
        + "    )      # end alternatives        \n"
        + "    \\b    # only on a word boundary \n"
        + ")          # end negated lookahead   \n"
        ;

Do you see how unnatural that is? You have to put literal newlines in your strings; you have to use non-Java comments; you cannot make anything line up because of the extra backslashes; you have to use definitions of things that don’t work right on Unicode. There are many more problems beyond that.

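For comparison, here is a much smaller commented pattern that is actually compiled and used. It is only a sketch: the class `CommentedPattern`, the method `isCatalogId`, and the rule it checks (one uppercase letter followed by letters, marks, digits, underscores, or dashes) are assumptions made up for illustration, loosely modelled on the first lookahead above.

```java
import java.util.regex.Pattern;

// Hypothetical example class; names and the matching rule are illustrative only.
final class CommentedPattern {

    // Pattern.COMMENTS makes the compiler skip whitespace and treat
    // everything from # to the end of the line as a comment.
    static final Pattern CATALOG_ID = Pattern.compile(
          "^ \\p{Lu}                 # starts with one uppercase letter   \n"
        + "  [_\\pL\\pM\\d\\-] +     # then letters, marks, digits, _ or -\n"
        + "$                         # and nothing else                   \n",
        Pattern.COMMENTS);

    static boolean isCatalogId(String s) {
        return CATALOG_ID.matcher(s).matches();
    }
}
```

The newline at the end of each string fragment matters, because without it the first `#` comment would swallow the rest of the pattern.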

Not only are there no plans to fix almost any of these grievous flaws, it is indeed impossible to fix almost any of them at all, because doing so would change old programs. Even the normal tools of OO design are forbidden to you because it’s all locked down with the finality of a death sentence, and it cannot be fixed.

So Alireza Noori, if you feel Java’s clumsy regexes are too hosed for reliable and convenient regex processing ever to be possible in Java, I cannot gainsay you. Sorry, but that’s just the way it is.

“Fixed in the Next Release!”

Just because some things can never be fixed does not mean that nothing can ever be fixed. It just has to be done very carefully. Here are the things I know of which are already fixed in current JDK7 or proposed JDK8 builds:

  1. The Unicode Script property is now supported. You may use any of the equivalent forms \p{Script=Greek}, \p{sc=Greek}, \p{IsGreek}, or \p{Greek}. This is inherently superior to the old clunky block properties. It means you can do things like [\p{Latin}\p{Common}\p{Inherited}], which is quite important.

  2. The UTF-16 bug has a workaround. You may now specify any Unicode code point by its number using the \x{⋯} notation, such as \x{1D402}. This works even inside character classes, finally allowing [\x{1D400}-\x{1D419}] to work properly. You still must double backslash it though, and it only works in regexes, not strings in general as it really ought to.

  3. Named groups are now supported via the standard notation (?<NAME>⋯) to create it and \k<NAME> to backreference it. These still contribute to numeric group numbers, too. However, you cannot get at more than one of them in the same pattern, nor can you use them for recursion.

  4. A new Pattern compile flag, Pattern.UNICODE_CHARACTER_CLASS and associated embeddable switch, (?U), will now swap around all the definitions of things like \w, \b, \p{alpha}, and \p{punct}, so that they now conform to the definitions of those things required by The Unicode Standard.

  5. The missing or misdefined binary properties \p{IsLowercase}, \p{IsUppercase}, and \p{IsAlphabetic} will now be supported, and these correspond to methods in the Character class. This is important because Unicode makes a significant and pervasive distinction between mere letters and cased or alphabetic code points. These key properties are among those 11 essential properties that are absolutely required for Level 1 compliance with UTS#18, “Unicode Regular Expressions”, without which you really cannot work with Unicode.

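A few of the fixes listed above can be tried out with a short sketch, assuming Java 7 or later. The class and method names are hypothetical. The flag is spelled `Pattern.UNICODE_CHARACTER_CLASS` in the released JDK, and the script property is written here in the `\p{IsGreek}` form, which is one of the spellings documented for `java.util.regex.Pattern`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only. Requires Java 7+.
final class Jdk7RegexFeatures {

    // Script property: \p{IsGreek} matches any code point whose script is Greek.
    static boolean isGreekWord(String s) {
        return Pattern.compile("\\p{IsGreek}+").matcher(s).matches();
    }

    // \x{...} names a code point by number, including ones beyond U+FFFF,
    // and it can serve as a range endpoint inside a character class.
    static boolean isMathBoldCapital(String s) {
        return Pattern.compile("[\\x{1D400}-\\x{1D419}]").matcher(s).matches();
    }

    // Named group: (?<year>...) is read back with Matcher.group("year").
    static String yearOf(String isoDate) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher(isoDate);
        return m.find() ? m.group("year") : null;
    }

    // With UNICODE_CHARACTER_CLASS, \w and \b follow the Unicode definitions.
    static boolean isUnicodeWord(String s) {
        return Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s).matches();
    }
}
```

With the flag set, `isUnicodeWord("élève")` returns true, which is the case item 9 of the earlier list complains about.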

These enhancements and fixes are very important to finally have, and so I am glad, even excited, to have them.

But for industrial-strength, state-of-the-art regex and/or Unicode work, I will not be using Java. There’s just too much missing from Java’s still-patchy-after-20-years Unicode model to get real work done if you dare to use the character set that Java gives. And the bolted-on-the-side model never works, which is all Java regexes are. You have to start over from first principles, the way Groovy did.

Sure, it might work for very limited applications whose small customer base is limited to English-language monoglots in rural Iowa with no external interactions or any need for characters beyond what an old-style telegraph could send. But for how many projects is that really true? Fewer even than you think, it turns out.

It is for this reason that a certain (and obvious) multi-billion-dollar company just recently cancelled international deployment of an important application. Java’s Unicode support (not just in regexes, but throughout) proved to be too weak for the needed internationalization to be done reliably in Java. Because of this, they have been forced to scale back from their originally planned worldwide deployment to a merely U.S. deployment. It’s positively parochial. And no, they are Nᴏᴛ Hᴀᴘᴘʏ; would you be?

Java has had 20 years to get it right, and they demonstrably have not done so thus far, so I wouldn’t hold my breath. Or throw good money after bad; the lesson here is to ignore the hype and instead apply due diligence to make very sure that all the necessary infrastructure support is there before you invest too much. Otherwise you too may get stuck without any real options once you’re too far into it to rescue your project.

Caveat Emptor

#2


7  

One can rant, or one can simply write:

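import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {

    /**
     * @param source
     *        the string to scan
     * @param pattern
     *        the regular expression to scan for
     * @return the matched substrings, in order of occurrence
     */
    public static Iterable<String> matches(final String source, final String pattern) {
        final Pattern p = Pattern.compile(pattern);
        return new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                // A fresh Matcher per iterator, so the Iterable can be traversed more than once.
                final Matcher m = p.matcher(source);
                return new Iterator<String>() {
                    // Match found by hasNext() but not yet handed out by next().
                    private String pending;

                    @Override
                    public boolean hasNext() {
                        // Advance only when nothing is buffered, so repeated calls skip no match.
                        if (pending == null && m.find()) {
                            pending = m.group();
                        }
                        return pending != null;
                    }
                    @Override
                    public String next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        final String result = pending;
                        pending = null;
                        return result;
                    }
                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }

}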

Used as you wish:

import org.junit.Test; // assumes JUnit 4 on the classpath

public class RegexTest {

    @Test
    public void test() {
       String source = "The colour of my bag matches the color of my shirt!";
       String pattern = "colou?r";
       for (String match : Regex.matches(source, pattern)) {
           System.out.println(match);
       }
    }
}
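
On Java 9 or later, a similar result can be had without a hand-written iterator, because `Matcher.results()` returns a stream of `MatchResult` objects. The sketch below assumes such a JDK; the class name `MatchLists` and the method `allMatches` are hypothetical.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper; requires Java 9+ for Matcher.results().
final class MatchLists {

    // Collects the text of every match of regex in source, in order.
    static List<String> allMatches(String source, String regex) {
        return Pattern.compile(regex)
                      .matcher(source)
                      .results()                 // Stream<MatchResult>
                      .map(MatchResult::group)   // the matched text of each result
                      .collect(Collectors.toList());
    }
}
```

For the sample sentence from the question, `allMatches(source, "colou?r")` yields the two strings `colour` and `color`.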

#3


1  

Boy, do I hear you on that one, Alireza! Regexes are confusing enough without there being so many syntax variations among them. I too do a lot more C# than Java programming and had the same issue.

I found this to be very helpful: http://www.tusker.org/regex/regex_benchmark.html - it's a list of alternative regular expression implementations for Java, benchmarked.

#4


0  

Some of the API flaws mentioned in @tchrist's answer were fixed in Kotlin.
