Can this regular expression be further optimized?

I wrote this regex to parse entries from srt files.

(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$

I don't know if it matters, but this is done in Scala (Java regex engine, but with raw string literals so that I don't have to double the backslashes).

The \s{1,2} is used because some files will only have line feeds (\n) and others will have carriage returns and line feeds (\r\n). The leading (?s) enables DOTALL mode so that the third capturing group can also match line breaks.

My program basically splits an srt file using \n\r?\n as a delimiter and uses Scala's nice pattern matching feature to read each entry for further processing:

val EntryRegex = """(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$""".r

def apply(string: String): Entry = string match {
  case EntryRegex(start, end, text) => Entry(0, timeFormat.parse(start),
    timeFormat.parse(end), text);
}

Sample entries:

One line:

1073
01:46:43,024 --> 01:46:45,015
I am your father.

Two Lines:

160
00:20:16,400 --> 00:20:19,312
<i>Help me, Obi-Wan Kenobi.
You're my only hope.</i>

The thing is, the profiler shows me that this parsing method is by far the most time consuming operation in my application (which does intensive time math and can even reencode the file several times faster than what it takes to read and parse the entries).

Can any regex wizards help me optimize it? Or maybe I should sacrifice regex / pattern matching succinctness and try an old-school java.util.Scanner approach?

Cheers,

4 answers

#1

I'm not optimistic, but here are two things to try:

Move the (?s) to just before the point where you need it.

Remove the \r?$ and use a possessive .++ for the text instead of the greedy .+

^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(?s)(.++)$

To really get good performance, I would refactor the code and regex to use findAllIn. The current code is doing a regex for every Entry in your file. I imagine the single findAllIn regex would perform better...But maybe not...
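
As a rough sketch of what the findAllIn idea could look like: the pattern below is not from the question, it is a hypothetical variant that requires every text line to be non-empty, so a blank line ends the entry and the whole file can be scanned in one pass. The names FindAllSketch and entries are made up for illustration.

```scala
import scala.util.matching.Regex

object FindAllSketch {
  // Hypothetical pattern (an assumption, not the asker's regex):
  // without (?s), `.` does not match line terminators, so `.+` covers exactly
  // one non-empty line and a blank line stops the text group.
  val EntryRegex: Regex =
    """(?m)^\d+\r?\n(.{12}) --> (.{12})\r?\n(.+(?:\r?\n.+)*)""".r

  // Returns (start, end, text) for every entry found in the whole file content.
  def entries(content: String): List[(String, String, String)] =
    EntryRegex.findAllMatchIn(content)
      .map(m => (m.group(1), m.group(2), m.group(3)))
      .toList
}
```

Here findAllMatchIn is used rather than findAllIn only because it hands back the capture groups directly; whether this is actually faster than split-then-match would have to be measured.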

#2

(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$

In Java, $ means the end of input or the beginning of a line-break immediately preceding the end of input. \z means unambiguously end of input, so if that is also the semantics in Scala, then \r?$ is redundant and $ would do just as well. If you really only want a CR at the end and not CRLF then \r?\z might be better.

With (?s) in effect, the \r? after (.+) is redundant as well: since the + is greedy, the . will always expand to include the \r. If you do not want the \r included in that third capturing group, make the match lazy: (.+?) instead of (.+).

Maybe

(?s)^\d++\s\s?(.{12}) --> (.{12})\s\s?(.+?)\r?\z
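
The two points above can be checked with a small sketch. The object and method names are made up for illustration, and the patterns are cut-down versions of the ones discussed, not the full entry regex.

```scala
object AnchorSketch {
  // `$` (outside MULTILINE mode) also matches just before a final line
  // terminator, while `\z` matches only at the very end of the input.
  def dollarMatches(s: String): Boolean = """abc$""".r.findFirstIn(s).isDefined
  def zMatches(s: String): Boolean = """abc\z""".r.findFirstIn(s).isDefined

  private val Greedy = """(?s)(.+)\r?$""".r
  private val Lazy = """(?s)(.+?)\r?\z""".r

  // Greedy group under DOTALL: the trailing \r ends up inside the capture.
  def greedyText(s: String): Option[String] = s match {
    case Greedy(t) => Some(t)
    case _         => None
  }

  // Lazy group: the optional \r is left for `\r?` to consume.
  def lazyText(s: String): Option[String] = s match {
    case Lazy(t) => Some(t)
    case _       => None
  }
}
```

For an input such as "line\r", greedyText keeps the carriage return in the captured text and lazyText drops it.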

Other fine high-performance alternatives to regular expressions that run inside a JVM and/or the CLR include JavaCC and ANTLR. For a Scala-only solution, see http://jim-mcbeath.blogspot.com/2008/09/scala-parser-combinators.html

#3

Check this out:

(?m)^\d++\r?+\n(.{12}) --> (.{12})\r?+\n(.++(?>\r?+\n.++)*+)$

This regex matches a complete .srt file entry in place. You don't have to split the contents up on line breaks first; that's a huge waste of resources.

The regex takes advantage of the fact that there's exactly one line separator (\n or \r\n) separating the lines within an entry (multiple line separators are used to separate entries from each other). Using \r?+\n instead of \s{1,2} means you can never accidentally match two line separators (\n\n) when you only wanted to match one.
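
A minimal sketch of the difference between the two separator patterns (SeparatorSketch and its method names are hypothetical, just for the demonstration):

```scala
object SeparatorSketch {
  private val Loose = """\s{1,2}""".r   // separator as written in the question
  private val Strict = """\r?+\n""".r   // exactly one \n or \r\n

  // What each pattern consumes at the start of the given string.
  def looseSeparator(s: String): Option[String] = Loose.findPrefixOf(s)
  def strictSeparator(s: String): Option[String] = Strict.findPrefixOf(s)
}
```

Given two consecutive line feeds, the \s{1,2} version swallows both, while \r?+\n stops after the first one.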

This way, too, you don't have to rely on the . in (?s) mode. @Jacob was right about that: it's not really helping you, and it's killing your performance. But (?m) mode is helpful, for correctness as well as performance.

You mentioned java.util.Scanner; this regex would work very nicely with findWithinHorizon(0). But I'd be surprised if Scala doesn't offer a nice, idiomatic way to use it as well.
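
Something along these lines should work from Scala, as a sketch only: it feeds the regex above to Scanner.findWithinHorizon with a horizon of 0. The ScannerSketch object and the tuple result type are assumptions for illustration, and the timestamps are left as plain strings.

```scala
import java.util.Scanner
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

object ScannerSketch {
  // The regex from this answer, compiled once and reused.
  private val EntryPattern: Pattern = Pattern.compile(
    """(?m)^\d++\r?+\n(.{12}) --> (.{12})\r?+\n(.++(?>\r?+\n.++)*+)$""")

  // Returns (start, end, text) for every entry in `content`.
  def entries(content: String): List[(String, String, String)] = {
    val scanner = new Scanner(content)
    val out = ListBuffer.empty[(String, String, String)]
    // A horizon of 0 means: search the rest of the input, ignoring delimiters.
    while (scanner.findWithinHorizon(EntryPattern, 0) != null) {
      val m = scanner.`match`() // `match` is a Scala keyword, hence the backticks
      out += ((m.group(1), m.group(2), m.group(3)))
    }
    scanner.close()
    out.toList
  }
}
```

A Scanner can also be built from a File or an InputStream, so the same loop could read straight from disk instead of from a String.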

#4

I wouldn't use java.util.Scanner or even strings. Everything you're doing will work perfectly on a byte stream as long as you can assume UTF-8 encoding of your files (or a lack of unicode). You should be able to speed things up by at least 5x.

Edit: this is just a lot of low-level fiddling with bytes and indices. Here's something based loosely on things I've done before, which seems about 2x-5x faster, depending on file size, caching, etc. I'm not doing the date parsing here, just returning strings, and I'm assuming the files are small enough to fit in a single block of memory (i.e. <2G). This is being rather pedantically careful; if you know, for example, that the date string format is always okay, then the parsing can be faster yet (just count the characters after the first line of digits).

import java.io._
abstract class Entry {
  def isDefined: Boolean
  def date1: String
  def date2: String
  def text: String
}
case class ValidEntry(date1: String, date2: String, text: String) extends Entry {
  def isDefined = true
}
object NoEntry extends Entry {
  def isDefined = false
  def date1 = ""
  def date2 = ""
  def text = ""
}

final class Seeker(f: File) {
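  // Read the whole file into memory (assumes it fits in one array, i.e. < 2 GB).
  // readAllBytes is used because a single InputStream.read call may return
  // before the array has been filled.
  private val buffer: Array[Byte] = java.nio.file.Files.readAllBytes(f.toPath)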
  private var i = 0
  private var d1,d2 = 0
  private var txt,n = 0
  def isDig(b: Byte) = ('0':Byte) <= b && ('9':Byte) >= b
  // Advance past the next '\n' (and a '\r' directly after it, if present).
  def nextNL(): Unit = {
    while (i < buffer.length && buffer(i) != '\n') i += 1
    i += 1
    if (i < buffer.length && buffer(i) == '\r') i += 1
  }
  def digits() = {
    val zero = i
    while (i < buffer.length && isDig(buffer(i))) i += 1
    if (i==zero || i >= buffer.length || buffer(i) != '\n') {
      nextNL()
      false
    }
    else {
      nextNL()
      true
    }
  }
  def dates(): Boolean = {
    if (i+30 >= buffer.length) {
      i = buffer.length
      false
    }
    else {
      d1 = i
      while (i < d1+12 && buffer(i) != '\n') i += 1
      if (i < d1+12 || buffer(i)!=' ' || buffer(i+1)!='-' || buffer(i+2)!='-' || buffer(i+3)!='>' || buffer(i+4)!=' ') {
        nextNL()
        false
      }
      else {
        i += 5
        d2 = i
        while (i < d2+12 && buffer(i) != '\n') i += 1
        if (i < d2+12 || buffer(i) != '\n') {
          nextNL()
          false
        }
        else {
          nextNL()
          true
        }
      }
    }
  }
  // Record the text span: consume non-empty lines until a blank line or end of input.
  def gatherText(): Unit = {
    txt = i
    while (i < buffer.length && buffer(i) != '\n') {
      i += 1
      nextNL()
    }
    n = i-txt
    nextNL()
  }
  def getNext: Entry = {
    while (i < buffer.length) {
      if (digits()) {
        if (dates()) {
          gatherText()
          return ValidEntry(new String(buffer,d1,12), new String(buffer,d2,12), new String(buffer,txt,n))
        }
      }
    }
    return NoEntry
  }
}

Now that you see that, aren't you glad that the regex solution was so quick to code?
