How do I specify an optional capture group in this RegEx?

How can I fix this RegEx to optionally capture a file extension?

I am trying to match a string with an optional component, but something appears to be wrong. (The strings being matched are from a printer log.)

My RegEx (.NET Flavor) is as follows:

.*(header_\d{10,11}_).*(_.*_\d{8}).*(\.\w{3,4}).*
-------------------------------------------
.*                   # Ignore some garbage in the front
(header_             # Match the start of the file name,
    \d{10,11}_)      #     including the ID (10 - 11 digits)
.*                   # Ignore the type code in the middle
(_.*_\d{8})          # Match some random characters, then an 8-digit date
.*                   # Ignore anything between this and the file extension
(\.\w{3,4})          # Match the file extension, 3 or 4 characters long
.*                   # Ignore the rest of the string

I expect this to match strings like:

str1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
str2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
str3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"

Where the capture groups return something like:

$1  =  header_0000000602_
$2  =  _mc2e1nrobr1a3s55niyrrqvy_20081212
$3  =  .doc

Where $3 can be empty if no file extension is found. $3 is the optional part, as you can see in str3 above.

If I add "?" to the end of the third capture group, "(\.\w{3,4})?", the RegEx no longer captures $3 for any string. If I add "+" instead, "(\.\w{3,4})+", the RegEx no longer matches str3 at all, which is to be expected.

I feel that using "?" at the end of the third capture group is the appropriate thing to do, but it doesn't work as I expect. I am probably being too naive with the ".*" sections that I use to ignore parts of the string.

Doesn't Work As Expected:

.*(header_\d*_).*(_.*_.{8}).*(\.\w{3,4})?.*

7 Solutions

#1

One possibility is that the second to last .* is being greedy. You might try changing it to:

.*(header_\d*_).*(_.*_.{8}).*?(\.\w{3,4})?.*
                             ^ Added that

That wasn't correct. This one will match the input you supplied, but it assumes that the first . it encounters is the start of a file extension:

.*(header_\d*_).*(_.*_.{8})[^\.]*(\.\w{3,4})?.*

Edit: Removed the escaping I had in the second regex.

#2

I believe the problem is your third .*, which you annotated above with "Ignore anything between this and the file extension". It is greedy, so it will match anything. When you make the extension pattern optional, the third .* consumes everything up to the end of the string, which is now allowed because the extension group can match nothing. Assuming there will never be a '.' character in that extraneous bit, you can replace .* with [^.]* and the rest should work once you restore the ? that you had to remove.
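
The difference can be checked with a small sketch. It uses Python's re module purely as a stand-in for the .NET engine (an assumption on my part; the greedy quantifier, the negated character class, and the optional group used here should behave the same way in both). The names BROKEN, FIXED, and extension are made up for the illustration.

```python
import re

# The asker's pattern with the optional extension group: the third .* is greedy.
BROKEN = re.compile(r".*(header_\d*_).*(_.*_.{8}).*(\.\w{3,4})?.*")
# Same pattern with the third .* replaced by [^.]* so it stops at the first period.
FIXED = re.compile(r".*(header_\d*_).*(_.*_.{8})[^.]*(\.\w{3,4})?.*")

samples = [
    "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]",
    "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt",
    "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]",
]

def extension(pattern, text):
    """Return group 3 (the extension), or None if it did not participate."""
    match = pattern.match(text)
    return match.group(3) if match else None

for text in samples:
    # BROKEN never captures the extension; FIXED captures it when present.
    print(extension(BROKEN, text), extension(FIXED, text))
```

With the greedy version, the third .* runs to the end of the string and the optional group is satisfied by matching nothing, so the engine never has a reason to backtrack and capture the extension.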

#3

Well, .* is probably the wrong way to start the regex. It is greedy: it matches zero or more (*) of any character (.), so it first consumes the entire string and then has to backtrack to find header. If you leave it off, the regex will start matching when it reaches header, which is what you want. You could also replace it with \b, which matches a word boundary. I also suggest using a tool such as The Regex Coach so you can step through the match and see exactly what's wrong and what your capture groups will be.

#4

After your second capture group, match only characters that are not a period, and then match your extension:

".*(header_\d{10,11}_).*(_.*_\d{8})[^.]*(\.\w{3,4})?"

#5

This should give you the correct result:

.*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*
-------------------------------------------
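.*?                  # Lazy match, to prevent a greedy one
(header_             # Match the start of the file name,
    \d*_)            #     including the ID digits
.*?                  # Lazy match, to prevent a greedy one
(_.*_.{8})           # Match some characters, then the 8-character date
[^.]*                # Take everything that is NOT a period
(\.\w{3,4})?         # Optionally match the extension
.*                   # Ignore the rest of the string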

The implicit assumption is that the first period after the date digits is the beginning of the file extension. The following input wouldn't meet this requirement:

string unmatched = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].foobar.txt"
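
To make the caveat concrete, here is a minimal sketch of what happens with that input. It uses Python's re module in place of .NET (my assumption; the constructs involved should behave identically), and the variable names are hypothetical. Note that the pattern still matches the string; it just captures the wrong thing as the extension.

```python
import re

# The one-line pattern from this answer, unchanged.
PATTERN = re.compile(r".*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*")

tricky = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].foobar.txt"

match = PATTERN.match(tricky)
# [^.]* stops at the first period, so the "extension" is read from ".foobar",
# truncated to four word characters, instead of from the real ".txt".
captured_ext = match.group(3)
print(captured_ext)  # .foob
```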

Also, when reading your groups in .NET, make sure your code looks like this:

regex.Match(string_to_match).Groups[1].Value
regex.Match(string_to_match).Groups[2].Value
regex.Match(string_to_match).Groups[3].Value

and not this:

// Groups[0] is the entire match (here the whole string_to_match), not the first capture group
regex.Match(string_to_match).Groups[0].Value
regex.Match(string_to_match).Groups[1].Value
regex.Match(string_to_match).Groups[2].Value

This is something that tripped me up at first.
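
The same off-by-one trap exists in most regex APIs. As a rough analogue (Python's re rather than .NET, so treat it as an illustration only), group 0 is the whole match and the capture groups start at 1. The variable names are hypothetical.

```python
import re

PATTERN = re.compile(r".*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*")
text = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"

match = PATTERN.match(text)
whole = match.group(0)   # the entire match, not a capture group
header = match.group(1)  # first capture group
middle = match.group(2)  # second capture group
ext = match.group(3)     # optional group that did not participate
print(whole == text, header, middle, ext)
```

In Python a group that did not participate comes back as None. In .NET, as far as I know, Groups[3].Success is false and Groups[3].Value is an empty string, so check Success if you need to tell "no extension" apart from an empty capture.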

#6

This works for the examples you've posted:

^.*?(?<header>\d+)_.*?_(?<date>\d{8}).*?(?:\.(?<ext>\w{3,4}))?[\w\s\[\]]*$

I'm assuming that the text "header" and the random characters between that and the date aren't important, so those aren't captured by this regex. I also used the .NET named capture feature for clarity, but be aware that not every RegEx flavor supports it, and some that do use a different syntax.

If the text after the file name contains any non-alphanumeric characters other than [ and ], the pattern will need to be revised.
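
For anyone who wants to try this outside .NET, here is a sketch of the same pattern in Python's re module. The only change is the named-group syntax, (?P<name>...) instead of (?<name>...); the helper name parse is made up for the example. Note that ext is captured without the leading period here.

```python
import re

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
NAMED = re.compile(
    r"^.*?(?P<header>\d+)_.*?_(?P<date>\d{8}).*?"
    r"(?:\.(?P<ext>\w{3,4}))?[\w\s\[\]]*$"
)

def parse(text):
    """Return the named groups as a dict, or None if the text does not match."""
    match = NAMED.match(text)
    return match.groupdict() if match else None

samples = [
    "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]",
    "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt",
    "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]",
]

for text in samples:
    print(parse(text))
```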

#7

Here is one that works for what you're posting:

^.*(?<header>header_\d{10,11})_.*(?<date>_[a-z0-9]+_\d{8})(\[\d+\])(?<ext>(\.[a-zA-Z0-9]{3,4})?).*

The replacement is:

Header: $1
Date: $2
Extension: $4

I didn't use the named groups in the replacement because I couldn't figure out how to get TextMate to do it, but the named groups were helpful to force the capture.
