I have a file of data fields, which may contain comments, like below:
id, data, data, data
101 a, b, c
102 d, e, f
103 g, h, i // has to do with 101 a, b, c
104 j, k, l
//105 m, n, o
// 106 p, q, r
As you can see in the first comment above, a comment can directly reference a record that matches the pattern. Now, I want to capture 103 and its three data fields, but I don't want to capture what's in the comments.
I've tried negative lookbehind to exclude 105 and 106, but I can't come up with a regex to capture both.
(?<!//)(\b\d+\b),\s(data),\s(data),\s(data)
This captures every record except 105, but specifying
(?<!//\s*) or (?<!//.*)
in an attempt to exclude a comment marker followed by any whitespace or any characters invalidates my entire regex.
I have a feeling I need a crafty use of an anchor, or I need to wrap what I want in a capture group and reference it (like with $1) in my lookbehind.
If this is another case of "regular expressions don't support recursion" because it's a regular language (a la automata theory), please point that out.
Is it possible to exclude the comments in 103, and lines 105 and 106, using a regular expression? If so, how?
3 solutions
#1
6
The easy way out is to replace \s*//.* with the empty string before you begin.
This will remove all the (single-line) comments from your input and you can go on with a simple expression to match what actually you want.
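The two-step approach above could be sketched in Python roughly as follows. This is only an illustration: the sample rows use a comma after the id (an assumption, chosen to match the regexes in this thread), `\S+` stands in for the `data` placeholder, and the names `strip_comments` and `parse` are hypothetical.

```python
import re

# Sample input. The comma after each id is an assumption made so the
# rows fit the regexes discussed in this thread.
SAMPLE = """\
101, a, b, c
102, d, e, f
103, g, h, i // has to do with 101, a, b, c
104, j, k, l
//105, m, n, o
// 106, p, q, r
"""

# One record: an id followed by three comma-separated fields.
RECORD = re.compile(r"(\d+),\s(\S+),\s(\S+),\s(\S+)")


def strip_comments(text):
    """Remove optional whitespace, '//' and the rest of the line."""
    return re.sub(r"\s*//.*", "", text)


def parse(text):
    """Strip comments first, then match records with a simple pattern."""
    return RECORD.findall(strip_comments(text))


records = parse(SAMPLE)
```

Because the comments are gone before matching starts, the record pattern needs no look-around at all.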
The alternative would be to use look-ahead instead of look-behind:
^(?!//)(\b\d+\b),\s(data),\s(data),\s(data)
In your case it would even work to just anchor the regex because it is clear that the first thing on a line must be a digit:
^(\b\d+\b),\s(data),\s(data),\s(data)
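As a rough check of the two anchored variants, here is a small Python sketch. It assumes rows with a comma after the id, substitutes `\S+` for the `data` placeholder, and uses `re.MULTILINE` so that `^` matches at the start of every line rather than only at the start of the input.

```python
import re

# Sample input; the comma after each id is an assumption.
SAMPLE = """\
101, a, b, c
102, d, e, f
103, g, h, i // has to do with 101, a, b, c
104, j, k, l
//105, m, n, o
// 106, p, q, r
"""

# Variant with an explicit negative look-ahead for the comment marker.
WITH_LOOKAHEAD = re.compile(r"^(?!//)(\d+),\s(\S+),\s(\S+),\s(\S+)", re.MULTILINE)

# Variant that relies on the anchor alone: a line must start with digits.
ANCHOR_ONLY = re.compile(r"^(\d+),\s(\S+),\s(\S+),\s(\S+)", re.MULTILINE)

lookahead_ids = [m.group(1) for m in WITH_LOOKAHEAD.finditer(SAMPLE)]
anchored_ids = [m.group(1) for m in ANCHOR_ONLY.finditer(SAMPLE)]
```

On this input both variants find the same records, which suggests the look-ahead is redundant once the pattern is anchored and begins with `\d+`.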
Some regex engines (the one in .NET, for example) support variable-length look-behinds. Yours does not seem to be capable of this, which is why (?<!//\s*) fails for you.
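Python's built-in `re` module is one engine with this restriction, so it can illustrate the failure. The thread does not say which engine the asker uses, so Python is only an assumption for demonstration, and `compiles` is a hypothetical helper.

```python
import re


def compiles(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Fixed-width look-behind: exactly two characters, accepted.
fixed_ok = compiles(r"(?<!//)\d+")

# Variable-width look-behind: \s* has no fixed length, rejected by `re`.
variable_ok = compiles(r"(?<!//\s*)\d+")
```

The second pattern is rejected at compile time, before any input is examined, which fits the asker's observation that the change invalidates the whole regex.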
#2
1
It seems to me you could just anchor the expression at the beginning of the line (to get all the data):
^(\d+),\s(data),\s(data),\s(data)\s*(?://|$)
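The trailing `\s*(?://|$)` makes this pattern stricter than a plain anchor: after the third field, only a comment or the end of the line is allowed. A hedged Python sketch follows, with `\S+` in place of `data`, a comma assumed after the id, and a hypothetical malformed row `107` added for contrast.

```python
import re

# Sample input; the comma after each id is an assumption, and row 107
# (one field too many) is a hypothetical malformed record.
SAMPLE = """\
101, a, b, c
102, d, e, f
103, g, h, i // has to do with 101, a, b, c
104, j, k, l
//105, m, n, o
// 106, p, q, r
107, s, t, u, v
"""

# After the third field, only a comment or the end of the line may follow.
STRICT = re.compile(r"^(\d+),\s(\S+),\s(\S+),\s(\S+)\s*(?://|$)", re.MULTILINE)

# Plain anchored pattern for comparison.
LOOSE = re.compile(r"^(\d+),\s(\S+),\s(\S+),\s(\S+)", re.MULTILINE)

strict_ids = [m.group(1) for m in STRICT.finditer(SAMPLE)]
loose_ids = [m.group(1) for m in LOOSE.finditer(SAMPLE)]
```

Here the strict pattern skips row 107, while the plain anchored pattern still accepts it.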
Or maybe you can use a proper CSV parser which can handle comments.
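As one possible take on the parser route: Python's standard `csv` module has no comment option of its own, so this sketch cuts each line at `//` before handing it to `csv.reader`. The function name `read_records` and the comma after the id are assumptions.

```python
import csv

# Sample input; the comma after each id is an assumption.
SAMPLE = """\
101, a, b, c
102, d, e, f
103, g, h, i // has to do with 101, a, b, c
104, j, k, l
//105, m, n, o
// 106, p, q, r
"""


def read_records(text):
    """Cut each line at '//', drop lines left empty, then parse as CSV."""
    stripped = (line.split("//", 1)[0].strip() for line in text.splitlines())
    non_empty = (line for line in stripped if line)
    # skipinitialspace drops the blank that follows each comma.
    return list(csv.reader(non_empty, skipinitialspace=True))


rows = read_records(SAMPLE)
```

This naive cut would also truncate a quoted field that happens to contain `//`, so it only suits data like the sample above.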
#3
1
You could simply anchor the regex to the start of the line:
(?m)^(\d+),\s(\S+),\s(\S+),\s(\S+)
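For illustration, the pattern above can be used unchanged in Python, where the inline `(?m)` flag is also supported. The sketch assumes rows with a comma after the id and collects the matches into a dictionary (a hypothetical `by_id`) so that record 103 can be looked up directly.

```python
import re

# Sample input; the comma after each id is an assumption.
SAMPLE = """\
101, a, b, c
102, d, e, f
103, g, h, i // has to do with 101, a, b, c
104, j, k, l
//105, m, n, o
// 106, p, q, r
"""

# (?m) switches on multiline mode inline, so ^ matches at every line start.
PATTERN = r"(?m)^(\d+),\s(\S+),\s(\S+),\s(\S+)"

# Map each id to its three data fields.
by_id = {rec_id: (f1, f2, f3) for rec_id, f1, f2, f3 in re.findall(PATTERN, SAMPLE)}

fields_103 = by_id["103"]
```

The reference to 101 inside the comment on line 103 is not matched, because it does not sit at the start of a line.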