Regex for the first instance of a specific character that does not immediately follow another specific character

I have a function, translate(), that takes multiple parameters. The first parameter is the only required one and is a string, which I always wrap in single quotes, like this:

translate('hello world');

The other params are optional, but could be included like this:

translate('hello world', true, 1, 'foobar', 'etc');

And the string itself could contain escaped single quotes, like this:

translate('hello\'s world');

To the point, I now want to search through all code files for all instances of this function call, and extract just the string. To do so I've come up with the following grep, which returns everything between translate(' and either ') or ',. Almost perfect:

grep -RoPh "(?<=translate\(').*?(?='\)|'\,)" .

The problem with this though, is that if the call is something like this:

translate('hello \'world\', you\'re great!');

My grep would only return this:

hello \'world\

So I'm looking to modify this so that the part that currently looks for ') or ', instead looks for the first occurrence of ' that hasn't been escaped, i.e. doesn't immediately follow a \
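
For reference, the behaviour can be reproduced outside grep. This is only a minimal sketch that assumes Python's re module as a stand-in for grep -P (the constructs used here, a fixed-width lookbehind, a lazy quantifier and a lookahead, behave the same way in both). The helper name extract_original is hypothetical.

```python
import re

# The pattern from the grep command above, written as the regex engine sees it.
ORIGINAL = re.compile(r"(?<=translate\(').*?(?='\)|'\,)")


def extract_original(line):
    """Return the first match of the original pattern in line, or None."""
    match = ORIGINAL.search(line)
    return match.group(0) if match else None


if __name__ == "__main__":
    call = r"translate('hello \'world\', you\'re great!');"
    # The lazy .*? stops at the first "'," it meets, even though that
    # quote is escaped, so this prints: hello \'world\
    print(extract_original(call))
```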

Hopefully I'm making sense. Any suggestions please?

2 Solutions

#1

You can use this grep with PCRE regex:

grep -RoPh "\btranslate\(\s*\K'(?:[^'\\\\]*)(?:\\\\.[^'\\\\]*)*'" .

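RegEx breakdown (backslashes are shown as typed inside the shell's double quotes, so \\\\ reaches the regex engine as \\, i.e. one literal backslash):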

\b            # word boundary
translate     # match literal translate
\(            # match a (
\s*           # match 0 or more whitespace
\K            # reset the matched information
'             # match starting single quote
(?:           # start non-capturing group
   [^'\\\\]*  # match 0 or more chars that are not a backslash or single quote
)             # end non-capturing group
(?:           # start non-capturing group
   \\\\.      # match a backslash followed by char that is "escaped"
   [^'\\\\]*  # match 0 or more chars that are not a backslash or single quote
)*            # end non-capturing group, repeated 0 or more times
'             # match ending single quote
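
The breakdown above can be tried out with a sketch like the following. It assumes Python's re module in place of grep -P; Python's re has no \K, so a capturing group stands in for it, which also leaves the surrounding quotes out of the result. TRANSLATE_ARG and extract_strings are hypothetical names.

```python
import re

# Same structure as the breakdown: an ordinary run, then any number of
# "escaped char + ordinary run" repetitions, between two single quotes.
TRANSLATE_ARG = re.compile(
    r"""
    \btranslate\(\s*      # function name, opening paren, optional whitespace
    '                     # opening quote
    (                     # group 1: the string body (replaces \K)
        [^'\\]*           # chars that are neither a quote nor a backslash
        (?:\\.[^'\\]*)*   # a backslash plus the escaped char, then another run
    )
    '                     # closing quote
    """,
    re.VERBOSE,
)


def extract_strings(source):
    """Return the first argument of every translate('...') call in source."""
    return TRANSLATE_ARG.findall(source)


if __name__ == "__main__":
    print(extract_strings(r"translate( 'hello \'world\', you\'re great!', true);"))
```

Because every backslash is consumed together with the character after it, an escaped quote can never be mistaken for the closing one.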

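Here is a version without \K using look-arounds. Unlike the command above, which prints the surrounding single quotes as part of the match, this one prints only the text between them: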

grep -oPhR "(?<=\btranslate\(')(?:[^'\\\\]*)(?:\\\\.[^'\\\\]*)*(?=')" .

#2

I think the problem is the .*? part: the ? makes it a non-greedy pattern, meaning it'll take the shortest string that matches the pattern. In effect, you're saying, "give me the shortest string that's followed by quote+close-paren or quote+comma". In your example, "world\" is followed by a single quote and a comma, so it matches your pattern. In these cases, I like to use something like the following reasoning:

A string is a quote, zero or more characters, and a quote: '.*'

A character is anything that isn't a quote (because a quote terminates the string): '[^']*'

Except that you can put a quote in a string by escaping it with a backslash, so a character is either "backslash followed by a quote" or, failing that, "not a quote": '(\\'|[^'])*'

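Put it all together and you get the command below. Note that inside the shell's double quotes the backslash has to be doubled once more (\\\\'), because the shell reduces \\\\ to \\ before grep sees it, and the regex then reads \\ as one literal backslash:

grep -RoPh "(?<=translate\(')(\\\\'|[^'])*(?='\)|'\,)" .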
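
To check the reasoning above without worrying about shell quoting, here is a small sketch that assumes Python's re module as a stand-in for grep -P. The pattern is written at the regex level, where \\' means "a literal backslash followed by a quote"; the group is made non-capturing only for convenience, and extract_alternation is a hypothetical helper name.

```python
import re

# "Escaped quote, or failing that, anything that is not a quote", repeated,
# and followed by the closing quote plus either ")" or ",".
ALTERNATION = re.compile(r"(?<=translate\(')(?:\\'|[^'])*(?='\)|'\,)")


def extract_alternation(line):
    """Return the first translate('...') string found in line, or None."""
    match = ALTERNATION.search(line)
    return match.group(0) if match else None


if __name__ == "__main__":
    for call in (
        "translate('hello world', true, 1, 'foobar', 'etc');",
        r"translate('hello \'world\', you\'re great!');",
    ):
        print(extract_alternation(call))
```

The escaped-quote alternative is listed first so that a backslash followed by a quote is consumed as a pair before the "not a quote" branch gets a chance to take the backslash on its own.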
