Regex to parse a string with escaped characters

Time: 2022-10-07 21:52:33

I am reading information out of a formatted string. The format looks like this:

"foo:bar:beer:123::lol"

Everything between the ":" characters is data I want to extract with a regex. If a ":" is immediately followed by another ":" (as in "::"), the data for that field has to be "" (an empty string).

Currently I am parsing it with this regex:

(.*?)(:|$)

It then occurred to me that ":" may also appear within the data, so it has to be escaped. Example:

"foo:bar:beer:\::1337"

How can I change my regular expression so that it matches the "\:" as data, too?

Edit: I am using JavaScript as my programming language. It seems to have some limitations regarding complex regular expressions, so the solution should work in JavaScript as well.

Thanks, McFarlane

3 answers

#1


3  

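var myregexp = /((?:\\.|[^\\:])*)(?::|$)/g;
var matches = [];
var match = myregexp.exec(subject);
while (match != null) {
    // match[1] is the field; escape sequences are left as-is
    matches.push(match[1]);
    // The empty match at end-of-string does not advance lastIndex,
    // so bump it manually to avoid looping forever
    if (match[0].length === 0) {
        myregexp.lastIndex++;
    }
    match = myregexp.exec(subject);
}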

Input: "foo:bar:beer:\\:::1337"

Output: ["foo", "bar", "beer", "\\:", "", "1337", ""]

You'll always get an empty string as the last match. This is unavoidable given the requirement that empty strings between delimiters must also match (and the lack of lookbehind assertions in JavaScript engines that predate ES2018).

Explanation:

(          # Match and capture:
 (?:       # Either match...
  \\.      # an escaped character
 |         # or
  [^\\:]   # any character except backslash or colon
 )*        # zero or more times
)          # End of capturing group
(?::|$)    # Match (but don't capture) a colon or end-of-string
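
If the trailing empty string is a nuisance, one possible variation (a sketch, not part of the answer above) is to capture the delimiter as well and stop as soon as the end-of-string alternative matched. The function name splitFields is made up for illustration, and the sketch assumes the input is well-formed (no dangling backslash at the end).

```javascript
// Hypothetical helper: same pattern as above, but the delimiter is captured
// in group 2 so that ":" can be told apart from end-of-string.
function splitFields(subject) {
  const re = /((?:\\.|[^\\:])*)(:|$)/g;
  const fields = [];
  let match;
  while ((match = re.exec(subject)) !== null) {
    fields.push(match[1]);
    // Group 2 is "" only when $ matched, i.e. this was the last field.
    if (match[2] === "") break;
  }
  return fields;
}

console.log(splitFields("foo:bar:beer:\\:::1337"));
// → ["foo", "bar", "beer", "\\:", "", "1337"]
```

Breaking on the end-of-string match also sidesteps the zero-length-match problem, because exec is never called again once the end has been reached. An input that ends in ":" still yields a final empty field, which is the intended reading of a trailing delimiter.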

#2


2  

Use a negative lookbehind assertion.

(.*?)((?<!\\):|$)

This will only match : if it's not preceded by \.
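
As a rough illustration of the lookbehind idea, the same assertion can be handed to split. This assumes a JavaScript engine with lookbehind support (ES2018 or later); the function name splitOnUnescapedColon is invented for this sketch.

```javascript
// Assumption: the engine supports lookbehind assertions (ES2018+).
// Hypothetical helper: split on every ":" not directly preceded by "\".
function splitOnUnescapedColon(input) {
  return input.split(/(?<!\\):/);
}

console.log(splitOnUnescapedColon("foo:bar:beer:\\::1337"));
// → ["foo", "bar", "beer", "\\:", "1337"]
```

Note that the lookbehind only inspects a single preceding character, so an escaped backslash directly before a delimiter (\\:) would be misread as an escaped colon. The escape sequences are also left in the fields as-is.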

#3


1  

Here's a solution:

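function tokenize(str) {
  // A token is a run of escaped characters (\x) or of characters other than \ and :
  var reg = /((\\.|[^\\:])*)/g;
  var array = [];
  while (reg.lastIndex < str.length) {
    var match = reg.exec(str);
    // Unescape \\ and \: ; any other backslash sequence is kept as-is
    array.push(match[0].replace(/\\(\\|:)/g, "$1"));
    // Step over the ":" delimiter that ended this token
    reg.lastIndex++;
  }
  return array;
}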

It splits a string into tokens on the : character.

  • You can escape the : character with \ if you want it to be part of a token.

  • You can escape the \ with \ if you want it to be part of a token.

  • Any other \ is not interpreted (i.e. \a remains \a).

  • So you can put any data in your tokens, provided the data is formatted correctly beforehand.

Here is an example with the string \a:b:\n::\\:\::x, which should give these tokens: \a, b, \n, <empty string>, \, :, x.

>>> tokenize("\\a:b:\\n::\\\\:\\::x");
["\\a", "b", "\\n", "", "\\", ":", "x"]

In an attempt to be clearer: the string passed to the tokenizer is interpreted, and it has 2 special characters: \ and :

  • \ has a special meaning only if followed by \ or :, and effectively "escapes" these characters: they lose their special meaning for the tokenizer and are treated as normal characters (and thus become part of tokens).

  • : is the marker separating 2 tokens.

I realize the OP didn't ask for slash escaping, but other viewers could need a complete parsing library allowing any character in data.

