I am reading information out of a formatted string. The format looks like this:



Everything between the ":" is data I want to extract with regex. If a : is followed by another : (like "::") the data for this has to be "" (an empty string).


Currently I am parsing it with this regex:



It then occurred to me that ":" may also appear within the data, so it has to be escaped. Example:



How can I change my regular expression so that it matches the "\:" as data, too?


Edit: I am using JavaScript as my programming language. It seems to have some limitations regarding complex regular expressions, so the solution should work in JavaScript as well.


3 Solutions



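var myregexp = /((?:\\.|[^\\:])*)(?::|$)/g;
var matches = [];
var match = myregexp.exec(subject);
while (match != null) {
    // match[1] is the captured field (escape sequences are left intact)
    matches.push(match[1]);
    // A zero-length match at end-of-string does not advance lastIndex;
    // step past it so the loop terminates
    if (match[0] === "") myregexp.lastIndex++;
    match = myregexp.exec(subject);
}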

Input: "foo:bar:beer:\\:::1337"

Output: ["foo", "bar", "beer", "\\:", "", "1337", ""]


You'll always get an empty string as the last match. This is unavoidable given the requirement that you also want empty strings to match between delimiters (and the lack of lookbehind assertions in JavaScript).



(          # Match and capture:
 (?:       # Either match...
  \\.      # an escaped character
 |         # or
  [^\\:]   # any character except backslash or colon
 )*        # zero or more times
)          # End of capturing group
(?::|$)    # Match (but don't capture) a colon or end-of-string
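
As a minimal sketch of how the pattern broken down above could be wrapped for reuse: the function name `splitFields` is hypothetical and not part of the answer. It collects capture group 1 from each match and stops as soon as a match ends at end-of-string rather than at a colon, so the extra trailing empty string is not produced. It assumes the input does not end with a dangling backslash.

```javascript
// Hypothetical helper: split on unescaped colons, keeping escape sequences intact.
function splitFields(subject) {
  var pattern = /((?:\\.|[^\\:])*)(?::|$)/g;
  var fields = [];
  var match;
  while ((match = pattern.exec(subject)) !== null) {
    fields.push(match[1]);
    // If the whole match equals the captured group, no colon was consumed,
    // so this match ended at end-of-string: stop here.
    if (match[0].length === match[1].length) break;
  }
  return fields;
}

// Fields: foo, bar, beer, \:, (empty string), 1337
console.log(splitFields("foo:bar:beer:\\:::1337"));
```

An input that really ends with a colon, such as "a:", still yields a final empty field, which matches the requirement that "::" produces an empty string.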



Use a negative lookbehind assertion.



This will only match : if it's not preceded by \.
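
A minimal sketch of what the lookbehind approach might look like in JavaScript, assuming an engine that supports lookbehind assertions (they were added in ES2018; older engines reject the pattern). The function name `splitOnUnescapedColon` is hypothetical. Note one limitation of this sketch: an escaped backslash directly before a delimiter (`\\:`) is also treated as an escaped colon.

```javascript
// Hypothetical helper: split on every colon that is not preceded by a backslash.
// Requires lookbehind support (ES2018 or later).
function splitOnUnescapedColon(subject) {
  return subject.split(/(?<!\\):/);
}

// Fields: foo, bar, beer, \:, (empty string), 1337
console.log(splitOnUnescapedColon("foo:bar:beer:\\:::1337"));
```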




Here's a solution:


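function tokenize(str) {
  var reg = /((\\.|[^\\:])*)/g;
  var array = [];
  while (reg.lastIndex < str.length) {
    var match = reg.exec(str);
    // Unescape \\ and \: ; any other backslash sequence is kept as-is
    array.push(match[0].replace(/\\(\\|:)/g, "$1"));
    // Step over the ":" delimiter (exec does not advance past an empty match)
    reg.lastIndex++;
  }
  return array;
}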

It splits a string into tokens at each : character.


  • But you can escape the : character with \ if you want it to be part of a token.

  • You can escape the \ with another \ if you want it to be part of a token.

  • Any other \ is not interpreted (i.e. \a remains \a).

  • So you can put any data in your tokens, provided that the data is correctly escaped beforehand.
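
To illustrate what preparing the data beforehand could look like, here is a minimal sketch of the inverse operation. The names `escapeToken` and `joinTokens` are hypothetical and not part of the answer. Each \ and : inside a token is prefixed with a backslash, and the escaped tokens are then joined with the : separator.

```javascript
// Hypothetical helper: prefix every backslash and colon with a backslash.
function escapeToken(token) {
  return token.replace(/[\\:]/g, "\\$&");
}

// Hypothetical helper: build a colon-separated string from raw tokens.
function joinTokens(tokens) {
  return tokens.map(escapeToken).join(":");
}

// Tokens a:b, c\d, (empty string), e become: a\:b:c\\d::e
console.log(joinTokens(["a:b", "c\\d", "", "e"]));
```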

Here is an example with the string \a:b:\n::\\:\::x, which should give these tokens: \a, b, \n, <empty string>, \, :, x.


>>> tokenize("\\a:b:\\n::\\\\:\\::x");
["\a", "b", "\n", "", "\", ":", "x"]

To be clearer: the string passed to the tokenizer is interpreted, and it has 2 special characters: \ and :


  • \ has a special meaning only when it is followed by \ or :. It then "escapes" that character: the character loses its special meaning for the tokenizer and is treated as a normal character (and thus becomes part of a token).

  • : is the marker separating 2 tokens.
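
The two rules above can also be sketched as a plain character loop, which may be easier to follow than the regular expression. This is an alternative illustration, not the answer's implementation, and the name `tokenizeByChar` is hypothetical. Unlike the regex-based version, this sketch also keeps an empty token after a trailing colon.

```javascript
// Hypothetical character-by-character tokenizer following the two rules above.
function tokenizeByChar(str) {
  var tokens = [];
  var current = "";
  for (var i = 0; i < str.length; i++) {
    var ch = str[i];
    var next = str[i + 1];
    if (ch === "\\" && (next === "\\" || next === ":")) {
      current += next; // escaped \ or : becomes ordinary data
      i++;             // skip the escaped character
    } else if (ch === ":") {
      tokens.push(current); // an unescaped colon ends the current token
      current = "";
    } else {
      current += ch; // everything else, including a lone \, is kept as-is
    }
  }
  tokens.push(current);
  return tokens;
}

// Tokens: \a, b, \n, (empty string), \, :, x
console.log(tokenizeByChar("\\a:b:\\n::\\\\:\\::x"));
```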

I realize the OP didn't ask for backslash escaping, but other readers may need a complete parsing routine that allows any character in the data.




