如何将一定长度的字符串与正则表达式匹配

时间:2020-12-30 19:23:46

For a project of mine, I'm trying to implement a small part of the BitTorrent protocol, which can be found here. Specifically, I want to use the "Bencoding" part of it, which is a way to safely encode data for transfer over a socket. The format is as follows:

对于我的项目,我正在尝试实现BitTorrent协议的一小部分,可在此处找到。具体来说,我想使用它的“Bencoding”部分,这是一种安全编码数据以便通过套接字传输的方法。格式如下:

8:a string => "a string"
i1234e => 1234
l1:a1:be => ['a', 'b']
d1:a1:b3:one3:twoe => {'a':'b', 'one':two}

The encoding part was easy enough, but decoding is become quite a hassle. For example, if I have a list of strings, I have no way to separate them into individual strings. I've tried several different solutions, including PyParsing and a custom token parser. I'm currently attempting to use regexes, and it seems to be going fairly well, but I'm still hung up on the string problem. My current regex is:

编码部分很简单,但解码变得非常麻烦。例如,如果我有一个字符串列表,我无法将它们分成单独的字符串。我尝试了几种不同的解决方案,包括PyParsing和自定义令牌解析器。我目前正在尝试使用正则表达式,它看起来相当不错,但我仍然挂在字符串问题上。我现在的正则表达式是:

(?P<length>\d+):(?P<contents>.{\1})

However, i can't seem to use the first group as the length of the second group. Is there any good way to do this? Or am I approaching this all wrong, and the answer is sitting right in front of me?

但是,我似乎无法使用第一组作为第二组的长度。有没有好办法呢?或者我接近这一切都错了,答案就在我面前?

5 个解决方案

#1


8  

Any parser you use for this is going to need to be stateful (i.e. remember stuff), and regexes are, by and large, not stateful. They're the wrong tool for this job.

你用于此的任何解析器都需要是有状态的(即记住东西),并且正则表达式基本上不是有状态的。他们是这项工作的错误工具。

If those are the only data types you have to worry about, I think I'd just write custom parsers for each data type, passing control to the appropriate parser after reading the first character.

如果这些是您唯一需要担心的数据类型,我想我只是为每种数据类型编写自定义解析器,在读取第一个字符后将控制权传递给相应的解析器。

I'd actually implement one now, but it's late.

我现在实际上实现了一个,但现在已经很晚了。

Alright I decided to write an implementation:

好吧我决定写一个实现:

from StringIO import StringIO
import string

inputs = ["10:a stringly",
         "i1234e" ,
         "l1:a1:be",
         "d1:a1:b3:one3:twoe"]

# Constants
DICT_TYPE = 'd'
LIST_TYPE = 'l'
INT_TYPE  = 'i'
TOKEN_EOF = ''
TOKEN_END = 'e'
COLON     = ':'


class BadTypeIndicatorException(Exception):pass


def read_int(stream):

   s = ""

   while True:
      ch = stream.read(1)
      if ch not in [TOKEN_EOF, TOKEN_END, COLON]:
         s += ch
      else:
         break

   return s


def tokenize(stream):

   s = ""

   while True:

      ch = stream.read(1)

      if ch == TOKEN_END or ch == TOKEN_EOF:
         return 

      if ch == COLON:
         length = int(s)
         yield stream.read(length)
         s = ""

      else:
         s += ch


def parse(stream):

   TYPE = stream.read(1)

   if TYPE in string.digits:
      length = int( TYPE + read_int(stream) )
      return stream.read(length)

   elif TYPE is INT_TYPE: 
      return int( read_int(stream) )

   elif TYPE is LIST_TYPE: 
      return list(tokenize(stream))

   elif TYPE is DICT_TYPE:
      tokens = list(tokenize(stream))
      return dict(zip(tokens[0::2], tokens[1::2]))

   else: 
      raise BadTypeIndicatorException



for input in inputs:
   stream = StringIO(input)
   print parse(stream)

#2


2  

You can do it if you parse the string twice. Apply the first regex to get the length. Concatenate the length in your second regex to form a valid expression.

如果您将字符串解析两次,则可以执行此操作。应用第一个正则表达式以获得长度。连接第二个正则表达式中的长度以形成有效的表达式。

Not sure how that can be done in python, but a sample in C# would be:

不知道如何在python中完成,但C#中的示例将是:

string regex = "^[A-Za-z0-9_]{1," + length + "}$"

To match 1 to length no of chars which can be alpanumeric or _ where length is determined from a previous regex that retrieves only the length.

匹配1到长度no的字符可以是alpanumeric或_其中length是从先前的正则表达式中确定的,该正则表达式只检索长度。

Hope this helps :)

希望这可以帮助 :)

#3


2  

You'll want to do this in two steps. Regular expressions are actually a little overkill for such simple parsing problems as this. Here's how I'd do it:

您需要分两步完成此操作。对于像这样简单的解析问题,正则表达式实际上有点过分。我是这样做的:

def read_string(stream):
    pos = stream.index(':')
    length = int(stream[0:pos])
    string = stream[pos+1:pos+1+length]
    return string, stream[pos+1+length:]

It's a functional-style way of parsing, it returns the value parsed and the rest of the stream.

它是一种功能风格的解析方式,它返回解析的值和流的其余部分。

For lists, maybe:

对于列表,可能:

def read_list(stream):
    stream = stream[1:]
    result = []
    while stream[0] != 'e':
        obj, stream = read_object(stream)
        result.append(obj)
    stream = stream[1:]
    return result

And then you'd define a read_object that checks the first character of the stream and dispatches appropriately.

然后你定义一个read_object来检查流的第一个字符并适当地调度。

#4


1  

You are using the wrong tool for the job... This requires some sort of state keeping, and generally speaking, regular expressions are stateless.

你正在使用错误的工具......这需要某种状态保持,一般来说,正则表达式是无状态的。


An example implementation of bdecoding (and bencoding) in PERL that I did can be found here.

我在这里可以找到PERL中的bdecoding(和bencoding)的示例实现。

An explanation of how that function works (since I never did get to comment it [oops]):

解释该函数是如何工作的(因为我从来没有对它进行评论[oops]):

Basically what you need to do is setup a recursive function. This function takes a string reference (so it can be modified) and returns "something" (the nature of this means it could be an array, a hashtable, an int, or a string).

基本上你需要做的是设置一个递归函数。此函数接受一个字符串引用(因此可以修改)并返回“something”(这意味着它可以是一个数组,一个哈希表,一个int或一个字符串)。

The function itself just checks the first character in the string and decides what to do based of that:

函数本身只是检查字符串中的第一个字符,并根据它决定要做什么:

  • If it is an i, then parse out all the text between the i and the first e, and try to parse it as an int according to the rules of what is allowed.
  • 如果是i,则解析i和第一个e之间的所有文本,并尝试根据允许的规则将其解析为int。
  • If it is a digit, then read all the digits up to :, then read that many characters off the string.
  • 如果是数字,则读取所有数字到:,然后从字符串中读取许多字符。

Lists and dictionaries are where things start to get interesting... if there is an l or d as the first character, then you need to strip off the l/d, then pass the current string back into the function, so that it can start parsing elements in the list or dictionary. Then just store the returned values in the appropriate places in an appropriate structure till you hit an e, and return the structure you're left with.

列表和词典是事情开始变得有趣的地方......如果有一个l或d作为第一个字符,那么你需要剥离l / d,然后将当前字符串传递回函数,以便它可以开始解析列表或字典中的元素。然后将返回的值存储在适当结构中的适当位置,直到您触及e,并返回您剩下的结构。

Remember, the function as I implemented it was DESTRUCTIVE. The string passed in is empty when the function returns due to it being passed by reference, or more accurately, it will be devoid of anything it parsed and returned (which is why it can be used recursively: anything it doesn't process is left untouched). In most cases of the initial call though, this should process everything unless you've been doing something odd, so the above holds.

请记住,我实现它的功能是破坏性的。函数返回时传入的字符串是空的,因为它是通过引用传递的,或者更准确地说,它将没有任何解析和返回的东西(这就是为什么它可以递归使用:它不处理的任何东西都留下了不变)。在大多数初始调用的情况下,这应该处理所有事情,除非你做了一些奇怪的事情,所以上面说的话。

#5


1  

Pseudo-code, without syntax checks:

伪代码,没有语法检查:

define read-integer (stream):
    let number 0, sign 1:
        if string-equal ('-', (c <- read-char (stream))):
            sign <- -1
          else:
            number <- parse-integer (c)
        while number? (c <- read-char (stream)):
            number <- (number * 10) + parse-integer (c)
        return sign * number

define bdecode-string (stream):
    let count read-integer (stream):
        return read-n-chars (stream, count)

define bdecode-integer (stream):
    ignore read-char (stream)
    return read-integer (stream)

define bdecode-list (stream):
    ignore read-char (stream)
    let list []:
        while not string-equal ('e', peek-char (stream)):
            append (list, bdecode (stream))
        return list

define bdecode-dictionary (stream):
    let list bdecode-list stream:
        return dictionarify (list)

define bdecode (stream):
    case peek-char (stream):
        number? => bdecode-string (stream)
        'i' => bdecode-integer (stream)
        'l' => bdecode-list (stream)
        'd' => bdecode-dictionary (stream)

#1


8  

Any parser you use for this is going to need to be stateful (i.e. remember stuff), and regexes are, by and large, not stateful. They're the wrong tool for this job.

你用于此的任何解析器都需要是有状态的(即记住东西),并且正则表达式基本上不是有状态的。他们是这项工作的错误工具。

If those are the only data types you have to worry about, I think I'd just write custom parsers for each data type, passing control to the appropriate parser after reading the first character.

如果这些是您唯一需要担心的数据类型,我想我只是为每种数据类型编写自定义解析器,在读取第一个字符后将控制权传递给相应的解析器。

I'd actually implement one now, but it's late.

我现在实际上实现了一个,但现在已经很晚了。

Alright I decided to write an implementation:

好吧我决定写一个实现:

from StringIO import StringIO
import string

inputs = ["10:a stringly",
         "i1234e" ,
         "l1:a1:be",
         "d1:a1:b3:one3:twoe"]

# Constants
DICT_TYPE = 'd'
LIST_TYPE = 'l'
INT_TYPE  = 'i'
TOKEN_EOF = ''
TOKEN_END = 'e'
COLON     = ':'


class BadTypeIndicatorException(Exception):pass


def read_int(stream):

   s = ""

   while True:
      ch = stream.read(1)
      if ch not in [TOKEN_EOF, TOKEN_END, COLON]:
         s += ch
      else:
         break

   return s


def tokenize(stream):

   s = ""

   while True:

      ch = stream.read(1)

      if ch == TOKEN_END or ch == TOKEN_EOF:
         return 

      if ch == COLON:
         length = int(s)
         yield stream.read(length)
         s = ""

      else:
         s += ch


def parse(stream):

   TYPE = stream.read(1)

   if TYPE in string.digits:
      length = int( TYPE + read_int(stream) )
      return stream.read(length)

   elif TYPE is INT_TYPE: 
      return int( read_int(stream) )

   elif TYPE is LIST_TYPE: 
      return list(tokenize(stream))

   elif TYPE is DICT_TYPE:
      tokens = list(tokenize(stream))
      return dict(zip(tokens[0::2], tokens[1::2]))

   else: 
      raise BadTypeIndicatorException



for input in inputs:
   stream = StringIO(input)
   print parse(stream)

#2


2  

You can do it if you parse the string twice. Apply the first regex to get the length. Concatenate the length in your second regex to form a valid expression.

如果您将字符串解析两次,则可以执行此操作。应用第一个正则表达式以获得长度。连接第二个正则表达式中的长度以形成有效的表达式。

Not sure how that can be done in python, but a sample in C# would be:

不知道如何在python中完成,但C#中的示例将是:

string regex = "^[A-Za-z0-9_]{1," + length + "}$"

To match 1 to length no of chars which can be alpanumeric or _ where length is determined from a previous regex that retrieves only the length.

匹配1到长度no的字符可以是alpanumeric或_其中length是从先前的正则表达式中确定的,该正则表达式只检索长度。

Hope this helps :)

希望这可以帮助 :)

#3


2  

You'll want to do this in two steps. Regular expressions are actually a little overkill for such simple parsing problems as this. Here's how I'd do it:

您需要分两步完成此操作。对于像这样简单的解析问题,正则表达式实际上有点过分。我是这样做的:

def read_string(stream):
    pos = stream.index(':')
    length = int(stream[0:pos])
    string = stream[pos+1:pos+1+length]
    return string, stream[pos+1+length:]

It's a functional-style way of parsing, it returns the value parsed and the rest of the stream.

它是一种功能风格的解析方式,它返回解析的值和流的其余部分。

For lists, maybe:

对于列表,可能:

def read_list(stream):
    stream = stream[1:]
    result = []
    while stream[0] != 'e':
        obj, stream = read_object(stream)
        result.append(obj)
    stream = stream[1:]
    return result

And then you'd define a read_object that checks the first character of the stream and dispatches appropriately.

然后你定义一个read_object来检查流的第一个字符并适当地调度。

#4


1  

You are using the wrong tool for the job... This requires some sort of state keeping, and generally speaking, regular expressions are stateless.

你正在使用错误的工具......这需要某种状态保持,一般来说,正则表达式是无状态的。


An example implementation of bdecoding (and bencoding) in PERL that I did can be found here.

我在这里可以找到PERL中的bdecoding(和bencoding)的示例实现。

An explanation of how that function works (since I never did get to comment it [oops]):

解释该函数是如何工作的(因为我从来没有对它进行评论[oops]):

Basically what you need to do is setup a recursive function. This function takes a string reference (so it can be modified) and returns "something" (the nature of this means it could be an array, a hashtable, an int, or a string).

基本上你需要做的是设置一个递归函数。此函数接受一个字符串引用(因此可以修改)并返回“something”(这意味着它可以是一个数组,一个哈希表,一个int或一个字符串)。

The function itself just checks the first character in the string and decides what to do based of that:

函数本身只是检查字符串中的第一个字符,并根据它决定要做什么:

  • If it is an i, then parse out all the text between the i and the first e, and try to parse it as an int according to the rules of what is allowed.
  • 如果是i,则解析i和第一个e之间的所有文本,并尝试根据允许的规则将其解析为int。
  • If it is a digit, then read all the digits up to :, then read that many characters off the string.
  • 如果是数字,则读取所有数字到:,然后从字符串中读取许多字符。

Lists and dictionaries are where things start to get interesting... if there is an l or d as the first character, then you need to strip off the l/d, then pass the current string back into the function, so that it can start parsing elements in the list or dictionary. Then just store the returned values in the appropriate places in an appropriate structure till you hit an e, and return the structure you're left with.

列表和词典是事情开始变得有趣的地方......如果有一个l或d作为第一个字符,那么你需要剥离l / d,然后将当前字符串传递回函数,以便它可以开始解析列表或字典中的元素。然后将返回的值存储在适当结构中的适当位置,直到您触及e,并返回您剩下的结构。

Remember, the function as I implemented it was DESTRUCTIVE. The string passed in is empty when the function returns due to it being passed by reference, or more accurately, it will be devoid of anything it parsed and returned (which is why it can be used recursively: anything it doesn't process is left untouched). In most cases of the initial call though, this should process everything unless you've been doing something odd, so the above holds.

请记住,我实现它的功能是破坏性的。函数返回时传入的字符串是空的,因为它是通过引用传递的,或者更准确地说,它将没有任何解析和返回的东西(这就是为什么它可以递归使用:它不处理的任何东西都留下了不变)。在大多数初始调用的情况下,这应该处理所有事情,除非你做了一些奇怪的事情,所以上面说的话。

#5


1  

Pseudo-code, without syntax checks:

伪代码,没有语法检查:

define read-integer (stream):
    let number 0, sign 1:
        if string-equal ('-', (c <- read-char (stream))):
            sign <- -1
          else:
            number <- parse-integer (c)
        while number? (c <- read-char (stream)):
            number <- (number * 10) + parse-integer (c)
        return sign * number

define bdecode-string (stream):
    let count read-integer (stream):
        return read-n-chars (stream, count)

define bdecode-integer (stream):
    ignore read-char (stream)
    return read-integer (stream)

define bdecode-list (stream):
    ignore read-char (stream)
    let list []:
        while not string-equal ('e', peek-char (stream)):
            append (list, bdecode (stream))
        return list

define bdecode-dictionary (stream):
    let list bdecode-list stream:
        return dictionarify (list)

define bdecode (stream):
    case peek-char (stream):
        number? => bdecode-string (stream)
        'i' => bdecode-integer (stream)
        'l' => bdecode-list (stream)
        'd' => bdecode-dictionary (stream)