How do I parse a char array with octal values in Python?

Time: 2022-10-11 12:16:24

EDIT: I should note that I want a general case for any hex array, not just the google one I provided.

EDIT BACKGROUND: Background is networking: I'm parsing a DNS packet and trying to get its QNAME. I'm taking in the whole packet as a string, and every character represents a byte. Apparently this problem looks like a Pascal string problem, and using the struct module seems like the way to go.

I have a char array in Python 2.7 which includes octal values. For example, let's say I have an array

DNS = "\03www\06google\03com\0"

I want to get:

www.google.com

What's an efficient way to do this? My first thought would be iterating through the DNS char array and adding chars to my new array answer. Every time I see a '\' char, I would ignore the '\' and the two chars after it. Is there a way to get the resulting www.google.com without using a new array?

My disgusting implementation (my answer is an array of chars, which is not what I want; I just want the string www.google.com):

DNS = "\\03www\\06google\\03com\\0"
answer = []
i = 0
while i < len(DNS):
    if DNS[i] == '\\' and DNS[i+1] != 0:
        i += 3    
    elif DNS[i] == '\\' and DNS[i+1] == 0:
        break
    else:
        answer.append(DNS[i])
        i += 1

4 solutions

#1


2  

Now that you've explained your real problem, none of the answers you've gotten so far will work. Why? Because they're all ways to remove sequences like \03 from a string. But you don't have sequences like \03, you have single control characters.

You could, of course, do something similar, just replacing any control character with a dot.
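
As a rough illustration of that shortcut (not a recommendation), here is a minimal sketch. The sample bytes and the name `dotted` are made up for the example, and it assumes every length byte is below 0x20; a label of 32 or more characters has a printable length byte and would slip through.

```python
import re

# Hypothetical sample: the QNAME bytes for www.google.com, held in a str.
raw = "\x03www\x06google\x03com\x00"

# Replace every ASCII control character with a dot, then trim the dots
# left behind by the first length byte and by the zero terminator.
dotted = re.sub(r'[\x00-\x1f]', '.', raw).strip('.')
print(dotted)
```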

But what you're really trying to do is not replace control characters with dots, but parse DNS packets.

DNS is defined by RFC 1035. The QNAME in a DNS packet is:

a domain name represented as a sequence of labels, where each label consists of a length octet followed by that number of octets. The domain name terminates with the zero length octet for the null label of the root. Note that this field may be an odd number of octets; no padding is used.
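
To make that layout concrete, it can help to build a QNAME by hand. The sketch below goes in the opposite direction from parsing; `encode_qname` is a hypothetical helper written for this illustration, and it assumes plain ASCII labels.

```python
def encode_qname(name):
    """Encode a dotted ASCII name as length-prefixed labels (hypothetical helper)."""
    out = bytearray()
    for label in name.split('.'):
        data = label.encode('ascii')
        # RFC 1035 limits each label to 63 octets; empty labels are not allowed here.
        if not 0 < len(data) <= 63:
            raise ValueError('label must be 1 to 63 octets')
        out.append(len(data))   # the length octet
        out.extend(data)        # followed by that number of octets
    out.append(0)               # zero-length root label ends the name
    return bytes(out)

wire = encode_qname('www.google.com')
print(repr(wire))
```

The result is the same byte sequence as the "\03www\06google\03com\0" string in the question: a 3, then www, a 6, then google, a 3, then com, and a final 0.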

So, let's parse that. If you understand how labels consisting of "a length octet followed by that number of octets" relate to "Pascal strings", there's a quicker way. Also, you could write this more cleanly and less verbosely as a generator. But let's do it the dead-simple way:

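import struct

def parse_qname(packet):
    components = []
    offset = 0
    while True:
        # One unsigned byte gives the length of the next label.
        length, = struct.unpack_from('B', packet, offset)
        offset += 1
        # A zero length is the root label, which ends the name.
        if not length:
            break
        # unpack_from returns a tuple, so unpack its single item.
        component, = struct.unpack_from('{}s'.format(length), packet, offset)
        offset += length
        components.append(component)
    return components, offset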
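
For comparison, here is a sketch of the generator variant mentioned above, together with the final join into a dotted name. The names `iter_labels` and `packet` are illustrative, the sample bytes contain only the QNAME rather than a full DNS packet, and DNS name compression pointers (which real responses may use) are not handled.

```python
import struct

def iter_labels(packet, offset=0):
    """Yield each label of the QNAME that starts at offset (illustrative name)."""
    while True:
        length, = struct.unpack_from('B', packet, offset)
        offset += 1
        if not length:
            return  # zero-length root label: end of the name
        label, = struct.unpack_from('{}s'.format(length), packet, offset)
        offset += length
        yield label

# Assumed sample input: just the QNAME bytes for www.google.com.
packet = b"\x03www\x06google\x03com\x00"
name = b".".join(iter_labels(packet))
print(name)
```

A truncated packet makes `struct.unpack_from` raise `struct.error`, so malformed input fails loudly instead of producing a wrong name.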

#2


1  

import re
DNS = "\\03www\\06google\\03com\\0"
# Match a backslash followed by one or two digits, so the trailing \0 is removed too.
m = re.sub(r"\\[0-9a-f]{1,2}", "", DNS)
print(m)

#3


1  

Maybe something like this?

#!/usr/bin/python3

import re

def convert(adorned_hostname):
    result1 = re.sub(r'^\\03', '', adorned_hostname )
    result2 = re.sub(r'\\0[36]', '.', result1)
    result3 = re.sub(r'\\0$', '', result2)
    return result3

def main():
    adorned_hostname = r"\03www\06google\03com\0"
    expected_result = 'www.google.com'
    actual_result = convert(adorned_hostname)
    print(actual_result, expected_result)
    assert actual_result == expected_result

main()

#4


1  

For the question as originally asked, replacing the backslash-hex sequences in strings like "\\03www\\06google\\03com\\0" with dots…

If you want to do this with a regular expression:

  • \\ matches a backslash.
  • [0-9A-Fa-f] matches any hex digit.
  • [0-9A-Fa-f]+ matches one or more hex digits.
  • \\[0-9A-Fa-f]+ matches a backslash followed by one or more hex digits.

You want to find each such sequence, and replace it with a dot, right? If you look through the re docs, you'll find a function called sub which is used for replacing a pattern with a replacement string:

re.sub(r'\\[0-9A-Fa-f]+', '.', DNS)

I suspect these are actually octal, not hex, in which case you want [0-7] rather than [0-9A-Fa-f]. That also matters for this input: c is a hex digit, so the hex pattern swallows the c of com, whereas the octal pattern does not.
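
A small sketch comparing the two character classes on the sample string from the question. The variable names are illustrative. Note that even the octal pattern would still swallow a label that begins with a digit from 0 to 7, so treat this as a demonstration rather than a robust parser.

```python
import re

DNS = "\\03www\\06google\\03com\\0"

# Hex class: 'c' is a hex digit, so "\03c" is consumed as one match.
hex_result = re.sub(r'\\[0-9A-Fa-f]+', '.', DNS)
print(hex_result)    # .www.google.om.

# Octal class: only the digits after each backslash are consumed.
octal_result = re.sub(r'\\[0-7]+', '.', DNS)
print(octal_result)  # .www.google.com.

# The first length marker and the terminator leave a dot at each end.
print(octal_result.strip('.'))  # www.google.com
```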

A different way to do this is to recognize that these are valid Python escape sequences. And, if we unescape them back to where they came from (e.g., with DNS.decode('string_escape')), this turns into a sequence of length-prefixed (aka "Pascal") strings, a standard format that you can parse in any number of ways, including the stdlib struct module. This has the advantage of validating the data as you read it, and not being thrown off by any false positives that could show up if one of the string components, say, had a backslash in the middle of it.
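
A hedged sketch of that idea follows. `str.decode('string_escape')` exists only in Python 2, so this version assumes Python 3 and uses the `unicode_escape` codec as a stand-in; `unescape` and `split_pascal_strings` are hypothetical helper names. A label that starts with an octal digit would merge into the preceding escape, which is one more reason to prefer the raw packet bytes when you have them.

```python
def unescape(text):
    """Turn backslash escapes such as \\03 back into raw bytes (Python 3)."""
    return text.encode('ascii').decode('unicode_escape').encode('latin-1')

def split_pascal_strings(data):
    """Split concatenated length-prefixed strings that end with a zero length."""
    labels = []
    offset = 0
    while True:
        length = data[offset]  # raises IndexError if the terminator is missing
        offset += 1
        if length == 0:
            break
        if offset + length > len(data):
            raise ValueError('length prefix runs past the end of the data')
        labels.append(data[offset:offset + length])
        offset += length
    return labels

raw = unescape("\\03www\\06google\\03com\\0")
labels = split_pascal_strings(raw)
print(b".".join(labels))
```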

Of course that's presuming more about the data. It seems likely that the real meaning of this is "a sequence of length-prefixed strings, concatenated, then backslash-escaped", in which case you should parse it as such. But it could be just a coincidence that it looks like that, in which case it would be a very bad idea to parse it as such.
