Most Pythonic way to break up a highly branched parser

Time: 2022-03-16 21:38:18

I'm working on a parser for a specific type of file that is broken up into sections by some header keyword followed by a bunch of heterogeneous data. Headers are always separated by blank lines. Something along the lines of the following:

Header_A

1 1.02345
2 2.97959
...

Header_B

1   5.1700   10.2500
2   5.0660   10.5000
...

Every header contains very different types of data and, depending on certain keywords within a block, the data must be stored in different locations. The general approach I took is to have a regex that catches all of the keywords that can define a header and then iterate through the lines in the file. Once I find a match, I pop lines until I reach a blank line, storing the data from those lines in the appropriate locations.

This is the basic structure of the code, where "do stuff with current_line" involves a bunch of branches depending on what the line contains:

import re

headers = re.compile(r"""
    ((?P<header_a>Header_A)
    |
    (?P<header_b>Header_B))
    """, re.VERBOSE)

i = 0
while i < len(data_lines):
    match = headers.match(data_lines[i])
    if match:
        if match.group('header_a'):
            data_lines.pop(i)  # discard the header line
            data_lines.pop(i)  # discard the blank line that follows it

            #     not end of file         not blank line
            while i < len(data_lines) and data_lines[i].strip():
                current_line = data_lines.pop(i)
                # do stuff with current_line

        elif match.group('header_b'):
            data_lines.pop(i)
            data_lines.pop(i)

            while i < len(data_lines) and data_lines[i].strip():
                current_line = data_lines.pop(i)
                # do stuff with current_line
        else:
            i += 1
    else:
        i += 1

Everything works correctly, but it amounts to a highly branched structure that I find hard to read and likely hard to follow for anyone unfamiliar with the code. It also makes it harder to keep lines under 79 characters and, more generally, doesn't feel very Pythonic.

One thing I'm working on is separating the branch for each header into separate functions. This will hopefully improve readability quite a bit but...

...is there a cleaner way to perform the outer looping/matching structure? Maybe using itertools?
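
For reference, this is roughly the shape of the refactor I have in mind, building on the regex and data_lines above (the handler names and block-collecting details are just placeholders):

def handle_header_a(block_lines):
    for current_line in block_lines:
        pass  # do stuff with current_line

def handle_header_b(block_lines):
    for current_line in block_lines:
        pass  # do stuff with current_line

# map the regex group name that matched to its handler
handlers = {'header_a': handle_header_a, 'header_b': handle_header_b}

i = 0
while i < len(data_lines):
    match = headers.match(data_lines[i])
    if match is None:
        i += 1
        continue
    # figure out which named group matched, e.g. 'header_a'
    group_name = next(name for name, text in match.groupdict().items() if text)
    i += 2  # skip the header line and the blank line after it
    block = []
    while i < len(data_lines) and data_lines[i].strip():
        block.append(data_lines[i])
        i += 1
    handlers[group_name](block)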

Also, for various reasons, this code must be able to run on Python 2.7.

3 Answers

#1


You could use itertools.groupby to group the lines according to which processing function you wish to perform:

import itertools as IT

def process_a(lines):
    for line in lines:
        line = line.strip()
        if not line: continue        
        print('processing A: {}'.format(line))

def process_b(lines):
    for line in lines:
        line = line.strip()
        if not line: continue        
        print('processing B: {}'.format(line))

def header_func(line):
    if line.startswith('Header_A'):
        return process_a
    elif line.startswith('Header_B'):
        return process_b
    else: return None  # you could omit this, but it might be nice to be explicit

with open('data', 'r') as f:
    func = None
    for key, lines in IT.groupby(f, key=header_func):
        if key is None:
            # data (or blank) lines: hand them to the most recent header's function
            if func is not None:
                func(lines)
        else:
            # header line: remember which function handles the lines that follow
            func = key

Applied to the data you posted, the above code prints

processing A: 1 1.02345
processing A: 2 2.97959
processing A: ...
processing B: 1   5.1700   10.2500
processing B: 2   5.0660   10.5000
processing B: ...

The one complicated line in the code above is

for key, lines in IT.groupby(f, key=header_func):

Let's try to break it down into its component parts:

In [31]: f = open('data')

In [32]: list(IT.groupby(f, key=header_func))
Out[32]: 
[(<function __main__.process_a>, <itertools._grouper at 0xa0efecc>),
 (None, <itertools._grouper at 0xa0ef7cc>),
 (<function __main__.process_b>, <itertools._grouper at 0xa0eff0c>),
 (None, <itertools._grouper at 0xa0ef84c>)]

IT.groupby(f, key=header_func) returns an iterator. The items yielded by the iterator are 2-tuples, such as

(<function __main__.process_a>, <itertools._grouper at 0xa0efecc>)

The first item in the 2-tuple is the value returned by header_func. The second item in the 2-tuple is an iterator. This iterator yields the consecutive lines from f for which header_func(line) returns the same value.

Thus, IT.groupby is grouping the lines in f according to the return value of header_func. When the line in f is a header line -- either Header_A or Header_B -- then header_func returns process_a or process_b, the function we wish to use to process subsequent lines.

When the line in f is a header line, the group of lines returned by IT.groupby (the second item in the 2-tuple) is short and uninteresting -- it is just the header line.

We need to look in the next group for the interesting lines. For these lines, header_func returns None.

So we need to look at two 2-tuples: the first 2-tuple yielded by IT.groupby gives us the function to use, and the second gives the lines to which that function should be applied.

Once you have both the function and the iterator with the interesting lines, you just call func(lines) and you're done!

Notice that it would be very easy to expand this to process other kinds of headers. You would only need to write another process_* function, and modify header_func to return process_* when the line indicates to do so.
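
For example, supporting a hypothetical Header_C block would only require a new process_c and one extra branch in header_func (sketch):

def process_c(lines):
    for line in lines:
        line = line.strip()
        if not line: continue
        print('processing C: {}'.format(line))

def header_func(line):
    if line.startswith('Header_A'):
        return process_a
    elif line.startswith('Header_B'):
        return process_b
    elif line.startswith('Header_C'):  # the only change to the dispatcher
        return process_c
    else:
        return None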


Edit: I removed the use of izip(*[iterator]*2) since it assumes the first line is a header line. The first line could be blank or a non-header line, which would throw everything off. I replaced it with some if-statements. It's not quite as succinct, but the result is a bit more robust.

#2


How about splitting the logic for parsing each header's type of data out into separate functions, then using a dictionary to map from a given header to the right parser:

def parse_data_a(iterator):
    next(iterator) # throw away the blank line after the header
    for line in iterator:
        if not line.strip():
            break  # bail out if we find a blank line; another header is about to start
        # do stuff with each line here

# define similar functions to parse other blocks of data, e.g. parse_data_b()

# define a mapping from header strings to the functions that parse the following data
parser_for_header = {"Header_A": parse_data_a} # put other parsers in here too!

def parse(lines):
    iterator = iter(lines)
    for line in iterator:
        header = line.strip()
        if header in parser_for_header:
            parser_for_header[header](iterator)

This code uses iteration, rather than indexing, to handle the lines. An advantage of this is that you can run it directly on a file as well as on a list of lines, since files are iterable. It also makes the bounds checking very easy, since a for loop ends automatically when there's nothing left in the iterable, as well as when a break statement is hit.
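
For example (assuming the input lives in a file called data, and reusing data_lines from the question):

# works directly on a file object -- lines are read lazily
with open('data') as f:
    parse(f)

# and equally well on a list of lines already in memory
parse(data_lines)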

Depending on what you're doing with the data you're parsing, you may need the individual parsers to return something, rather than just going off and doing their own thing. In that case, you'll need some logic in the top-level parse function to collect the results and assemble them into some useful format. Perhaps a dictionary would make the most sense, with the last line becoming:

results_dict[header] = parser_for_header[header](iterator)
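
A minimal sketch of that variant, assuming each parse_data_* function returns what it parsed (here Header_A's rows are taken to be an integer index plus one float, as in the sample data):

def parse_data_a(iterator):
    next(iterator)  # throw away the blank line after the header
    rows = []
    for line in iterator:
        if not line.strip():
            break
        index, value = line.split()
        rows.append((int(index), float(value)))
    return rows

parser_for_header = {"Header_A": parse_data_a}  # add the other parsers here too

def parse(lines):
    results_dict = {}
    iterator = iter(lines)
    for line in iterator:
        header = line.strip()
        if header in parser_for_header:
            results_dict[header] = parser_for_header[header](iterator)
    return results_dict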

#3


You can do it with the send() method of generators as well :)

data_lines = [
    'Header_A   ',
    '',
    '',
    '1 1.02345',
    '2 2.97959',
    '',
]

def process_header_a(line):
    while True:
        line = yield line
        # process line
        print 'A', line

header_processors = {
    'Header_A': process_header_a(None),
}

current_processor = None
for line in data_lines:
    line = line.strip()
    if line in header_processors:
        current_processor = header_processors[line]
        current_processor.send(None)
    elif line:
        current_processor.send(line)    

for processor in header_processors.values():
    processor.close()
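
Extending this to another block type just means adding one more coroutine to the dict; a sketch for Header_B (the actual column handling is left out):

def process_header_b(line):
    while True:
        line = yield line
        # process a Header_B data line
        print 'B', line

header_processors = {
    'Header_A': process_header_a(None),
    'Header_B': process_header_b(None),
}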

You can remove all if conditions from the main loop if you replace

current_processor = None
for line in data_lines:
    line = line.strip()
    if line in header_processors:
        current_processor = header_processors[line]
        current_processor.send(None)
    elif line:
        current_processor.send(line)    

with

map(next, header_processors.values())  # prime every coroutine (map is eager in Python 2)
current_processor = header_processors['Header_A']  # arbitrary default before the first header
for line in data_lines:
    line = line.strip()
    # switch coroutines whenever the line is a known header
    current_processor = header_processors.get(line, current_processor)
    # forward non-blank, non-header lines to the active coroutine
    line and line not in header_processors and current_processor.send(line)
