Parsing a template schema with Python and regular expressions

Date: 2021-04-22 07:38:30

I'm working on a script for work to extract data from an old template engine schema:

[%price%]
{
$54.99
}
[%/price%]

[%model%]
{
WRT54G
}
[%/model%]

[%brand%]{
LINKSYS
}
[%/brand%]

Everything within the [% %] is the key, and everything in the { } is the value. Using Python and regex, I was able to get this far: (?<=\[%)(?P<key>\w*?)(?=\%\])

which returns ['price', 'model', 'brand']

I'm just having a problem getting it to match the data inside the braces as the value.

3 answers

#1


just for grins:

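import re

# `test` is assumed to hold the template text from the question
RE_kv = re.compile(r"\[%(.*)%\].*?\n?\s*\{\s*(.*)")
# call findall on the compiled pattern; re.findall rejects flags with a compiled pattern
matches = RE_kv.findall(test)
for k, v in matches:
    print(k, v)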

output:

price $54.99
model WRT54G
brand LINKSYS

Note that I wrote just enough regex to get the matches to show up; it isn't even anchored on the closing brace at the end. Use at your own risk.
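
For what it's worth, here is a rough sketch of how the match could be bounded by the closing brace and the matching closing tag. The names SAMPLE, BLOCK_RE and parse_blocks are made up for illustration, and it assumes keys are plain \w+ words and that values never contain a closing brace followed by the closing tag.

```python
import re

# Sample input copied from the question.
SAMPLE = """[%price%]
{
$54.99
}
[%/price%]

[%model%]
{
WRT54G
}
[%/model%]

[%brand%]{
LINKSYS
}
[%/brand%]
"""

# The opening tag captures the key, the braces delimit the value, and the
# (?P=key) backreference forces the closing tag to name the same key.
# re.DOTALL lets the value span several lines.
BLOCK_RE = re.compile(
    r"\[%(?P<key>\w+)%\]\s*\{\s*(?P<value>.*?)\s*\}\s*\[%/(?P=key)%\]",
    re.DOTALL,
)


def parse_blocks(text):
    """Return a dict mapping each key to its brace-delimited value."""
    return {m.group("key"): m.group("value") for m in BLOCK_RE.finditer(text)}


if __name__ == "__main__":
    for key, value in parse_blocks(SAMPLE).items():
        print(key, value)
```

Blocks that don't fit the pattern are silently skipped, so this still shares the usual weakness of a single-regex approach.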

#2


I agree with Devin that a single regex isn't the best solution. If there do happen to be any strange cases that aren't handled by your regex, there's a real risk that you won't find out.

I'd suggest using a finite state machine approach. Parse the file line by line, first looking for a price-model-brand block, then parse whatever is within the braces. Also, make sure to note if any blocks aren't opened or closed correctly as these are probably malformed.

You should be able to write something like this in Python in about 30-40 lines of code.
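
To make that concrete, here is a minimal sketch of such a line-by-line state machine. It is only one possible reading of the idea: the state names, the regexes and the parse_template function are hypothetical, and it assumes each tag and each brace sits on its own line (apart from an opening brace directly after the opening tag, as in the brand block).

```python
import re

# Hypothetical helpers: an opening tag (optionally followed by "{") and a closing tag.
OPEN_TAG = re.compile(r"^\[%(\w+)%\]\s*(\{?)\s*$")
CLOSE_TAG = re.compile(r"^\[%/(\w+)%\]$")


def parse_template(text):
    """Parse key/value blocks line by line; raise ValueError on malformed input."""
    result = {}
    state = "outside"  # outside -> expect_brace -> in_value -> expect_close
    key, value_lines = None, []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line and state != "in_value":
            continue  # blank lines only matter inside a value
        if state == "outside":
            m = OPEN_TAG.match(line)
            if not m:
                raise ValueError(f"line {lineno}: expected opening tag, got {line!r}")
            key, value_lines = m.group(1), []
            state = "in_value" if m.group(2) else "expect_brace"
        elif state == "expect_brace":
            if line != "{":
                raise ValueError(f"line {lineno}: expected '{{', got {line!r}")
            state = "in_value"
        elif state == "in_value":
            if line == "}":
                state = "expect_close"
            else:
                value_lines.append(line)
        else:  # expect_close
            m = CLOSE_TAG.match(line)
            if not m or m.group(1) != key:
                raise ValueError(f"line {lineno}: expected [%/{key}%], got {line!r}")
            result[key] = "\n".join(value_lines).strip()
            state = "outside"
    if state != "outside":
        raise ValueError(f"unexpected end of input inside block {key!r}")
    return result
```

Calling parse_template on the text from the question should give a dict with the price, model and brand entries, while an unclosed or mismatched block raises ValueError instead of being skipped quietly.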

#3


It looks like it'd be easier to do with re.Scanner (sadly undocumented) than with a single regular expression.
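
A rough sketch of what that might look like follows. re.Scanner takes a list of (pattern, action) pairs, and its scan() method returns the list of tokens plus any unmatched remainder. The token names, the scanner variable and the parse_with_scanner helper are my own choices for illustration, and the value pattern assumes values contain no closing brace.

```python
import re

# Each lexicon entry is (pattern, action); an action of None discards the match.
scanner = re.Scanner([
    (r"\[%/\w+%\]", lambda s, tok: ("CLOSE", tok[3:-2])),        # [%/key%]
    (r"\[%\w+%\]", lambda s, tok: ("OPEN", tok[2:-2])),          # [%key%]
    (r"\{[^}]*\}", lambda s, tok: ("VALUE", tok[1:-1].strip())),  # { value }
    (r"\s+", None),                                              # skip whitespace
])


def parse_with_scanner(text):
    """Tokenize the text, then pair up OPEN / VALUE / CLOSE triples."""
    tokens, remainder = scanner.scan(text)
    if remainder:
        raise ValueError(f"could not tokenize input starting at {remainder[:20]!r}")
    result = {}
    for i in range(0, len(tokens), 3):
        chunk = tokens[i:i + 3]
        kinds = [kind for kind, _ in chunk]
        if kinds != ["OPEN", "VALUE", "CLOSE"] or chunk[0][1] != chunk[2][1]:
            raise ValueError(f"malformed block near token {i}: {chunk!r}")
        result[chunk[0][1]] = chunk[1][1]
    return result
```

The scanner only does the tokenizing; checking that the tokens come in well-formed triples is left to ordinary Python, which keeps each regex small.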
