This question has been asked and answered many times before. Some examples: [1], [2]. But there doesn't seem to be a more general solution. What I'm looking for is a way to split strings at commas that are not within quotes or pairs of delimiters. For instance:
s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'
should be split into a list of three elements:
['obj<1, 2, 3>', 'x(4, 5)', '"msg, with comma"']
The problem gets more complicated because pairs of <> and () can be nested:
s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'
which should be split into:
['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']
The naive solution, without using regex, is to scan the string for the characters , < and (. If either < or ( is found, we start counting the parity (really the nesting depth). We can only split at a comma when the parity is zero. For instance, to split s2 we start with parity = 0, and when we reach s2[3] we encounter <, which increases the parity by 1. The parity increases whenever we encounter < or ( and decreases whenever we encounter > or ). While the parity is not 0 we simply ignore commas and do no splitting.
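The counting scheme described above can be sketched as follows. This is only an illustration of the description: the function name parity_at_commas is hypothetical, and, like the description, it tracks brackets only and ignores quotes.

```python
def parity_at_commas(s, opens="<(", closes=">)"):
    """Return the parity (nesting depth) seen at each comma, in order."""
    parity = 0
    seen = []
    for ch in s:
        if ch in opens:
            parity += 1      # entering a bracketed region
        elif ch in closes:
            parity -= 1      # leaving a bracketed region
        elif ch == ",":
            seen.append(parity)
    return seen

s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'
parities = parity_at_commas(s2)
print(parities)
```

Only commas with parity 0 are candidates for splitting. Note that a bracket-only counter also reports parity 0 for the comma inside the quoted string, so quotes need to be tracked separately.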
The question here is: is there a way to do this quickly with regex? I looked closely at this solution, but it doesn't seem to cover the examples I have given.
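To illustrate why a quote-only regex falls short, here is a sketch using one common quote-aware lookahead pattern. The pattern is an assumption chosen for illustration and is not necessarily the solution referred to above.

```python
import re

# Split on commas that are followed by an even number of double quotes,
# i.e. commas that are not inside a double-quoted string.
QUOTE_AWARE_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'
parts = QUOTE_AWARE_COMMA.split(s1)
print(parts)
```

The quoted string survives intact, but the pattern knows nothing about <> or (), so obj<1, 2, 3> and x(4, 5) are still cut apart at their inner commas.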
A more general function would be something like this:
def split_at(text, delimiter, exceptions):
    """Split text at the specified delimiter if the delimiter is not
    within the exceptions"""
It would be used like this:
split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',', [('<', '>'), ('(', ')'), ('"', '"')])
Would regex be able to handle this or is it necessary to create a specialized parser?
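For reference, the signature above could be filled in with a small stack-based scanner. This is a minimal sketch under some assumptions: the delimiter is a single character, a pair whose opener equals its closer (such as quotes) hides everything up to its closer, and unmatched closers are silently ignored.

```python
def split_at(text, delimiter, exceptions):
    """Split text at delimiter unless it lies inside one of the exception pairs."""
    closer_of = dict(exceptions)                          # opener -> closer
    symmetric = {o for o, c in exceptions if o == c}      # e.g. quotes
    stack = []                                            # closers still awaited
    fields = []
    start = 0
    for i, ch in enumerate(text):
        if stack and ch == stack[-1]:
            stack.pop()                                   # innermost region closed
        elif stack and stack[-1] in symmetric:
            continue                                      # inside quotes: skip all
        elif ch in closer_of:
            stack.append(closer_of[ch])                   # new region opened
        elif ch == delimiter and not stack:
            fields.append(text[start:i])                  # top-level delimiter
            start = i + 1
    fields.append(text[start:])
    return fields

pairs = [('<', '>'), ('(', ')'), ('"', '"')]
s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'
print(split_at(s2, ',', pairs))
```

Like str.split, this keeps leading spaces and empty fields; call strip() on each field if they are unwanted.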
3 Solutions
#1
8
While a regular expression cannot do this in general (Python's built-in re module cannot match arbitrarily nested delimiters), the following simple code achieves the desired result:
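def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    result = []
    buff = ""
    level = 0     # current nesting depth of brackets
    quote = None  # the quote character we are currently inside, if any
    for char in text:
        if char in delimiter and level == 0 and quote is None:
            # Top-level delimiter: finish the current field.
            result.append(buff)
            buff = ""
            continue
        buff += char
        if quote is not None:
            # Inside quotes only the matching quote character matters.
            if char == quote:
                quote = None
        elif char in quotes:
            quote = char
        elif char in opens:
            level += 1
        elif char in closes:
            level -= 1
    if buff != "":
        result.append(buff)
    return result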
Running this in the interpreter:
>>> split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',')
['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
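A single level counter treats all brackets alike, so it cannot tell whether a < was closed by > or by ). If that matters, a stack-based check along the following lines could be run first. This is a sketch; the name delimiters_match and its default arguments are hypothetical.

```python
def delimiters_match(text, pairs=(("<", ">"), ("(", ")"), ("[", "]")), quotes="\"'"):
    """Return True if every opener is closed by its own closer, in order."""
    closer_of = dict(pairs)
    closers = set(closer_of.values())
    stack = []     # closers we still expect, innermost last
    quote = None   # the quote character we are inside, if any
    for ch in text:
        if quote is not None:
            if ch == quote:
                quote = None
        elif ch in quotes:
            quote = ch
        elif ch in closer_of:
            stack.append(closer_of[ch])
        elif ch in closers:
            if not stack or stack.pop() != ch:
                return False   # closer without opener, or wrong kind of closer
    return not stack and quote is None

print(delimiters_match('obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'))
print(delimiters_match('bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5)'))
```

The first call reports a well-formed string; the second fails because the outer < is closed by ).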
#2
5
Using iterators and generators:
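from collections import defaultdict
def tokenize(txt, delim=',', pairs={'"':'"', '<':'>', '(':')'}):
    fst, snd = set(pairs.keys()), set(pairs.values())
    it = iter(txt)
    def loop():
        # Yield the characters of one token, stopping at a top-level delimiter.
        cnt = defaultdict(int)  # number of open regions, keyed by closing character
        # A for loop is used instead of it.__next__() so that StopIteration
        # does not escape the generator (a RuntimeError since Python 3.7).
        for ch in it:
            if ch == delim and not any(cnt[x] for x in snd):
                return
            elif pairs.get(ch) == ch:
                # Symmetric pair such as quotes: toggle between open and closed.
                cnt[ch] = 1 - cnt[ch]
            elif ch in fst:
                cnt[pairs[ch]] += 1
            elif ch in snd:
                cnt[ch] -= 1
            yield ch
    while it.__length_hint__():
        yield ''.join(loop())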
and:
>>> txt = 'obj<1, sub<6, 7>, 3>,x(4, y(8, 9), 5),"msg, with comma"'
>>> [x for x in tokenize(txt)]
['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']
#3
4
If you have recursively nested expressions, you can split on the commas and also validate that the delimiters match by using pyparsing:
import pyparsing as pp
def CommaSplit(txt):
    ''' Replicate the function of str.split(',') but do not split on nested expressions or in quoted strings'''
    com_lok=[]
    comma = pp.Suppress(',')
    # note the location of each comma outside an ignored expression:
    comma.setParseAction(lambda s, lok, toks: com_lok.append(lok))
    ident = pp.Word(pp.alphas+"_", pp.alphanums+"_")  # python identifier
    ex1=(ident+pp.nestedExpr(opener='<', closer='>'))  # Ignore everything inside nested '< >'
    ex2=(ident+pp.nestedExpr())  # Ignore everything inside nested '( )'
    ex3=pp.Regex(r'("|\').*?\1')  # Ignore everything inside "'" or '"'
    atom = ex1 | ex2 | ex3 | comma
    expr = pp.OneOrMore(atom) + pp.ZeroOrMore(comma + atom )
    try:
        result=expr.parseString(txt)
    except pp.ParseException:
        # unbalanced or unrecognized input: return it unsplit
        return [txt]
    else:
        # slice the text between the recorded comma locations
        return [txt[st:end] for st,end in zip([0]+[e+1 for e in com_lok],com_lok+[len(txt)])]
tests='''\
obj<1, 2, 3>, x(4, 5), "msg, with comma"
nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma"
nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3>
bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5), , 'msg, with comma', obj<1, sub<6, 7>, 3>
bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)
'''
for te in tests.splitlines():
    result=CommaSplit(te)
    print(te,'==>\n\t',result)
Prints:
obj<1, 2, 3>, x(4, 5), "msg, with comma" ==>
['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma" ==>
['nesteobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', ' "msg, with comma"']
nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3> ==>
['nestedobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', " 'msg, with comma'", ' additional<1, sub<6, 7>, 3>']
bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5), , 'msg, with comma', obj<1, sub<6, 7>, 3> ==>
['bare_comma<1, sub(6, 7), 3>', ' x(4, y(8, 9), 5)', ' ', " 'msg, with comma'", ' obj<1, sub<6, 7>, 3>']
bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3) ==>
["bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)"]
The current behavior is just like '(something does not split), b, "in quotes", c'.split(','), including keeping the leading spaces and the quotes. It is trivial to strip the quotes and leading spaces from the fields.
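Change the else under try to the following (strip_fields is not defined in the code above; it is presumably a boolean parameter added to CommaSplit):
    else:
        rtr = [txt[st:end] for st,end in zip([0]+[e+1 for e in com_lok],com_lok+[len(txt)])]
        if strip_fields:
            rtr=[e.strip().strip('\'"') for e in rtr]
        return rtr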