I know there are a lot of other posts about parsing comma-separated values, but I couldn't find one that splits key-value pairs and handles quoted commas.
I have strings like this:
age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"
And I want to get this:
{
'age': '12',
'name': 'bob',
'hobbies': 'games,reading',
'phrase': "I'm cool!",
}
I tried using shlex like this:
import shlex

lexer = shlex.shlex('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"''')
lexer.whitespace_split = True
lexer.whitespace = ','
props = dict(pair.split('=', 1) for pair in lexer)
The trouble is that shlex will split the hobbies entry into two tokens, i.e. hobbies="games and reading". Is there a way to make it take the double quotes into account? Or is there another module I can use?
EDIT: Fixed typo for whitespace_split
EDIT 2: I'm not tied to using shlex. Regex is fine too, but I didn't know how to handle the matching quotes.
5 Solutions
#1
5
You just needed to use your shlex lexer in POSIX mode.
Add posix=True when creating the lexer.
(See the shlex parsing rules)
import shlex

lexer = shlex.shlex('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"''', posix=True)
lexer.whitespace_split = True
lexer.whitespace = ','
props = dict(pair.split('=', 1) for pair in lexer)
print(props)
Output:
{'age': '12', 'phrase': "I'm cool!", 'hobbies': 'games,reading', 'name': 'bob'}
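To see what the POSIX flag changes, the sketch below tokenizes the same string in both modes. The helper name `tokens` is made up for this illustration; the rest is the standard `shlex` API used above.

```python
import shlex

TEXT = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''

def tokens(text, posix):
    # Hypothetical helper: split on commas only, in the requested mode.
    lexer = shlex.shlex(text, posix=posix)
    lexer.whitespace_split = True
    lexer.whitespace = ','
    return list(lexer)

non_posix = tokens(TEXT, posix=False)  # quotes inside a word are not honoured
posix = tokens(TEXT, posix=True)       # quotes are honoured and stripped

print(non_posix)
print(posix)
```

In non-POSIX mode the hobbies entry comes back as two tokens, which is why the dict construction in the question fails; in POSIX mode it stays in one piece and the quotes are removed.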
PS: Regular expressions won't be able to parse key-value pairs as long as the input can contain quoted = or , characters. Even preprocessing the string wouldn't be able to make the input be parsed by a regular expression, because that kind of input cannot be formally defined as a regular language.
#2
4
It's possible to do with a regular expression. In this case, it might actually be the best option, too. I think this will work with most input, even escaped quotes such as this one: phrase='I\'m cool'
With the VERBOSE flag, it's possible to make complicated regular expressions quite readable.
import re
text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
regex = re.compile(
    r'''
        (?P<key>\w+)=      # Key consists of word characters (letters, digits, underscore)
        (?P<quote>["']?)   # Optional quote character.
        (?P<value>.*?)     # Value is a non greedy match
        (?P=quote)         # Closing quote equals the first.
        ($|,)              # Entry ends with comma or end of string
    ''',
    re.VERBOSE
)
d = {match.group('key'): match.group('value') for match in regex.finditer(text)}
print(d) # {'name': 'bob', 'phrase': "I'm cool!", 'age': '12', 'hobbies': 'games,reading'}
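As a quick check of the escaped-quote case mentioned above, this sketch runs the same pattern over a `phrase='I\'m cool'` input. The names `PAIR` and `parse` are illustrative only. Note that the backslash stays in the captured value, so unescaping would be a separate step.

```python
import re

PAIR = re.compile(
    r'''
        (?P<key>\w+)=      # key
        (?P<quote>["']?)   # optional opening quote
        (?P<value>.*?)     # non-greedy value
        (?P=quote)         # same quote closes the value
        ($|,)              # comma or end of string
    ''',
    re.VERBOSE,
)

def parse(text):
    # Hypothetical helper wrapping the dict comprehension from the answer.
    return {m.group('key'): m.group('value') for m in PAIR.finditer(text)}

escaped = parse(r"phrase='I\'m cool',age=12")
print(escaped)
```

The match succeeds because a closing quote only counts when it is followed by a comma or the end of the string; a quoted value that itself contains a quote directly followed by a comma would still be cut short.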
#3
2
You could abuse Python's tokenizer to parse the key-value list:
#!/usr/bin/env python
from tokenize import generate_tokens, NAME, NUMBER, OP, STRING, ENDMARKER

def parse_key_value_list(text):
    key = value = None
    for type, string, _, _, _ in generate_tokens(lambda it=iter([text]): next(it)):
        if type == NAME and key is None:
            key = string
        elif type in {NAME, NUMBER, STRING}:
            value = {
                NAME: lambda x: x,
                NUMBER: int,
                STRING: lambda x: x[1:-1]
            }[type](string)
        elif ((type == OP and string == ',') or
              (type == ENDMARKER and key is not None)):
            yield key, value
            key = value = None

text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
print(dict(parse_key_value_list(text)))
Output:
{'phrase': "I'm cool!", 'age': 12, 'name': 'bob', 'hobbies': 'games,reading'}
You could use a finite-state machine (FSM) to implement a stricter parser. The parser uses only the current state and the next token to parse input:
#!/usr/bin/env python
from tokenize import (generate_tokens, NAME, NUMBER, OP, STRING,
                      ENDMARKER, NEWLINE, NL)

def parse_key_value_list(text):
    def check(condition):
        if not condition:
            raise ValueError((state, token))

    KEY, EQ, VALUE, SEP = range(4)
    state = KEY
    for token in generate_tokens(lambda it=iter([text]): next(it)):
        type, string = token[:2]
        # Recent tokenizers emit a NEWLINE token before ENDMARKER even when
        # the input has no trailing newline; skip line-break tokens.
        if type in {NEWLINE, NL}:
            continue
        if state == KEY:
            check(type == NAME)
            key = string
            state = EQ
        elif state == EQ:
            check(type == OP and string == '=')
            state = VALUE
        elif state == VALUE:
            check(type in {NAME, NUMBER, STRING})
            value = {
                NAME: lambda x: x,
                NUMBER: int,
                STRING: lambda x: x[1:-1]
            }[type](string)
            state = SEP
        elif state == SEP:
            check(type == OP and string == ',' or type == ENDMARKER)
            yield key, value
            state = KEY

text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
print(dict(parse_key_value_list(text)))
#4
1
Ok, I actually figured out a pretty nifty way, which is to split on both comma and equal sign, then take 2 tokens at a time.
import shlex

input_str = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
lexer = shlex.shlex(input_str)
lexer.whitespace_split = True
lexer.whitespace = ',='

ret = {}
try:
    while True:
        key = next(lexer)
        value = next(lexer)

        # Remove surrounding quotes
        if len(value) >= 2 and (value[0] == value[-1] == '"' or
                                value[0] == value[-1] == '\''):
            value = value[1:-1]

        ret[key] = value
except StopIteration:
    # Somehow do error checking to see if you ended up with an extra token.
    pass

print(ret)
Then you get:
{
'age': '12',
'name': 'bob',
'hobbies': 'games,reading',
'phrase': "I'm cool!",
}
However, this doesn't check that you don't have weird stuff like: age,12=name,bob, but I'm ok with that in my use case.
EDIT: Handle both double-quotes and single-quotes.
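For the "extra token" check left open in the except block, one possible sketch is to collect all tokens first and reject an odd count. The function name `parse_pairs` and the error message are assumptions made for this example. It still cannot tell age,12=name,bob apart from valid input, since that also yields an even number of tokens.

```python
import shlex

def parse_pairs(text):
    # Hypothetical helper: same splitting as above, plus a token-count check.
    lexer = shlex.shlex(text)
    lexer.whitespace_split = True
    lexer.whitespace = ',='
    tokens = list(lexer)

    # An odd number of tokens means a key was left without a value.
    if len(tokens) % 2:
        raise ValueError('dangling token: %r' % tokens[-1])

    result = {}
    for key, value in zip(tokens[::2], tokens[1::2]):
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in '"\'':
            value = value[1:-1]
        result[key] = value
    return result

print(parse_pairs('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''))
```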
#5
0
Python seems to offer many ways to solve this task. Here is a more C-like implementation that processes each character. It would be interesting to compare the run times.
text = 'age=12,name=bob,hobbies="games,reading",phrase="I\'m cool!"'
key = ""
val = ""
result = {}  # avoid shadowing the built-in dict
parse_string = False
parse_key = True
# parse_val = False
for c in text:
    if c == '"' and not parse_string:
        parse_string = True
        continue
    elif c == '"' and parse_string:
        parse_string = False
        continue
    if parse_string:
        val += c
        continue
    if c == ',':  # terminate entry
        result[key] = val  # add to dict
        key = ""
        val = ""
        parse_key = True
        continue
    elif c == '=' and parse_key:
        parse_key = False
    elif parse_key:
        key += c
    else:
        val += c
result[key] = val  # store the last entry
print(result)
# {'phrase': "I'm cool!", 'age': '12', 'name': 'bob', 'hobbies': 'games,reading'}
demo: http://repl.it/6oC/1
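On the run-time question, one way to compare approaches is the standard `timeit` module. The sketch below times only the shlex and regex variants from the earlier answers; the function names and the repeat count of 1000 are arbitrary choices for illustration. Actual numbers depend on the machine and the Python version, so none are shown here.

```python
import re
import shlex
import timeit

TEXT = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''

def with_shlex(text=TEXT):
    # POSIX-mode shlex, splitting on commas only.
    lexer = shlex.shlex(text, posix=True)
    lexer.whitespace_split = True
    lexer.whitespace = ','
    return dict(pair.split('=', 1) for pair in lexer)

PAIR = re.compile(r'''(?P<key>\w+)=(?P<quote>["']?)(?P<value>.*?)(?P=quote)($|,)''')

def with_regex(text=TEXT):
    # Precompiled pattern, one match per key-value pair.
    return {m.group('key'): m.group('value') for m in PAIR.finditer(text)}

# Time each parser over the same input.
timings = {
    func.__name__: timeit.timeit(func, number=1000)
    for func in (with_shlex, with_regex)
}
for name, seconds in timings.items():
    print('%s: %.4f s for 1000 runs' % (name, seconds))
```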