Python,如何解析字符串看起来像sys.argv

时间:2022-06-30 20:57:15

I would like to parse a string like this:

我想解析这样的字符串:

-o 1  --long "Some long string"  

into this:

进入这个:

["-o", "1", "--long", 'Some long string']

or similar.

或类似的。

This is different than either getopt, or optparse, which start with sys.argv parsed input (like the output I have above). Is there a standard way to do this? Basically, this is "splitting" while keeping quoted strings together.

这与getopt或optparse不同,后者以sys.argv解析的输入开头(就像我上面的输出一样)。有没有标准的方法来做到这一点?基本上,这是“分裂”,同时保持引用的字符串在一起。

My best function so far:

到目前为止我的最佳功能:

import csv
def split_quote(string,quotechar='"'):
    '''

    >>> split_quote('--blah "Some argument" here')
    ['--blah', 'Some argument', 'here']

    >>> split_quote("--blah 'Some argument' here", quotechar="'")
    ['--blah', 'Some argument', 'here']
    '''
    s = csv.StringIO(string)
    C = csv.reader(s, delimiter=" ",quotechar=quotechar)
    return list(C)[0]

2 个解决方案

#1


71  

I believe you want the shlex module.

我相信你想要shlex模块。

>>> import shlex
>>> shlex.split('-o 1 --long "Some long string"')
['-o', '1', '--long', 'Some long string']

#2


0  

Before I was aware of shlex.split, I made the following:

在我了解shlex.split之前,我做了以下内容:

import sys

_WORD_DIVIDERS = set((' ', '\t', '\r', '\n'))

_QUOTE_CHARS_DICT = {
    '\\':   '\\',
    ' ':    ' ',
    '"':    '"',
    'r':    '\r',
    'n':    '\n',
    't':    '\t',
}

def _raise_type_error():
    raise TypeError("Bytes must be decoded to Unicode first")

def parse_to_argv_gen(instring):
    is_in_quotes = False
    instring_iter = iter(instring)
    join_string = instring[0:0]

    c_list = []
    c = ' '
    while True:
        # Skip whitespace
        try:
            while True:
                if not isinstance(c, str) and sys.version_info[0] >= 3:
                    _raise_type_error()
                if c not in _WORD_DIVIDERS:
                    break
                c = next(instring_iter)
        except StopIteration:
            break
        # Read word
        try:
            while True:
                if not isinstance(c, str) and sys.version_info[0] >= 3:
                    _raise_type_error()
                if not is_in_quotes and c in _WORD_DIVIDERS:
                    break
                if c == '"':
                    is_in_quotes = not is_in_quotes
                    c = None
                elif c == '\\':
                    c = next(instring_iter)
                    c = _QUOTE_CHARS_DICT.get(c)
                if c is not None:
                    c_list.append(c)
                c = next(instring_iter)
            yield join_string.join(c_list)
            c_list = []
        except StopIteration:
            yield join_string.join(c_list)
            break

def parse_to_argv(instring):
    return list(parse_to_argv_gen(instring))

This works with Python 2.x and 3.x. On Python 2.x, it works directly with byte strings and Unicode strings. On Python 3.x, it only accepts [Unicode] strings, not bytes objects.

这适用于Python 2.x和3.x.在Python 2.x上,它直接使用字节字符串和Unicode字符串。在Python 3.x上,它只接受[Unicode]字符串,而不接受字节对象。

This doesn't behave exactly the same as shell argv splitting—it also allows quoting of CR, LF and TAB characters as \r, \n and \t, converting them to real CR, LF, TAB (shlex.split doesn't do that). So writing my own function was useful for my needs. I guess shlex.split is better if you just want plain shell-style argv splitting. I'm sharing this code in case it's useful as a baseline for doing something slightly different.

这与shell argv拆分的行为完全不同 - 它还允许引用CR,LF和TAB字符为\ r,\ n和\ t,将它们转换为实际的CR,LF,TAB(shlex.split不会去做)。所以编写我自己的函数对我的需求很有用。我想shlex.split更好,如果你只是想要简单的shell样式的argv拆分。我正在分享这段代码,以防它作为执行稍微不同的事情的基线。

#1


71  

I believe you want the shlex module.

我相信你想要shlex模块。

>>> import shlex
>>> shlex.split('-o 1 --long "Some long string"')
['-o', '1', '--long', 'Some long string']

#2


0  

Before I was aware of shlex.split, I made the following:

在我了解shlex.split之前,我做了以下内容:

import sys

_WORD_DIVIDERS = set((' ', '\t', '\r', '\n'))

_QUOTE_CHARS_DICT = {
    '\\':   '\\',
    ' ':    ' ',
    '"':    '"',
    'r':    '\r',
    'n':    '\n',
    't':    '\t',
}

def _raise_type_error():
    raise TypeError("Bytes must be decoded to Unicode first")

def parse_to_argv_gen(instring):
    is_in_quotes = False
    instring_iter = iter(instring)
    join_string = instring[0:0]

    c_list = []
    c = ' '
    while True:
        # Skip whitespace
        try:
            while True:
                if not isinstance(c, str) and sys.version_info[0] >= 3:
                    _raise_type_error()
                if c not in _WORD_DIVIDERS:
                    break
                c = next(instring_iter)
        except StopIteration:
            break
        # Read word
        try:
            while True:
                if not isinstance(c, str) and sys.version_info[0] >= 3:
                    _raise_type_error()
                if not is_in_quotes and c in _WORD_DIVIDERS:
                    break
                if c == '"':
                    is_in_quotes = not is_in_quotes
                    c = None
                elif c == '\\':
                    c = next(instring_iter)
                    c = _QUOTE_CHARS_DICT.get(c)
                if c is not None:
                    c_list.append(c)
                c = next(instring_iter)
            yield join_string.join(c_list)
            c_list = []
        except StopIteration:
            yield join_string.join(c_list)
            break

def parse_to_argv(instring):
    return list(parse_to_argv_gen(instring))

This works with Python 2.x and 3.x. On Python 2.x, it works directly with byte strings and Unicode strings. On Python 3.x, it only accepts [Unicode] strings, not bytes objects.

这适用于Python 2.x和3.x.在Python 2.x上,它直接使用字节字符串和Unicode字符串。在Python 3.x上,它只接受[Unicode]字符串,而不接受字节对象。

This doesn't behave exactly the same as shell argv splitting—it also allows quoting of CR, LF and TAB characters as \r, \n and \t, converting them to real CR, LF, TAB (shlex.split doesn't do that). So writing my own function was useful for my needs. I guess shlex.split is better if you just want plain shell-style argv splitting. I'm sharing this code in case it's useful as a baseline for doing something slightly different.

这与shell argv拆分的行为完全不同 - 它还允许引用CR,LF和TAB字符为\ r,\ n和\ t,将它们转换为实际的CR,LF,TAB(shlex.split不会去做)。所以编写我自己的函数对我的需求很有用。我想shlex.split更好,如果你只是想要简单的shell样式的argv拆分。我正在分享这段代码,以防它作为执行稍微不同的事情的基线。