Python regular expression to parse a string and return a tuple

Time: 2021-08-23 15:50:19

I've been given some strings to work with. Each one represents a data set and consists of the data set's name and the associated statistics. They all have the following form:

s= "| 'TOMATOES_PICKED'                                  |       914 |       1397 |"

I'm trying to implement a function that will parse the string and return the name of the data set, the first number, and the second number. There are lots of these strings and each one has a different name and associated stats so I've figured the best way to do this is with regular expressions. Here's what I have so far:

def extract_data2(s):
    import re
    name=re.search('\'(.*?)\'',s).group(1)
    n1=re.search('\|(.*)\|',s)
    return(name,n1,)

So I've done a bit of reading on regular expressions and figured out how to return the name. For each of the strings that I'm working with, the name of the data set is bounded by ' ' so that's how I found the name. That part works fine. My problem is with getting the numbers. What I'm thinking right now is to try to match a pattern that is preceded by a vertical bar ('|'), then anything (which is why I used .*), and followed by another vertical bar to try to get the first number. Does anyone know how I can do this in Python? What I tried in the above code for the first number returns basically the whole string as my output, whereas I want to get just the number.

I am very new to programming so I apologize if this question seems rudimentary, but I have been reading and searching quite diligently for answers that are close to my case with no luck. I appreciate any help. The idea is that it will be able to:

return(name,n1,n2)

so that when the user inputs a string, it can just parse the string and return the important information. I've noticed in my attempts to get the numbers so far that it returns the number as a string. Is there any way to return n1 or n2 as just a number? Note that for some of the strings n1 and n2 could be either integers or decimals.

6 solutions

#1


21  

I would use a single regular expression to match the entire line, with the parts I want in named groups ((?P<name>exampl*e)).

import re
def extract_data2(s):
    pattern = re.compile(r"""\|\s*                 # opening bar and whitespace
                             '(?P<name>.*?)'       # quoted name
                             \s*\|\s*(?P<n1>.*?)   # whitespace, next bar, n1
                             \s*\|\s*(?P<n2>.*?)   # whitespace, next bar, n2
                             \s*\|""", re.VERBOSE)
    match = pattern.match(s)

    name = match.group("name")
    n1 = float(match.group("n1"))
    n2 = float(match.group("n2"))

    return (name, n1, n2)

To convert n1 and n2 from strings to numbers, I use the float function. (If they were only integers, I would use the int function.)
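
Since the question says the values may be integers or decimals, one option is to try int first and fall back to float. This is only a sketch; the helper name to_number is hypothetical and not part of the answer above.

```python
def to_number(text):
    """Convert numeric text to int when possible, otherwise to float."""
    text = text.strip()
    try:
        return int(text)
    except ValueError:
        # Not a plain integer; float() raises ValueError if it is not numeric at all
        return float(text)

whole = to_number("914")
decimal = to_number(" 13.97 ")
print(whole, decimal)
```

Calling to_number(match.group("n1")) in place of float(...) would keep whole numbers as int values.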

I used the re.VERBOSE flag and raw multiline strings (r"""...""") to make the regex easier to read.
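
To illustrate what re.VERBOSE changes, the sketch below compiles the same pattern in compact and verbose form and checks that both capture the same groups. The variable names compact and verbose are chosen here for illustration only.

```python
import re

# Same pattern written on one line, without re.VERBOSE
compact = re.compile(
    r"\|\s*'(?P<name>.*?)'\s*\|\s*(?P<n1>.*?)\s*\|\s*(?P<n2>.*?)\s*\|"
)

# With re.VERBOSE, unescaped whitespace and # comments inside the pattern are ignored
verbose = re.compile(r"""
    \|\s*'(?P<name>.*?)'     # opening bar, quoted name
    \s*\|\s*(?P<n1>.*?)      # next bar, first number
    \s*\|\s*(?P<n2>.*?)      # next bar, second number
    \s*\|                    # closing bar
    """, re.VERBOSE)

s = "| 'TOMATOES_PICKED'                                  |       914 |       1397 |"
same = compact.match(s).groupdict() == verbose.match(s).groupdict()
print(same)
```

One thing to keep in mind: in verbose mode a literal space or # has to be escaped or placed inside a character class, otherwise it is dropped from the pattern.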

#2


3  

Try using split.

s= "| 'TOMATOES_PICKED'                                  |       914 |       1397 |"
print(list(map(lambda x: x.strip("' "), s.split('|')))[1:-1])
  • Split: turns your string into a list of strings
  • lambda function: strips spaces and ' characters from each piece
  • Selector: keeps only the expected parts (drops the empty first and last pieces)
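
The three steps above could be wrapped into a function that returns the tuple the question asks for. This is a minimal sketch; the function name extract_by_split is hypothetical, and it assumes every row has exactly three columns and that names never contain a '|'.

```python
def extract_by_split(s):
    """Split on '|' and return (name, n1, n2) with the numbers as floats."""
    # Drop the empty pieces before the first bar and after the last bar
    fields = [part.strip() for part in s.split('|')][1:-1]
    name, n1, n2 = fields  # raises ValueError if there are not exactly 3 columns
    return (name.strip("'"), float(n1), float(n2))

s = "| 'TOMATOES_PICKED'                                  |       914 |       1397 |"
result = extract_by_split(s)
print(result)  # ('TOMATOES_PICKED', 914.0, 1397.0)
```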

#3


2  

Using regex:

#! /usr/bin/env python

import re

tests = [
"| 'TOMATOES_PICKED'                                  |       914 |       1397 |",
"| 'TOMATOES_FLICKED'                                 |     32914 |       1123 |",
"| 'TOMATOES_RIGGED'                                  |        14 |       1343 |",
"| 'TOMATOES_PICKELED'                                |         4 |         23 |"]

def parse(s):
    # Raw string so the regex escapes reach the re module unchanged
    mo = re.match(r"\|\s*'([^']*)'\s*\|\s*(\d*)\s*\|\s*(\d*)\s*\|", s)
    if mo:
        return mo.groups()  # (name, n1, n2), all still strings

for test in tests:
    print(parse(test))
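
Note that (\d*) only matches digits, so a row holding a value such as 13.97 would not match this pattern at all, and the captured groups are strings. A possible variant that accepts an optional decimal part and converts the numbers is sketched below. The names NUMBER, ROW and parse_numeric are hypothetical, and the number format (no sign, no exponent) is an assumption.

```python
import re

# Assumed number format: digits with an optional decimal part
NUMBER = r"(\d+(?:\.\d+)?)"
ROW = re.compile(
    r"\|\s*'([^']*)'\s*\|\s*" + NUMBER + r"\s*\|\s*" + NUMBER + r"\s*\|"
)

def parse_numeric(s):
    """Return (name, n1, n2) with floats, or None if the row does not match."""
    mo = ROW.match(s)
    if mo is None:
        return None
    name, n1, n2 = mo.groups()
    return (name, float(n1), float(n2))

row = parse_numeric("| 'TOMATOES_PICKED' |  914 |  13.97 |")
bad = parse_numeric("not a table row")
print(row)  # ('TOMATOES_PICKED', 914.0, 13.97)
print(bad)  # None
```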

#4


1  

Not sure that i have correctly understood you but try this:

import re

print(re.findall(r'\b\w+\b', yourtext))
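
For reference, here is what that call returns on the sample string from the question. The second row is a made-up variant with a decimal value, included to show a limitation: \w does not match '.', so decimals are split into two pieces.

```python
import re

s = "| 'TOMATOES_PICKED'                                  |       914 |       1397 |"
words = re.findall(r'\b\w+\b', s)
print(words)  # ['TOMATOES_PICKED', '914', '1397']

# Hypothetical row with a decimal value: 3.5 is split at the dot
split_decimal = re.findall(r'\b\w+\b', "| 'X' | 3.5 | 7 |")
print(split_decimal)  # ['X', '3', '5', '7']
```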

#5


1  

I would have to agree with the other posters who suggested using the split() method on your strings. If your given string is

>>> s = "| 'TOMATOES_PICKED'                          |       914 |       1397 |"

You just split the string and voilà, you now have a list with the name in the second position and the two values in the fourth and sixth positions, i.e.

>>> s_new = s.split()
>>> s_new
['|', "'TOMATOES_PICKED'", '|', '914', '|', '1397', '|']

Of course you do also have the "|" character but that seems to be consistent in your data set so it isn't a big problem to deal with. Just ignore them.
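
Ignoring the bars can be done with a simple filter, as in the sketch below. The function name extract_by_whitespace is hypothetical. Because split() with no argument splits on any whitespace, this assumes names never contain spaces; a name with a space produces extra tokens and the unpacking fails.

```python
def extract_by_whitespace(s):
    """Split on whitespace, drop the '|' tokens, return (name, n1, n2)."""
    tokens = [tok for tok in s.split() if tok != '|']
    name, n1, n2 = tokens  # ValueError if the name contains spaces
    return (name.strip("'"), float(n1), float(n2))

s = "| 'TOMATOES_PICKED'                          |       914 |       1397 |"
result = extract_by_whitespace(s)
print(result)  # ('TOMATOES_PICKED', 914.0, 1397.0)
```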

#6


0  

With pyparsing, you can have the parser create a dict-like structure for you, using the first column values as the keys, and the subsequent values as an array of values for that key:

>>> from pyparsing import *
>>> s = "| 'TOMATOES_PICKED'                                  |       914 |       1397 |"
>>> VERT = Suppress('|')
>>> title = quotedString.setParseAction(removeQuotes)
>>> integer = Word(nums).setParseAction(lambda tokens:int(tokens[0]))
>>> entry = Group(VERT + title + VERT + integer + VERT + integer + VERT)
>>> entries = Dict(OneOrMore(entry))
>>> data = entries.parseString(s)
>>> data.keys()
['TOMATOES_PICKED']
>>> data['TOMATOES_PICKED']
([914, 1397], {})
>>> data['TOMATOES_PICKED'].asList()
[914, 1397]
>>> data['TOMATOES_PICKED'][0]
914
>>> data['TOMATOES_PICKED'][1]
1397

This already handles multiple entries, so you can just pass it a single multiline string containing all of your data values, and a single keyed data structure will be built for you. (Processing this kind of pipe-delimited tabular data was one of the earliest applications I had for pyparsing.)
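
For comparison, if pyparsing is not installed, a similar keyed structure can be approximated with the standard library alone. This sketch uses the re module, not pyparsing's API; the names ROW and parse_table are hypothetical, and it assumes integer values and one row per line.

```python
import re

# One table row per line; re.MULTILINE lets ^ and $ match at each line boundary
ROW = re.compile(
    r"^\|\s*'([^']*)'\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*$", re.MULTILINE
)

def parse_table(text):
    """Map each quoted name to its list of integer values."""
    return {name: [int(n1), int(n2)] for name, n1, n2 in ROW.findall(text)}

text = """\
| 'TOMATOES_PICKED'                                  |       914 |       1397 |
| 'TOMATOES_FLICKED'                                 |     32914 |       1123 |
"""
table = parse_table(text)
print(table['TOMATOES_PICKED'])  # [914, 1397]
```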
