How do I extract a substring from a string in Python?

Time: 2020-12-08 19:16:25

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.

I only know the few characters directly before and directly after the part I am interested in: AAA comes right before 1234, and ZZZ right after it.

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

How can I do the same thing in Python?

12 answers

#1


319  

Using regular expressions - documentation for further reference

import re

text = 'gfgfdAAA1234ZZZuijjk'

m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)

# found: 1234

or:

import re

text = 'gfgfdAAA1234ZZZuijjk'

try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling

# found: 1234
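
The two snippets above can be folded into one reusable helper. This is only a minimal sketch: the function name extract_between and its default parameter are made up for illustration and are not part of the re module. It also passes the markers through re.escape, on the assumption that they should be matched literally even if they contain regex metacharacters.

```python
import re

def extract_between(text, start, end, default=None):
    """Return the text between the first `start` and the next `end`.

    Hypothetical helper: returns `default` when the markers are not found.
    """
    # re.escape makes the markers match literally, even if they contain
    # characters such as '.', '(' or '*'.
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    m = re.search(pattern, text)
    return m.group(1) if m else default

print(extract_between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(extract_between('no markers here', 'AAA', 'ZZZ'))       # None
```

Returning a default instead of raising keeps the caller free of try/except; swap in an exception if a missing marker should count as an error in your case.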

#2


79  

>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'

Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.
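
One thing to watch with the find approach: str.find returns -1 when the marker is absent, so s.find('AAA') + 3 becomes 2 and the slice quietly returns unrelated text. A minimal sketch that checks for this, assuming you want None when either marker is missing (slice_between is a hypothetical name, not a built-in):

```python
def slice_between(s, start_marker, end_marker):
    """Return the text between the markers, or None if either is missing."""
    i = s.find(start_marker)
    if i == -1:
        return None  # start marker not present
    start = i + len(start_marker)
    # Search for the end marker only after the start marker.
    end = s.find(end_marker, start)
    if end == -1:
        return None  # end marker not present after the start marker
    return s[start:end]

print(slice_between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(slice_between('hello', 'AAA', 'ZZZ'))                 # None
```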

#3


25  

regular expression

import re

re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)

As written, the above raises an AttributeError if your_text does not contain "AAA" followed by "ZZZ", because re.search returns None in that case.

string methods

your_text.partition("AAA")[2].partition("ZZZ")[0]

The above returns an empty string if "AAA" does not exist in your_text. If only "ZZZ" is missing, it returns everything after "AAA".

PS Python Challenge?
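
For the string-methods variant, str.partition also reports whether the separator was found: the middle element of the returned tuple is empty when it was not. A short sketch that uses this to return an empty string whenever either marker is missing (partition_between is a hypothetical helper name chosen for this example):

```python
def partition_between(text, start, end):
    """Return the text between start and end, or '' if either marker is missing."""
    _, found_start, rest = text.partition(start)
    middle, found_end, _ = rest.partition(end)
    # partition returns an empty separator when the marker was not found.
    return middle if found_start and found_end else ''

print(partition_between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(repr(partition_between('gfgfdAAA1234', 'AAA', 'ZZZ')))    # ''
```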

#4


13  

import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))

#5


6  

You can use the re module for that:

>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)

#6


5  

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

You can do the same with the re.sub function, using the same regex.

>>> import re
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'
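
One difference from search is worth knowing: when the pattern does not match, re.sub returns the input string unchanged rather than signalling a failure. The sketch below shows that, and one possible way to detect it with re.subn, which also returns the number of substitutions made. The sub_extract name and its default parameter are illustrative assumptions.

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

print(re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk'))  # 1234
print(re.sub(pattern, r'\1', 'no markers here'))       # no markers here

def sub_extract(text, default=''):
    """Hypothetical wrapper: return the captured group, or `default` on no match."""
    # re.subn returns a (new_string, number_of_substitutions) tuple.
    new_text, count = re.subn(pattern, r'\1', text)
    return new_text if count else default

print(sub_extract('gfgfdAAA1234ZZZuijjk'))  # 1234
```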

In basic sed, capturing groups are written as \(..\), whereas in Python they are written as (..).

#7


4  

You can use this function in your code to find the first occurrence of a substring (by character index). It can also return whatever comes after the substring.

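def FindSubString(strText, strSubString, Offset=None):
    """Search strText for the first occurrence of strSubString.

    Offset=None: return everything after the match.
    Offset=0:    return the index where the match starts.
    Offset=n:    return the n characters that follow the match.
    Returns -1 if the substring is not found or the arguments are invalid.
    """
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1  # Not found
        if Offset is None:
            return strText[Start + len(strSubString):]
        if Offset == 0:
            return Start
        AfterSubString = Start + len(strSubString)
        return strText[AfterSubString:AfterSubString + int(Offset)]
    except Exception:
        return -1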

# Example:

Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"

print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")

print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")

print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))

# Your answer:

Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"

AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforeText2 = FindSubString(Text, subText2, 0)

print("\nYour answer:\n%s" %(Text[AfterText1:BeforeText2]))

#8


2  

Just in case somebody has to do the same thing I did: I had to extract everything inside the parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:

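import re

line = 'US president (Barack Obama) met with ...'  # example input from the text above
regex = r'.*\((.*?)\).*'  # raw string, so the backslashes reach the regex engine as-is
matches = re.search(regex, line)
line = matches.group(1) + '\n'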

That is, you need to escape the parentheses with a backslash (\). This is more a matter of regular expressions than of Python.

Also, you will sometimes see an r prefix before a regex definition. Without the r prefix (a raw string), you need to escape the backslashes themselves, as in C.

#9


1

>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
'10508'

#10


1  

You can do it in just one line of code:

>>> import re
>>> re.findall(r'\d{1,5}', 'gfgfdAAA1234ZZZuijjk')
['1234']

The result is a list of all matches.

#11


0  

In Python, a substring can be extracted from a string using the findall method of the regular expression (re) module.

>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']
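
Note that (.+) is greedy, which only matters once the markers occur more than once. The sketch below compares it with the non-greedy (.+?) on a made-up input string; the sample string is an assumption for illustration and not from the question.

```python
import re

# Hypothetical input with two AAA...ZZZ pairs.
s = 'xAAA12ZZZyAAA34ZZZz'

# Greedy: .+ runs from the first AAA to the last ZZZ.
greedy = re.findall('AAA(.+)ZZZ', s)
# Non-greedy: .+? stops at the nearest ZZZ, so each pair is found separately.
lazy = re.findall('AAA(.+?)ZZZ', s)

print(greedy)  # ['12ZZZyAAA34']
print(lazy)    # ['12', '34']
```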

#12


0  

One-liners that return a fallback string if there is no match. Edit: the improved version uses the next function; replace "not-found" with something else if needed:

import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )

My other method is less optimal: it runs a second regex search as a fallback. I still haven't found a shorter way:

import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
