I want to create a list from the characters in a string, but keep specific keywords together.
For example:
keywords: car, bus
INPUT:
"xyzcarbusabccar"
OUTPUT:
["x", "y", "z", "car", "bus", "a", "b", "c", "car"]
3 solutions
#1
38
Use re.findall, and put your keywords first in the alternation.
>>> import re
>>> s = "xyzcarbusabccar"
>>> re.findall('car|bus|[a-z]', s)
['x', 'y', 'z', 'car', 'bus', 'a', 'b', 'c', 'car']
If your keywords overlap, note that this solution matches whichever keyword it encounters first in the string:
>>> s = 'abcaratab'
>>> re.findall('car|rat|[a-z]', s)
['a', 'b', 'car', 'a', 't', 'a', 'b']
You can make the solution more general by replacing the [a-z] part with whatever you like, \w for example, or a simple . to match any character.
A short explanation of why this works and why the regex '[a-z]|car|bus' would not: the regular expression engine tries the alternatives from left to right and is "eager" to return a match. That means it considers the whole alternation matched as soon as one of the options has fully matched. At that point it does not try any of the remaining options; it stops and reports the match immediately. With '[a-z]|car|bus', the engine reports a match as soon as it sees any character in the character class [a-z] and never goes on to check whether 'car' or 'bus' could also match.
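The effect of the ordering can be checked directly. The following is a small sketch using the same input string as above; the variable names `keywords_first` and `class_first` are only illustrative.

```python
import re

s = "xyzcarbusabccar"

# Keywords listed before the character class: each keyword is tried first.
keywords_first = re.findall(r"car|bus|[a-z]", s)

# Character class listed first: it always succeeds on a single letter,
# so the keyword alternatives are never reached.
class_first = re.findall(r"[a-z]|car|bus", s)

print(keywords_first)  # ['x', 'y', 'z', 'car', 'bus', 'a', 'b', 'c', 'car']
print(class_first)     # every letter on its own, same as list(s)
```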
#2
16
import re
s = "xyzcarbusabccar"
print(re.findall(r"bus|car|\w", s))
['x', 'y', 'z', 'car', 'bus', 'a', 'b', 'c', 'car']
Or maybe \S for any non-whitespace characters:
import re
s = "xyzcarbusabccar!"
print(re.findall(r"bus|car|\S", s))
['x', 'y', 'z', 'car', 'bus', 'a', 'b', 'c', 'car', '!']
Just make sure you get the order right: put longer words first if you want the longest matches.
In [7]: s = "xyzcarsbusabccar!"
In [8]: re.findall("bus|car|cars|\S", s)
Out[8]: ['x', 'y', 'z', 'car', 's', 'bus', 'a', 'b', 'c', 'car', '!']
In [9]: re.findall("bus|cars|car|\S", s)
Out[9]: ['x', 'y', 'z', 'cars', 'bus', 'a', 'b', 'c', 'car', '!']
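When the keyword list is not fixed, the pattern can be built from it instead of being typed by hand. This is only a sketch: the helper name `split_keep_keywords` is made up for illustration, sorting by length is one way to get the "longer words first" order, and `re.escape` is used on the assumption that keywords should be matched literally even if they contain regex metacharacters.

```python
import re

def split_keep_keywords(text, keywords):
    """Split text into single characters, keeping the given keywords whole."""
    # Longest keywords first, so that "cars" is tried before "car".
    ordered = sorted(keywords, key=len, reverse=True)
    # Escape each keyword so characters like "+" or "." are taken literally,
    # then fall back to "." for any other single character.
    alternatives = [re.escape(k) for k in ordered if k] + ["."]
    pattern = "|".join(alternatives)
    # DOTALL lets the "." fallback match newlines as well.
    return re.findall(pattern, text, flags=re.DOTALL)

print(split_keep_keywords("xyzcarsbusabccar!", ["bus", "car", "cars"]))
# ['x', 'y', 'z', 'cars', 'bus', 'a', 'b', 'c', 'car', '!']
```

With this ordering the caller does not have to remember to list "cars" before "car".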
#3
0
The above solutions are really great, but if the keyword dictionary is long, the pattern can easily become messy (maybe even unimplementable).
I propose storing the keywords in a tree, which exploits the redundancy of shared prefixes and is more space efficient.
If the keywords are ["art","at","atm","bus","can","car"], the dictionary looks like this:
                  .
           /      |      \
        a         b         c
       / \        |         |
      r   t       u         a
      |  / \      |        / \
      t m   /0    s       n   r
      |  |        |       |   |
     /0 /0       /0      /0  /0
I made it binary since it was easier to draw. The node "/0" marks the end of a word (a virtual character) and "." is the root.
I implemented this simple Tree class, with the functions needed to build the tree:
import sys

class Tree(object):
    def __init__(self, name='root', children=None):
        self.name = name
        self.children = {}  # maps a child's name to the child node
        if children is not None:
            for child in children:
                self.add_child(child)  # add_child takes only the node

    def __repr__(self):
        return self.name

    def add_child(self, node):
        assert isinstance(node, Tree)
        self.children[node.name] = node

    def has_child(self, name):
        return name in self.children

    def get_child(self, name):
        return self.children[name]

    def print_tree(self, level=0):
        # Print this node prefixed by one dash per level, then its children
        sys.stdout.write('-' * level)
        print(self.name)
        for childTag in self.children:
            self.children[childTag].print_tree(level + 1)
Given the keywords, we can construct the structure using code like this:
keywords = ["car", "at", "atm", "bus"]
keywordsTree = Tree('')
for keyword in keywords:
    keywordsTreeNode = keywordsTree
    for character in keyword:
        if not keywordsTreeNode.has_child(character):
            keywordsTreeNode.add_child(Tree(character))
        keywordsTreeNode = keywordsTreeNode.get_child(character)
    # Mark the end of the keyword with the virtual "/0" node
    keywordsTreeNode.add_child(Tree('/0'))
Finally, we search the input for keywords. For each position in the input, the solution below reports all of the keywords that match starting from that position.
inputWords = "xyzcarbusabccar8hj/0atm"
output = []
lengthInput = len(inputWords)
for position in range(0, lengthInput):
    # Collect every keyword that starts at this position
    allMathcedKeyWords = []
    keywordsTreeNode = keywordsTree
    searchPosition = position
    curMathcedWord = ''
    while searchPosition < lengthInput and keywordsTreeNode.has_child(inputWords[searchPosition]):
        keywordsTreeNode = keywordsTreeNode.get_child(inputWords[searchPosition])
        curMathcedWord = curMathcedWord + inputWords[searchPosition]
        if keywordsTreeNode.has_child("/0"):
            allMathcedKeyWords.append(curMathcedWord)
        searchPosition += 1
    # No keyword starts here: fall back to the single character
    if len(allMathcedKeyWords) == 0:
        allMathcedKeyWords = inputWords[position]
    output.append(allMathcedKeyWords)
print(output)
This code outputs:
['x', 'y', 'z',
['car'],
'a', 'r',
['bus'],
'u', 's', 'a', 'b', 'c',
['car'],
'a', 'r', '8', 'h', 'j', '/', '0',
['at', 'atm'],
't', 'm']
Important for the code above is the fact that the virtual character at the end of words is two characters long ("/0") and will never be matched, even if that combination appears in the input sequence, as it does in the example above. Furthermore, it handles any string character, both in the input and in the keywords, and there is no need to introduce escape characters as with re.findall().
From this output list you can decide what you want to do. If you want the same result as the re.findall solution, find the longest matched word for a position (or pick one based on the keywords' logical order) and jump ahead by the number of characters that word contains.
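The "longest match, then jump ahead" step could look roughly like the sketch below. It is not the Tree class from this answer: to stay self-contained it assumes a plain nested-dict trie, and the names `build_trie`, `tokenize` and `END` are made up for illustration.

```python
END = "/0"  # end-of-word marker: two characters, so it never equals one input character

def build_trie(keywords):
    """Build a nested-dict trie; a node containing END closes a keyword."""
    root = {}
    for keyword in keywords:
        node = root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root

def tokenize(text, trie):
    """Emit the longest keyword at each position, else a single character."""
    tokens = []
    pos = 0
    while pos < len(text):
        node = trie
        longest = 0  # length of the longest keyword starting at pos
        i = pos
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if END in node:
                longest = i - pos
        step = longest if longest else 1
        tokens.append(text[pos:pos + step])
        pos += step  # jump ahead past the emitted token
    return tokens

trie = build_trie(["car", "at", "atm", "bus"])
print(tokenize("xyzcarbusabccar8hj/0atm", trie))
# ['x', 'y', 'z', 'car', 'bus', 'a', 'b', 'c', 'car', '8', 'h', 'j', '/', '0', 'atm']
```

Because the longest keyword wins at each position, the order in which the keywords are supplied does not matter here.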
Taking the problem even further, treat every character in the input as a vertex, and when you find a word, add an edge from that position to the vertex just after the last character of the matched word. A shortest path algorithm will then give you the solution above again. Structuring the output like this is again space efficient and opens the door to more complex algorithms.
For example, with the keywords "car" and "art" and the input sequence "acart", the resulting graph looks like this:
             ______________
            |              |
- a -> c -> a -> r -> t ->
       |______________|
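A rough sketch of the graph idea follows. It makes a few assumptions that go beyond the description above: vertices are numbered by input position (with one extra vertex for the end of the input), every edge has the same cost, and a breadth-first search stands in for the shortest path algorithm. The function name `segment_shortest_path` is hypothetical.

```python
from collections import deque

def segment_shortest_path(text, keywords):
    """Split text using the fewest tokens (keywords or single characters)."""
    n = len(text)
    # Vertex i sits just before character i; vertex n is the end of the input.
    edges = {i: [i + 1] for i in range(n)}  # single-character edges
    for i in range(n):
        for kw in keywords:
            if len(kw) > 1 and text.startswith(kw, i):
                edges[i].append(i + len(kw))  # keyword edge skips the whole word

    # Breadth-first search from vertex 0; all edges cost the same.
    prev = {0: None}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        if v == n:
            break
        for w in edges[v]:
            if w not in prev:
                prev[w] = v
                queue.append(w)

    # Walk back from the end vertex to recover the cut positions.
    cuts = []
    v = n
    while v is not None:
        cuts.append(v)
        v = prev[v]
    cuts.reverse()
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]

print(segment_shortest_path("xyzcarbusabccar", ["car", "bus"]))
# ['x', 'y', 'z', 'car', 'bus', 'a', 'b', 'c', 'car']
```

For overlapping keywords such as "car" and "art" in "acart", two paths have the same length, so which split is returned depends on the search order.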
Complexity analysis
Space : longest_word_length * number_of_letters_in_keywords
        input_length + input_length * input_length (worst case: fully connected graph)
Time  : input_length * longest_word_length
        input_length + input_length * input_length (worst case: fully connected graph)