I have a list of strings like this:
['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
Given a keyword list like ['for', 'or', 'and'], I want to parse the list into another list in which any string containing one of the keywords is split at that keyword into multiple parts.
For example, the list above would be split into:
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
Currently I split each string on underscores, use a for loop to find the index of a keyword, and then rejoin the pieces with underscores. Is there a quicker way to do this?
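For reference, the approach described above might look roughly like the sketch below. The function name split_on_keywords_manual and the variable names are hypothetical; the original code was not posted. Note that this version also drops a keyword that appears as the first or last token of a string.

```python
def split_on_keywords_manual(strings, keywords):
    """Split on underscores, cut at keyword tokens, rejoin the rest."""
    result = []
    for s in strings:
        current = []  # tokens collected since the last keyword
        for token in s.split('_'):
            if token in keywords:
                # A keyword ends the current piece.
                if current:
                    result.append('_'.join(current))
                current = []
            else:
                current.append(token)
        if current:
            result.append('_'.join(current))
    return result


data = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana',
        'sad_pandas_and_happy_cats_for_people']
print(split_on_keywords_manual(data, ['for', 'or', 'and']))
```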
4 solutions
#1
6
>>> import itertools, re
>>> pat = re.compile("_(?:%s)_" % "|".join(sorted(split_list, key=len)))
>>> list(itertools.chain.from_iterable(pat.split(line) for line in data))
This gives the desired output for the example dataset provided (here split_list is the keyword list and data is the list of strings).
Actually, with the _ delimiters you don't need to sort the keywords by length, so you could just do:
>>> pat = re.compile("_(?:%s)_" % "|".join(split_list))
>>> list(itertools.chain.from_iterable(pat.split(line) for line in data))
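As a quick check of the point about ordering, the sketch below builds the pattern from every permutation of the keywords and compares the results. The names keywords, data and split_all are illustrative, not from the answer. Because each alternative must be followed by a literal _, the regex engine falls through to the keyword that actually fits, whatever the order:

```python
import re
from itertools import chain, permutations

keywords = ['for', 'or', 'and']
data = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana',
        'sad_pandas_and_happy_cats_for_people']


def split_all(words):
    # Build the alternation in exactly the order given.
    pat = re.compile("_(?:%s)_" % "|".join(words))
    return list(chain.from_iterable(pat.split(line) for line in data))


# Collect the distinct outputs across all 6 keyword orderings.
results = {tuple(split_all(order)) for order in permutations(keywords)}
print(len(results))  # 1: every ordering produces the same split
```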
#2
6
>>> [re.split(r"_(?:f?or|and)_", s) for s in l]
[['happy_feet'],
['happy_hats', 'cats'],
['sad_fox', 'mad_banana'],
['sad_pandas', 'happy_cats', 'people']]
To combine them into a single list, you can use
result = []
for s in l:
    result.extend(re.split(r"_(?:f?or|and)_", s))
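The same flattening can also be written as a single nested list comprehension. This is just an alternative spelling of the loop above, using the question's data for l:

```python
import re

l = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana',
     'sad_pandas_and_happy_cats_for_people']

# Outer loop over the strings, inner loop over the pieces of each split.
result = [part for s in l for part in re.split(r"_(?:f?or|and)_", s)]
print(result)
# ['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
```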
#3
6
You could use a regular expression:
from itertools import chain
import re
pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
result = list(chain.from_iterable(pattern.split(w) for w in input_list))
The pattern is built dynamically from your list of keywords. The string 'happy_hats_for_cats' is split on '_for_':
>>> re.split(r'_for_', 'happy_hats_for_cats')
['happy_hats', 'cats']
but because we actually produced a set of alternatives (using the | metacharacter), you get to split on any of the keywords:
>>> re.split(r'_(?:for|or|and)_', 'sad_pandas_and_happy_cats_for_people')
['sad_pandas', 'happy_cats', 'people']
Each split result gives you a list of strings (just one element if there was nothing to split on); using itertools.chain.from_iterable() lets us treat all those lists as one long iterable.
Demo:
>>> from itertools import chain
>>> import re
>>> keywords = ['for', 'or', 'and']
>>> input_list = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
>>> pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
>>> list(chain.from_iterable(pattern.split(w) for w in input_list))
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
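If you need this in more than one place, the same idea can be wrapped in a small helper. The sketch below uses a hypothetical function name, split_on_keywords, and a made-up keyword '+' to show why re.escape() matters: without it, a keyword containing a regex metacharacter would break or change the pattern.

```python
from itertools import chain
import re


def split_on_keywords(strings, keywords):
    """Split each string on '_<keyword>_' and flatten the pieces into one list."""
    pattern = re.compile(
        r'_(?:{})_'.format('|'.join(re.escape(k) for k in keywords)))
    return list(chain.from_iterable(pattern.split(s) for s in strings))


# '+' is escaped, so it is matched literally rather than as a quantifier.
print(split_on_keywords(['one_+_two_or_three'], ['+', 'or']))
# ['one', 'two', 'three']

# Without escaping, the same keyword list does not even compile.
try:
    re.compile(r'_(?:{})_'.format('|'.join(['+', 'or'])))
except re.error as exc:
    print('unescaped pattern failed:', exc)
```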
#4
2
Another way of doing this, using only built-in methods, is to replace every occurrence of the keywords in ['for', 'or', 'and'] in each string with a replacement string, say _1_ (it could be any string that does not otherwise occur in the data), and then, at the end of each iteration, split on this replacement string:
l = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
replacement_s = '_1_'
lookup = ['for', 'or', 'and']
lookup = [x.join('_'*2) for x in lookup]  # Changing to: ['_for_', '_or_', '_and_']
results = []
for i, item in enumerate(l):
    for s in lookup:
        if s in item:
            # Note: this overwrites the entries of l in place
            l[i] = l[i].replace(s, replacement_s)
    results.extend(l[i].split(replacement_s))
print(results)
OUTPUT:
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
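A variant of the same replace-then-split idea that leaves the input list untouched could look like the sketch below. The function name split_by_replacement and the default marker '\0' are assumptions: the marker only has to be a string that never appears in the data.

```python
def split_by_replacement(strings, keywords, marker='\0'):
    """Replace each '_<keyword>_' with a marker, then split on the marker."""
    delimiters = ['_' + k + '_' for k in keywords]
    results = []
    for item in strings:
        # Strings are immutable, so rebinding item does not alter the input list.
        for d in delimiters:
            item = item.replace(d, marker)
        results.extend(item.split(marker))
    return results


l = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana',
     'sad_pandas_and_happy_cats_for_people']
print(split_by_replacement(l, ['for', 'or', 'and']))
print(l[1])  # still 'happy_hats_for_cats'
```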