I have a file; a small fragment of it is shown below:
Clutch001
Albino X Pastel
Bumble Bee X Albino Lesser
Clutch002
Bee X Fire Bee
Albino Cinnamon X Albino
Mojave X Bumble Bee
Clutch003
Black Pastel X Banana Ghost Lesser
....
The number of lines between one ClutchXXX and the next ClutchXXX may vary, but it is never zero. Is it possible to take a specific string from the file and use it as a key (in my case, ClutchXXX), with the text up to the next occurrence of that string as the value in a dictionary? I want to end up with a dictionary like this:
d = {'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
     'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
     'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
I am mostly interested in the part where we match a string pattern and save it as a key, with the text after it as the value. Any suggestions or pointers to a useful approach would be appreciated.
7 solutions
#1
4
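import re
from itertools import groupby
from functools import partial
from pprint import pprint

# key returns a match object for "Clutch" header lines and None for all other lines
key = partial(re.match, r'Clutch\d\d\d')
with open('foo.txt') as f:
    # groups alternates: header, joined data lines, header, joined data lines, ...
    groups = (', '.join(map(str.strip, g)) for k, g in groupby(f, key=key))
    pprint(dict(zip(*[iter(groups)]*2)))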
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
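The groupby answer above is compact, so here is a more explicit sketch of the same idea. It is only an illustration: the function name `clutches_from_lines` and the in-memory sample are assumptions, not part of the original answer. It relies on `itertools.groupby` splitting the input into alternating runs of header lines and data lines.

```python
import re
from itertools import groupby


def clutches_from_lines(lines):
    """Build {header: 'a, b, c'} from an iterable of lines (hypothetical helper)."""
    def is_header(line):
        return bool(re.match(r'Clutch\d{3}', line))

    result = {}
    current = None
    # groupby yields consecutive runs of lines that share the same key value
    for header_run, group in groupby(lines, key=is_header):
        stripped = [line.strip() for line in group]
        if header_run:
            current = stripped[-1]  # if headers repeat back to back, keep the last one
        elif current is not None:
            result[current] = ', '.join(stripped)
    return result


sample = [
    "Clutch001\n",
    "Albino X Pastel\n",
    "Bumble Bee X Albino Lesser\n",
    "Clutch002\n",
    "Bee X Fire Bee\n",
]
print(clutches_from_lines(sample))
```

Unlike the `zip(*[iter(groups)]*2)` pairing, this version does not depend on headers and data runs alternating strictly, at the cost of a few more lines.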
#2
3
Collect the lines in lists, storing each list in the dictionary at the same time:
d = {}
values = None
with open(filename) as inputfile:
    for line in inputfile:
        line = line.strip()
        if line.startswith('Clutch'):
            # start a new list and register it under the header in one step
            values = d[line] = []
        else:
            values.append(line)
This gives you:
{'Clutch001': ['Albino X Pastel', 'Bumble Bee X Albino Lesser'],
 'Clutch002': ['Bee X Fire Bee', 'Albino Cinnamon X Albino', 'Mojave X Bumble Bee'],
 'Clutch003': ['Black Pastel X Banana Ghost Lesser']}
It's easy enough to turn all those lists into single strings though, after loading the file:
d = {key: ', '.join(value) for key, value in d.items()}
You can also do the joining as you read the file; I'd use a generator function to process the file in groups:
def per_clutch(inputfile):
    clutch = None
    lines = []
    for line in inputfile:
        line = line.strip()
        if line.startswith('Clutch'):
            if lines:
                yield clutch, lines
            clutch, lines = line, []
        else:
            lines.append(line)
    if clutch and lines:
        yield clutch, lines
Then just slurp all groups into a dictionary:
with open(filename) as inputfile:
    d = {clutch: ', '.join(lines) for clutch, lines in per_clutch(inputfile)}
Demo of the latter:
>>> def per_clutch(inputfile):
...     clutch = None
...     lines = []
...     for line in inputfile:
...         line = line.strip()
...         if line.startswith('Clutch'):
...             if lines:
...                 yield clutch, lines
...             clutch, lines = line, []
...         else:
...             lines.append(line)
...     if clutch and lines:
...         yield clutch, lines
...
>>> sample = '''\
... Clutch001
... Albino X Pastel
... Bumble Bee X Albino Lesser
... Clutch002
... Bee X Fire Bee
... Albino Cinnamon X Albino
... Mojave X Bumble Bee
... Clutch003
... Black Pastel X Banana Ghost Lesser
... '''.splitlines(True)
>>> {clutch: ', '.join(lines) for clutch, lines in per_clutch(sample)}
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
>>> from pprint import pprint
>>> pprint(_)
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
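The list-collecting loop in this answer assumes the file starts with a "Clutch" line; otherwise `values` is still `None` when the first data line arrives. A minimal sketch of one way to make that assumption explicit follows. The function name `read_clutches`, the `prefix` parameter, and the choice to raise `ValueError` are all illustrative assumptions, not part of the original answer.

```python
def read_clutches(lines, prefix='Clutch'):
    """Collect lines under their header; fail clearly on data before any header."""
    d = {}
    values = None
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(prefix):
            values = d[line] = []
        elif values is None:
            raise ValueError(
                'line %d: data before the first %r header' % (number, prefix)
            )
        else:
            values.append(line)
    return {key: ', '.join(value) for key, value in d.items()}


print(read_clutches(["Clutch001\n", "Albino X Pastel\n", "Bumble Bee X Albino Lesser\n"]))
```

Any iterable of lines works here, so an open file object can be passed in place of the list.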
#3
2
As noted in comments, if "Clutch" (or whatever keyword) can be relied on not to appear in the non-keyword lines, you could use the following:
keyword = "Clutch"
with open(filename) as inputfile:
    t = inputfile.read()
# split() returns an empty first chunk when the file starts with the keyword, so skip blank chunks
d = {keyword + s[:3]: s[3:].strip().replace('\n', ', ') for s in t.split(keyword) if s.strip()}
This reads the whole file into memory at once, so it should be avoided if your file may get very large.
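One detail of `str.split` matters for this approach: when the text begins with the keyword, the first element of the result is an empty string, which would otherwise become a stray 'Clutch' key. The sketch below shows this on a small in-memory sample (the sample text is an assumption for illustration) and filters the empty chunk out.

```python
keyword = "Clutch"
# Hypothetical sample standing in for the file contents
text = (
    "Clutch001\n"
    "Albino X Pastel\n"
    "Bumble Bee X Albino Lesser\n"
    "Clutch002\n"
    "Bee X Fire Bee\n"
)

chunks = text.split(keyword)
# chunks[0] is '' because nothing precedes the first keyword
d = {
    keyword + s[:3]: s[3:].strip().replace('\n', ', ')
    for s in chunks
    if s.strip()  # skip the empty leading chunk
}
print(d)
```

The slice `s[:3]` hard-codes three-digit clutch numbers, as the answer does; a different numbering scheme would need a different way of separating the key from the body.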
#4
2
You could use re.split() to enumerate the "Clutch" parts in the file:
import re

with open(filename) as file:  # filename: path to the input file
    text = file.read()
tokens = iter(re.split(r'(^Clutch\d{3}\s*$)\s+', text, flags=re.M))
next(tokens)  # skip until the first Clutch
print({k: ', '.join(v.splitlines()) for k, v in zip(tokens, tokens)})
Output
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
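To see why this works, it helps to look at what `re.split` returns. Because the pattern contains a capturing group, the matched headers are kept in the result list, alternating with the text between them. The following sketch uses a small in-memory sample (an assumption for illustration) rather than a file.

```python
import re

# Hypothetical sample; a real script would read this from the file
text = (
    "Clutch001\n"
    "Albino X Pastel\n"
    "Clutch002\n"
    "Bee X Fire Bee\n"
    "Mojave X Bumble Bee\n"
)

# The parentheses make re.split keep each matched header in the result
parts = re.split(r'(^Clutch\d{3}\s*$)\s+', text, flags=re.M)

tokens = iter(parts)
next(tokens)  # discard whatever precedes the first header (here an empty string)
# zip pulls from the same iterator twice per step: (header, body), (header, body), ...
d = {k: ', '.join(v.splitlines()) for k, v in zip(tokens, tokens)}
print(d)
```

Passing the same iterator to `zip` twice is what pairs each header with the body that follows it.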
#5
2
Suppose the file 'file.txt' contains:
Clutch001
Albino X Pastel
Bumble Bee X Albino Lesser
Clutch002
Bee X Fire Bee
Albino Cinnamon X Albino
Mojave X Bumble Bee
Clutch003
Black Pastel X Banana Ghost Lesser
To build your dictionary, try this:
import re

with open('file.txt', 'r') as f:
    result = re.split(
        r'(Clutch\d{3}).*?',
        f.read(),
        flags=re.DOTALL  # including '\n'
    )[1:]
# result is ['Clutch001', '\nAlbino X Pastel\nBumble Bee X Albino Lesser\n', 'Clutch002', '\nBee X Fire Bee\nAlbino Cinnamon X Albino\nMojave X Bumble Bee\n', 'Clutch003', '\nBlack Pastel X Banana Ghost Lesser\n']
keys = result[::2]  # keys is ['Clutch001', 'Clutch002', 'Clutch003']
values = result[1::2]  # values is ['\nAlbino X Pastel\nBumble Bee X Albino Lesser\n', '\nBee X Fire Bee\nAlbino Cinnamon X Albino\nMojave X Bumble Bee\n', '\nBlack Pastel X Banana Ghost Lesser\n']
# a list comprehension (rather than map) so that values really is a list on Python 3
values = [value.strip().replace('\n', ', ') for value in values]
# values is ['Albino X Pastel, Bumble Bee X Albino Lesser', 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Black Pastel X Banana Ghost Lesser']
d = dict(zip(keys, values))  # d is {'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
#6
1
Here's a version that works, more or less. I'm not sure how Pythonic it is (it can probably be condensed and can definitely be improved):
import re
import fileinput

d = dict()
key = ''
rx = re.compile(r'^Clutch\d\d\d$')
for line in fileinput.input():
    line = line[0:-1]  # drop the trailing newline
    if rx.match(line):
        key = line
        d[key] = ''
    else:
        d[key] += line
print(d)
for key in d:
    print(key, d[key])
The output (which repeats the information) is:
{'Clutch001': 'Albino X PastelBumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire BeeAlbino Cinnamon X AlbinoMojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
Clutch001 Albino X PastelBumble Bee X Albino Lesser
Clutch002 Bee X Fire BeeAlbino Cinnamon X AlbinoMojave X Bumble Bee
Clutch003 Black Pastel X Banana Ghost Lesser
If for some reason the first line isn't a 'clutch' line, you get a KeyError because the initial empty key is not in the dictionary.
Joining with commas, dealing with broken text files (no newline at the end), etc.:
import fileinput

d = {}
for line in fileinput.input():
    line = line.rstrip('\r\n')  # line.strip() for leading and trailing space
    if line.startswith('Clutch'):
        key = line
        d[key] = ''
        pad = ''  # no separator before the first value
    else:
        d[key] += pad + line
        pad = ', '  # every later value is preceded by a comma
print(d)
for key in d:
    print("'%s': '%s'" % (key, d[key]))
The 'pad' technique is one I like in other contexts, and it works fine here. I'm tolerably certain it wouldn't be regarded as Pythonic, though.
Revised sample output:
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser'
'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee'
'Clutch003': 'Black Pastel X Banana Ghost Lesser'
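For comparison, here is the 'pad' technique in isolation next to `str.join`, which is what the other answers use. The helper name `join_with_pad` is hypothetical and exists only for this illustration.

```python
def join_with_pad(items, sep=', '):
    """Concatenate items, inserting sep before every item except the first."""
    out = ''
    pad = ''  # empty for the first item
    for item in items:
        out += pad + item
        pad = sep  # from now on, prefix each item with the separator
    return out


pairings = ['Bee X Fire Bee', 'Albino Cinnamon X Albino', 'Mojave X Bumble Bee']
print(join_with_pad(pairings))
print(', '.join(pairings))  # same result when all items are available up front
```

The pad approach is useful when items arrive one at a time and the result is built incrementally; when the items are already in a list, `str.join` is shorter.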
#7
1
Assuming the word Clutch occurs independently on its own line, the following will work:
import re

d = {}
with open(filename) as f:
    for line in f:
        line = line.strip()  # drop the newline so it doesn't end up in keys or values
        if re.match("^Clutch[0-9]+", line):
            match = line  # match is the key searched for
            d[match] = ''
        elif d[match]:
            d[match] += ', ' + line  # all lines without the word 'Clutch'
                                     # are added to the matched key
        else:
            d[match] = line  # first line after the key: no separator yet
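The assumption that a header sits alone on its line can be checked rather than trusted. A minimal sketch, assuming headers look like "Clutch" followed only by digits; the names `HEADER` and `is_header` are hypothetical. It uses `re.fullmatch`, which succeeds only if the whole string matches the pattern.

```python
import re

# Assumed header shape: the word Clutch followed by one or more digits, nothing else
HEADER = re.compile(r'Clutch[0-9]+')


def is_header(line):
    """Return True only when the stripped line consists entirely of a header."""
    return HEADER.fullmatch(line.strip()) is not None


print(is_header('Clutch001\n'))
print(is_header('Clutch001 X Pastel\n'))
```

With `re.match` alone, a data line that merely starts with "Clutch001" would also be treated as a header.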