I have two files and I am trying to extract some values from file 1, like this:
File 1:
2 word1
4 word2
4 word2_1
4 word2_2
8 word5
8 word5_3
File 2:
4
8
What I want is to extract every line in file 1 that starts with 4 or 8 (the values listed in file 2), and there are a lot of them. Usually, if only one line matched each value, I would use a Python dictionary: one key, one element, easy! But now that several elements match the same key, my script only extracts the last one (obviously, since each assignment overwrites the previous one). I get that this is not how it should work, but I have no idea what to do instead, and I would be very happy if someone could help me get started.
Here is my "usual" code:
gene_count = {}
my_file = open('file1.txt')
for line in my_file:
    columns = line.strip().split()
    gene = columns[0]
    count = columns[1:13]
    gene_count[gene] = count

names_file = open('file2.txt')
output_file = open('output.txt', 'w')
for line in names_file:
    gene = line.strip()
    count = gene_count[gene]
    output_file.write('{0}\t{1}\n'.format(gene,"\t".join(count)))
output_file.close()
2 solutions
#1
1
Make the values of your dictionary lists, and append to them.
In general:
from collections import defaultdict

my_dict = defaultdict(list)  # a missing key starts out as an empty list
for x in range(101):
    if x % 2 == 0:
        my_dict['evens'].append(str(x))
    else:
        my_dict['odds'].append(str(x))
print('evens:', ' '.join(my_dict['evens']))
print('odds:', ' '.join(my_dict['odds']))
In your case the values are already lists, so add (concatenate) each new list to the list stored in your dictionary:
from collections import defaultdict

gene_count = defaultdict(list)
my_file = open('file1.txt')
for line in my_file:
    columns = line.strip().split()
    gene = columns[0]
    count = columns[1:13]
    gene_count[gene] += count  # extend the existing list instead of replacing it

names_file = open('file2.txt')
output_file = open('output.txt', 'w')
for line in names_file:
    gene = line.strip()
    count = gene_count[gene]
    output_file.write('{0}\t{1}\n'.format(gene,"\t".join(count)))
output_file.close()
If what you actually want to print is the number of values collected for each gene, then replace "\t".join(count) with len(count), the length of the list.
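If you would rather keep each matching line on its own output line, as in the question's example, instead of joining everything for a key into one row, a minimal sketch could look like this. The function names group_lines_by_key and extract_matching are made up for illustration, and the sample data is hard-coded from the question rather than read from file1.txt and file2.txt.

```python
from collections import defaultdict


def group_lines_by_key(lines):
    """Map the first column of each line to the list of full lines sharing it."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key = line.split()[0]
        groups[key].append(line)
    return groups


def extract_matching(data_lines, key_lines):
    """Return every data line whose first column appears in key_lines."""
    groups = group_lines_by_key(data_lines)
    result = []
    for key in key_lines:
        key = key.strip()
        # .get avoids creating empty entries for keys that file 1 never had
        result.extend(groups.get(key, []))
    return result


# Sample data copied from the question (hypothetical stand-ins for the files)
file1_lines = ["2 word1", "4 word2", "4 word2_1", "4 word2_2", "8 word5", "8 word5_3"]
file2_lines = ["4", "8"]

extracted = extract_matching(file1_lines, file2_lines)
print("\n".join(extracted))
```

With real files you could pass the open file objects directly, since both functions only iterate over lines.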
#2
1
Have you considered using pandas? You can load the files into DataFrames and then filter them (the session below appears to assume pandas was imported under the alias pn):
In [5]: file1 = pn.read_csv('file1',sep=' ',
names=['number','word'],
engine='python')
In [6]: file1
Out[6]:
number word
0 2 word1
1 4 word2
2 4 word2_1
3 4 word2_2
4 8 word5
5 8 word5_3
In [9]: file1[(file1.number==4) | (file1.number==8)]
Out[9]:
number word
1 4 word2
2 4 word2_1
3 4 word2_2
4 8 word5
5 8 word5_3
In [13]: foo = file1[(file1.number==4) | (file1.number==8)].append(file2[(file2.number==4) | (file2.number==8)])
Out[13]:
number word
1 4 word2
2 4 word2_1
3 4 word2_2
4 8 word5
5 8 word5_3
1 4 word2
2 4 word2_1
3 4 word2_2
4 8 word5
5 8 word5_3
In 5 you read the file, in 9 you filter it by the values in the number column, and in 13 you join two filtered frames together. You can then sort the result and do your computation much more easily than with a dictionary. Note that DataFrame.append was removed in pandas 2.0; pd.concat is the current way to join frames.
UPDATE
To filter a pandas.DataFrame on the condition that a column value is in some list, you can use isin, giving it a list or a range, for example.
In [46]: file1[file1.number.isin([1,2,3])]
Out[46]:
number word
0 2 word1
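Applied to the question, the values passed to isin can come straight from file 2. The following is only a sketch under a few assumptions: pandas is imported as pd (not pn as in the session above), the two files are simulated with io.StringIO so the snippet is self-contained, and the column names number and word are the ones chosen earlier in this answer.

```python
import io

import pandas as pd

# Stand-ins for file1 and file2, copied from the question
FILE1_TEXT = "2 word1\n4 word2\n4 word2_1\n4 word2_2\n8 word5\n8 word5_3\n"
FILE2_TEXT = "4\n8\n"

# Passing names= without header= tells read_csv the data has no header row
file1 = pd.read_csv(io.StringIO(FILE1_TEXT), sep=" ", names=["number", "word"])
file2 = pd.read_csv(io.StringIO(FILE2_TEXT), names=["number"])

# Keep only the rows of file1 whose number also appears in file2
matches = file1[file1["number"].isin(file2["number"])]
print(matches.to_string(index=False))
```

This keeps the rows in their file 1 order and avoids writing one == comparison per value.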