Hi, I have two CSV files as input, for example:
file1 :
AK163828 chr5 s1 + e1 cttt 4
AK163828 chr5 s2 + e2 gtca 4
AK168688 chr6 s3 + e3 ggcg 4
AK168688 chr6 s4 + e4 tctg 4
file2 :
chr6s3+e3 ggcg
chr5s1+e1 cttt
chr6s4+e4 tata
chr5s2+e2 ggcg
As you can see, file2 is in random order.
I want to compare column 1 of file2 with columns 2, 3, 4 and 5 of file1 merged together, and at the same time compare column 2 of file2 with column 6 of file1, and select only the matching lines.
The desired output is
chr6s3+e3 ggcg
chr5s1+e1 cttt
I tried to use this code:
import csv
reader1 = csv.reader(open(file1), dialect='excel-tab' )
reader2 = csv.reader(open(file2), dialect='excel-tab' )
for row1, row2 in zip(reader1,reader2):
    F1 = row1[1] + row1[2] + row1[3] + row1[4] + '\t' row1[5]
    F2 = row2[0] + '\t' + row2[1]
    print set(F1) & set(F2)
But it doesn't work. Can you help me fix my code or suggest another way to do it? Thanks for your help!
2 Solutions
#1
3
Quick and dirty:
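import csv
file1 = 'C:/Users/Me/Desktop/file1'
file2 = 'C:/Users/Me/Desktop/file2'
reader1 = csv.reader(open(file1))
reader2 = csv.reader(open(file2))
# Each line is read as a single field, so split it on whitespace.
# file1: join columns 2-6; file2: join both columns.
F1 = set(''.join(row1[0].strip().split()[1:6]) for row1 in reader1)
F2 = set(''.join(row2[0].strip().split()) for row2 in reader2)
for sequence in F1.intersection(F2):
    # The last four characters are the sequence column (assumes it is always 4 long).
    print(sequence[:-4] + '\t' + sequence[-4:])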
Output:
chr6s3+e3 ggcg
chr5s1+e1 cttt
#2
1
How about this:
import csv
reader1 = csv.reader(open('file1.tsv'), dialect='excel-tab')
reader2 = csv.reader(open('file2.tsv'), dialect='excel-tab')
# Collect (merged columns 2-5, column 6) pairs from file1.
keys = set()
for row in reader1:
    keys.add((''.join(row[1:5]), row[5]))
# Print the file2 rows whose two columns match one of those pairs.
for row in reader2:
    if (row[0], row[1]) in keys:
        print('\t'.join(row))
By the way: the format you're using (dialect='excel-tab') is usually called TSV rather than CSV, although it is a variant of CSV. You also have to make sure your values are separated by tabs and not by spaces, as they appear in your post. I guess they are, and the spaces are only there because of Stack Overflow formatting issues?
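If the files do turn out to be separated by spaces rather than tabs, a minimal sketch along the following lines might sidestep the dialect question by splitting each line on any whitespace. The helper name `matching_lines` and the in-memory sample data are assumptions made for illustration; they are not part of the original question or answers.

```python
import io


def matching_lines(file1_lines, file2_lines):
    """Return file2 rows whose (key, sequence) pair also occurs in file1.

    Hypothetical helper: assumes file1 rows have at least 6
    whitespace-separated columns and file2 rows have at least 2.
    """
    keys = set()
    for line in file1_lines:
        cols = line.split()  # splits on tabs and spaces alike
        if len(cols) < 6:
            continue  # skip blank or malformed lines
        # Merge columns 2-5 into one key and pair it with column 6.
        keys.add((''.join(cols[1:5]), cols[5]))

    matches = []
    for line in file2_lines:
        cols = line.split()
        if len(cols) >= 2 and (cols[0], cols[1]) in keys:
            matches.append('\t'.join(cols[:2]))
    return matches


# Sample data copied from the question (space-separated on purpose).
file1 = io.StringIO(
    "AK163828 chr5 s1 + e1 cttt 4\n"
    "AK163828 chr5 s2 + e2 gtca 4\n"
    "AK168688 chr6 s3 + e3 ggcg 4\n"
    "AK168688 chr6 s4 + e4 tctg 4\n"
)
file2 = io.StringIO(
    "chr6s3+e3 ggcg\n"
    "chr5s1+e1 cttt\n"
    "chr6s4+e4 tata\n"
    "chr5s2+e2 ggcg\n"
)

for line in matching_lines(file1, file2):
    print(line)
```

Because the pairs are kept as tuples and file2 is scanned in order, this keeps file2's line order and does not depend on the sequence column being exactly four characters long.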