Filtering a CSV file in Python

I have downloaded this CSV file, which is a spreadsheet of gene information. The important part is the HLA-* columns, which hold the gene information. If a gene's resolution is too low, e.g. DQB1*03, the row should be deleted. If the resolution is too high, e.g. DQB1*03:02:01, the trailing :01 needs to be removed. Ideally, I want the proteins in the format DQB1*03:02, with two levels of resolution after DQB1*. How can I tell Python to look for these formats regardless of the actual values stored in them? E.g.:

if (csvCell is of format DQB1*03:02:01):
    delete the :01  # but do this in a general format
elif (csvCell is of format DQB1*03):
    delete row
else:
    goto next line

UPDATE: Edited code I referenced

import csv
import re
# Python 3: open CSV files in text mode with newline=''; the output needs mode 'w'
with open('mhc.csv', newline='') as infile, open('mhc_fixed.csv', 'w', newline='') as outfile:
    csvdictreader = csv.DictReader(infile, delimiter=',')
    csvdictwriter = csv.DictWriter(outfile, fieldnames=csvdictreader.fieldnames, delimiter=',')
    csvdictwriter.writeheader()
    targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-D')]
    for rowfields in csvdictreader:
        keep = True
        for field in targets:
            value = rowfields[field]
            if re.match(r'^\w+\*\d\d$', value):  # gene resolution too low?
                keep = False
                break  # quit processing target fields
            elif re.match(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', value):
                # four fields present: keep only the first two
                rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', r'\1*\2:\3', value)
            else:
                # reduce gene resolution if too high
                # by keeping only the first two fields if three are present
                rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+)$', r'\1*\2:\3', value)
        if keep:
            csvdictwriter.writerow(rowfields)

2 Answers

#1

Here's something that I think will do what you want. It's not as simple as Peter's answer because it uses Python's csv module to process the file. It could probably be simplified to treat the file as plain text, as his does, and that should be easy to do.

import csv
import re
import sys
csvdictreader = csv.DictReader(sys.stdin, delimiter=',')
csvdictwriter = csv.DictWriter(sys.stdout, fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-')]
for rowfields in csvdictreader:
    keep = True
    for field in targets:
        value = rowfields[field]
        if re.match(r'^DQB1\*\d\d$', value):  # gene resolution too low?
            keep = False
            break  # quit processing target fields
        else:
            # reduce gene resolution if too high
            # by keeping only the first two fields if three are present
            rowfields[field] = re.sub(r'^DQB1\*(\d\d):(\d\d):(\d\d)$',
                                      r'DQB1*\1:\2', value)
    if keep:
        csvdictwriter.writerow(rowfields)

The hardest part for me was determining what you wanted to do.
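
If the HLA-* columns hold genes other than DQB1, or alleles with four fields, the per-cell test could be pulled out into a small helper. The following is only a sketch under those assumptions: `normalize_allele` and the `ALLELE` pattern are hypothetical names, not part of the code above, and the pattern assumes every allele looks like a gene name, a `*`, and one or more colon-separated numeric fields.

```python
import re

# Hypothetical pattern: GENE*ff, GENE*ff:ff, GENE*ff:ff:ff, ...
ALLELE = re.compile(r'^(\w+)\*(\d+)((?::\d+)*)$')


def normalize_allele(value):
    """Return the two-field form of an allele, or None if it has only one field.

    Values that do not look like an allele (e.g. empty cells) are returned unchanged.
    """
    match = ALLELE.match(value)
    if match is None:
        return value  # not an allele code; leave as-is
    gene, first, rest = match.groups()
    fields = rest.split(':')[1:]  # rest is '' or e.g. ':02:01' -> ['02', '01']
    if not fields:
        return None  # resolution too low, e.g. DQB1*03
    return '{}*{}:{}'.format(gene, first, fields[0])
```

Inside the loop over the target columns, a `None` result would mean the row is dropped, and any other result would be written back into the row.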

#2

Here's an ultra-simple filter:

import sys
for line in sys.stdin:
    line = line.replace(',DQB1*03:02:01,', ',DQB1*03:02,')
    if line.find(',DQB1*03,') == -1:
        sys.stdout.write(line)

Or, if you want to use regular expressions:

import re
import sys
for line in sys.stdin:
    line = re.sub(r',DQB1\*03:02:01,', ',DQB1*03:02,', line)
    if re.search(r',DQB1\*03,', line) is None:
        sys.stdout.write(line)

Run it as:

python script.py < data.csv
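
The filters above only handle the literal values DQB1*03:02:01 and DQB1*03, and only when the cell has a comma on both sides. A rough sketch of a more general line-based version follows. `filter_line`, `LOW_RES` and `HIGH_RES` are hypothetical names, and the patterns assume plain unquoted cells separated by commas, so a quoted cell containing a comma would not be handled correctly.

```python
import re

# Hypothetical patterns. A cell starts at the line start or after a comma,
# and ends at a comma, a line break, or the end of the string.
LOW_RES = re.compile(r'(?<![^,])\w+\*\d+(?![^,\r\n])')
HIGH_RES = re.compile(r'(?<![^,])(\w+\*\d+:\d+)(?::\d+)+(?![^,\r\n])')


def filter_line(line):
    """Return the line with alleles cut to two fields, or None to drop the line."""
    if LOW_RES.search(line):
        return None  # some cell has only one field, e.g. DQB1*03
    # Strip everything after the second field, e.g. DQB1*03:02:01 -> DQB1*03:02
    return HIGH_RES.sub(r'\1', line)
```

In the loop over `sys.stdin`, each line would be passed through `filter_line` and written out only when the result is not `None`. Unlike the csv-based answer, this looks at every column, not just the HLA-* ones.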
