Reading specific columns from a CSV file with Python

Time: 2021-08-17 20:29:55

I have a csv file that looks like this:

+-----+-----+-----+-----+-----+-----+-----+-----+
| AAA | bbb | ccc | DDD | eee | FFF | GGG | hhh |
+-----+-----+-----+-----+-----+-----+-----+-----+
|   1 |   2 |   3 |   4 |  50 |   3 |  20 |   4 |
|   2 |   1 |   3 |   5 |  24 |   2 |  23 |   5 |
|   4 |   1 |   3 |   6 |  34 |   1 |  22 |   5 |
|   2 |   1 |   3 |   5 |  24 |   2 |  23 |   5 |
|   2 |   1 |   3 |   5 |  24 |   2 |  23 |   5 |
+-----+-----+-----+-----+-----+-----+-----+-----+

...

...

How can I read only the columns "AAA,DDD,FFF,GGG" in Python and skip the header? The output I want is a list of tuples that looks like this: [(1,4,3,20),(2,5,2,23),(4,6,1,22)]. I plan to write this data to a SQL database later.

I referred to this post: Read specific columns from a csv file with csv module?. But I don't think it helps in my case. Since my .csv is pretty big, with a whole bunch of columns, I hope I can give Python the column names I want so that it reads those specific columns row by row for me.

7 answers

#1


5  

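import csv
from collections import namedtuple

def read_csv(file, columns, type_name="Row"):
  try:
    # _make builds a named tuple from a single iterable of values
    row_type = namedtuple(type_name, columns)._make
  except ValueError:
    # Column names that are not valid identifiers: fall back to plain tuples
    row_type = tuple
  rows = csv.reader(file)
  header = next(rows)  # the first row holds the column names
  mapping = [header.index(x) for x in columns]
  for row in rows:
    yield row_type(row[i] for i in mapping)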

Example:

>>> import csv
>>> from collections import namedtuple
>>> from io import StringIO
>>> def read_csv(file, columns, type_name="Row"):
...   try:
...     row_type = namedtuple(type_name, columns)._make
...   except ValueError:
...     row_type = tuple
...   rows = csv.reader(file)
...   header = next(rows)
...   mapping = [header.index(x) for x in columns]
...   for row in rows:
...     yield row_type(row[i] for i in mapping)
... 
>>> testdata = """\
... AAA,bbb,ccc,DDD,eee,FFF,GGG,hhh
... 1,2,3,4,50,3,20,4
... 2,1,3,5,24,2,23,5
... 4,1,3,6,34,1,22,5
... 2,1,3,5,24,2,23,5
... 2,1,3,5,24,2,23,5
... """
>>> testfile = StringIO(testdata)
>>> for row in read_csv(testfile, "AAA GGG DDD".split()):
...   print(row)
... 
Row(AAA='1', GGG='20', DDD='4')
Row(AAA='2', GGG='23', DDD='5')
Row(AAA='4', GGG='22', DDD='6')
Row(AAA='2', GGG='23', DDD='5')
Row(AAA='2', GGG='23', DDD='5')

#2


5  

I realize the answer has been accepted, but if you really want to read specific named columns from a csv file, you should use a DictReader (if you're not using Pandas that is).

import csv
from io import StringIO

columns = 'AAA,DDD,FFF,GGG'.split(',')


testdata ='''\
AAA,bbb,ccc,DDD,eee,FFF,GGG,hhh
1,2,3,4,50,3,20,4
2,1,3,5,24,2,23,5
4,1,3,6,34,1,22,5
2,1,3,5,24,2,23,5
2,1,3,5,24,2,23,5
'''

reader = csv.DictReader(StringIO(testdata))

desired_cols = (tuple(row[col] for col in columns) for row in reader)

Output:

>>> list(desired_cols)
[('1', '4', '3', '20'),
 ('2', '5', '2', '23'),
 ('4', '6', '1', '22'),
 ('2', '5', '2', '23'),
 ('2', '5', '2', '23')]
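
The question asks for integers rather than strings and mentions loading them into a SQL database later. Here is a minimal sketch that builds on the DictReader approach above, assuming every selected cell parses with int(). The helper name read_int_columns, the in-memory SQLite database, and the table name data are illustrative assumptions, not part of the original answer:

```python
import csv
import sqlite3
from io import StringIO

COLUMNS = ["AAA", "DDD", "FFF", "GGG"]

TESTDATA = """\
AAA,bbb,ccc,DDD,eee,FFF,GGG,hhh
1,2,3,4,50,3,20,4
2,1,3,5,24,2,23,5
4,1,3,6,34,1,22,5
"""


def read_int_columns(file, columns):
    """Yield one tuple of ints per data row, keeping only `columns`.

    Hypothetical helper; assumes each selected cell parses with int().
    """
    for record in csv.DictReader(file):
        yield tuple(int(record[name]) for name in columns)


rows = list(read_int_columns(StringIO(TESTDATA), COLUMNS))
print(rows)

# Load the tuples into an in-memory SQLite table (table name is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (AAA INTEGER, DDD INTEGER, FFF INTEGER, GGG INTEGER)"
)
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
conn.commit()
stored = conn.execute(
    "SELECT AAA, DDD, FFF, GGG FROM data ORDER BY rowid"
).fetchall()
conn.close()
```

For a real file, the StringIO object would be replaced by an open file handle; because the helper is a generator, rows can also be passed to executemany without first building a list.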

#3


1  

import csv

DESIRED_COLUMNS = ('AAA', 'DDD', 'FFF', 'GGG')

results = []
with open("myfile.csv", newline="") as f:
    reader = csv.reader(f)
    headers = None
    for row in reader:
        if headers is None:
            # First row: store the indices of the columns of interest
            headers = [i for i, col in enumerate(row) if col in DESIRED_COLUMNS]
        else:
            results.append(tuple(row[i] for i in headers))

print(results)

#4


1  

Context: For this type of work you should use the amazing Python petl library. It will save you a lot of the work and potential frustration of doing things 'manually' with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.). That is fine, but if you plan to work with a lot of data from various strange sources in your career, learning something like petl is one of the best investments you can make. Getting started should take only about 30 minutes once you have run pip install petl. The documentation is excellent.

Answer: Let's say you have the first table in a csv file (you can also load directly from the database using petl). Then you would simply load it and do the following.

from petl import fromcsv, look, cut, tocsv

# Load the table
table1 = fromcsv('table1.csv')
# Keep only the columns of interest
table2 = cut(table1, 'Song_Name', 'Artist_ID')
# Have a quick look to make sure things are OK; prints a nicely formatted table to your console
print(look(table2))
# Save to a new file
tocsv(table2, 'new.csv')

#5


0  

If your files and requirements are relatively simple and fixed, then once you know the desired columns, I would use split() to divide each data line into a list of column entries:

alist = aline.split('|')

I would then use the desired column indices to get the column entries from the list, process each with strip() to remove the whitespace, convert it to the desired format (it looks like your data has integer values), and create the tuples.

As I said, I am assuming that your requirements are relatively fixed. The more complicated they are, or the more likely they are to change, the more likely it is to be worth your time to pick up a library made for manipulating this type of data.
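
The split() and strip() steps described above can be sketched as follows, assuming the file really contains the '|'-delimited text table shown in the question and that the wanted columns sit at fixed positions. The function name parse_table and the index tuple WANTED are illustrative, not from the original answer:

```python
TABLE = """\
+-----+-----+-----+-----+-----+-----+-----+-----+
| AAA | bbb | ccc | DDD | eee | FFF | GGG | hhh |
+-----+-----+-----+-----+-----+-----+-----+-----+
|   1 |   2 |   3 |   4 |  50 |   3 |  20 |   4 |
|   2 |   1 |   3 |   5 |  24 |   2 |  23 |   5 |
|   4 |   1 |   3 |   6 |  34 |   1 |  22 |   5 |
+-----+-----+-----+-----+-----+-----+-----+-----+
"""

WANTED = (0, 3, 5, 6)  # assumed positions of AAA, DDD, FFF, GGG


def parse_table(text, indices):
    """Return a list of int tuples from a '|'-delimited text table."""
    result = []
    header_seen = False
    for aline in text.splitlines():
        if not aline.startswith("|"):
            continue  # skip border lines such as +-----+
        # Drop the empty strings produced by the leading and trailing '|',
        # then strip the padding around each cell.
        alist = [cell.strip() for cell in aline.split("|")[1:-1]]
        if not header_seen:
            header_seen = True  # the first '|' line holds the column names
            continue
        result.append(tuple(int(alist[i]) for i in indices))
    return result


parsed = parse_table(TABLE, WANTED)
print(parsed)
```

This works only while the layout stays exactly as shown; a cell that itself contains '|' or a non-integer value would break it, which is the point at which a dedicated library becomes worthwhile.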

#6


0  

All other answers are good, but I think it would be better to not load all data at the same time because the csv file could be really huge. I suggest using a generator.

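import csv

def read_csv(f, cols):
    """Lazily yield the requested column indices from each row."""
    reader = csv.reader(f)
    for row in reader:
        if not row:
            continue  # skip blank lines
        # A single-field row means the line is whitespace-delimited rather
        # than comma-delimited, so split it manually.
        if len(row) == 1:
            row = row[0].split()
        yield tuple(row[c] for c in cols)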

It can then be used in a for loop:

with open('path/to/test.csv', newline='') as f:
    for bbb, ccc in read_csv(f, [1, 2]):
        print(bbb, ccc)

Of course, you can enhance this function to accept column names instead of indices. To do so, combine Brad M's answer with mine.
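
One possible shape for that combination is sketched below, assuming a comma-delimited file whose first row is the header. The function name read_csv_by_name and the sample data are illustrative, not taken from either answer:

```python
import csv
from io import StringIO


def read_csv_by_name(f, names):
    """Lazily yield a tuple of the named columns for each data row."""
    reader = csv.reader(f)
    header = next(reader)  # consume the header row
    # header.index raises ValueError if a requested name is missing
    indices = [header.index(name) for name in names]
    for row in reader:
        yield tuple(row[i] for i in indices)


sample = StringIO("AAA,bbb,ccc,DDD\n1,2,3,4\n2,1,3,5\n")
pairs = list(read_csv_by_name(sample, ["bbb", "ccc"]))
print(pairs)
```

Because the header lookup happens once, before the loop, each data row costs only a few index operations, and rows are still produced one at a time.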

#7


0  

I think it will help.

CSV

1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00

code

import csv   

def get_csv(file_name, names=None, usecols=None, mode='r', encoding="utf8",
            quoting=csv.QUOTE_ALL,
            delimiter=',',
            as_obj=False):

    class RowObject:
        def __init__(self, **entries):
            self.__dict__.update(entries)

    with open(file_name, mode=mode, encoding=encoding) as csvfile:
        data_reader = csv.reader(csvfile, quoting=quoting, delimiter=delimiter)
        for row in data_reader:
            if usecols and names:
                q = dict(zip(names, (row[i] for i in usecols)))
                yield q if not as_obj else RowObject(**q)
            elif usecols and not names:
                yield list(row[i] for i in usecols)
            elif names and not usecols:
                q = dict(zip(names, row))
                yield q if not as_obj else RowObject(**q)
            else:
                yield row

example

filename = "/csv_exe/csv.csv"
vs = get_csv(filename, names=('f1', 'f2', 'f3', 'f4', 'f5'))
for item in vs:
    print(item)

result

{'f1': '1997', 'f4': 'ac, abs, moon', 'f3': 'E350', 'f2': 'Ford', 'f5': '3000.00'}
{'f1': '1999', 'f4': '', 'f3': 'Venture "Extended Edition"', 'f2': 'Chevy', 'f5': '4900.00'}
{'f1': '1996', 'f4': 'MUST SELL! air, moon roof, loaded', 'f3': 'Grand Cherokee', 'f2': 'Jeep', 'f5': '4799.00'}

example2

vs = get_csv(filename, names=('f1', 'f2'), usecols=(0, 4))

result2

{'f1': '1997', 'f2': '3000.00'}
{'f1': '1999', 'f2': '4900.00'}
{'f1': '1996', 'f2': '4799.00'}

example3

vs = get_csv(filename, names=('f1', 'f2'), usecols=(0, 2), as_obj=True)

result3

<__main__.get_csv.<locals>.RowObject object at 0x01408ED0>
<__main__.get_csv.<locals>.RowObject object at 0x01408E90>
<__main__.get_csv.<locals>.RowObject object at 0x01408F10>

for item in vs:
    print(item.f2)

E350
Venture "Extended Edition"
Grand Cherokee
