Read specific columns from a csv file with the csv module?

I'm trying to parse through a csv file and extract the data from only specific columns.

Example csv:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |

I'm trying to capture only specific columns, say ID, Name, Zip and Phone.

Code I've looked at has led me to believe I can call a specific column by its corresponding number, e.g. Name would correspond to 2, and iterating through each row using row[2] would produce all the items in column 2. Only it doesn't.

Here's what I've done so far:

import sys, argparse, csv
from settings import *

# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',\
 fromfile_prefix_chars="@" )
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file

# open csv file
with open(csv_file, 'rb') as csvfile:

    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]

    num_columns = len(array)
    csvfile.seek(0)

    reader = csv.reader(csvfile, delimiter=' ')
        included_cols = [1, 2, 6, 7]

    for row in reader:
            content = list(row[i] for i in included_cols)
            print content

I'm expecting this to print only the specific columns I want for each row, but it doesn't: I get only the last column.

7 Answers

#1

128

The only way you would be getting the last column from this code is if you don't include your print statement in your for loop.

This is most likely the end of your code:

for row in reader:
    content = list(row[i] for i in included_cols)
print content

You want it to be this:

for row in reader:
        content = list(row[i] for i in included_cols)
        print content
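
For reference, here is a minimal Python 3 sketch of the corrected loop. The sample data and the helper name select_columns are made up for illustration, and it assumes a comma-delimited file. Note that row indexes are 0-based, so Name is column 1 rather than 2.

```python
import csv
import io

# Hypothetical sample data standing in for the real file (comma-delimited).
SAMPLE = """ID,Name,Address,City,State,Zip,Phone,OPEID,IPEDS
10,Carl,130 W Main,Mobile,AL,36601,334-555-0100,01023,10063
11,Dana,22 Oak St,Selma,AL,36701,334-555-0101,01024,10064
"""

# Row indexes are 0-based: ID=0, Name=1, Zip=5, Phone=6.
included_cols = [0, 1, 5, 6]


def select_columns(text, cols):
    """Return every row of the csv text reduced to the given column indexes."""
    reader = csv.reader(io.StringIO(text))
    return [[row[i] for i in cols] for row in reader]


rows = select_columns(SAMPLE, included_cols)
for row in rows:
    # The print call sits inside the loop, so every row is shown.
    print(row)
```

With a real file, the io.StringIO(text) object would be replaced by one opened with open(csv_file, newline='').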

Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.

Pandas is spectacular for dealing with csv files, and the following code would be all you need to read a csv and save an entire column into a variable:

import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']

So if you wanted to save all of the info in your Names column into a variable, this is all you need to do:

names = df.Names

It's a great module and I suggest you look into it. If your print statement was in fact inside the for loop and it was still printing only the last column, which shouldn't happen, let me know that my assumption was wrong. Your posted code has a lot of indentation errors, so it was hard to know what was supposed to be where. Hope this was helpful!

#2

import csv
from collections import defaultdict

columns = defaultdict(list) # each value in each column is appended to a list

with open('file.txt') as f:
    reader = csv.DictReader(f) # read rows into a dictionary format
    for row in reader: # read a row as {column1: value1, column2: value2,...}
        for (k,v) in row.items(): # go over each column name and value 
            columns[k].append(v) # append the value into the appropriate list
                                 # based on column name k

print(columns['name'])
print(columns['phone'])
print(columns['street'])

With a file like

name,phone,street
Bob,0893,32 Silly
James,000,400 McHilly
Smithers,4442,23 Looped St.

Will output

>>> 
['Bob', 'James', 'Smithers']
['0893', '000', '4442']
['32 Silly', '400 McHilly', '23 Looped St.']
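
If only some of the named columns are needed, the same DictReader idea can be narrowed down. The following is a sketch in Python 3; the helper name pick_fields and the wanted list are illustrative assumptions, and the sample text mirrors the file shown above.

```python
import csv
import io

# Same sample content as the file above, held in a string for the example.
SAMPLE = """name,phone,street
Bob,0893,32 Silly
James,000,400 McHilly
Smithers,4442,23 Looped St.
"""

# Column names to keep (illustrative choice).
wanted = ["name", "phone"]


def pick_fields(text, fields):
    """Return one dict per row containing only the requested column names."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: row[key] for key in fields} for row in reader]


picked = pick_fields(SAMPLE, wanted)
for record in picked:
    print(record)
```

A KeyError here would mean one of the names in wanted does not match the header line exactly, which is often caused by stray spaces in the header.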

Or alternatively if you want numerical indexing for the columns:

with open('file.txt') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row (works in Python 2.6+ and 3)
    for row in reader:
        for (i,v) in enumerate(row):
            columns[i].append(v)
print(columns[0])

>>> 
['Bob', 'James', 'Smithers']

To change the delimiter, add delimiter=" " to the appropriate instantiation, i.e. reader = csv.reader(f, delimiter=" ")
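
As a sketch of the delimiter option applied to pipe-separated data like the question's example (Python 3; the sample values and the helper name read_pipe_rows are assumptions for illustration). The cells are stripped because the padding spaces around each "|" are otherwise kept, and a trailing "|" produces an empty last field.

```python
import csv
import io

# Hypothetical pipe-delimited sample with padding spaces and a trailing "|".
SAMPLE = """ID | Name | Zip | Phone |
10 | Adam | 36601 | 334-555-0100 |
11 | Carl | 36701 | 334-555-0101 |
"""


def read_pipe_rows(text):
    """Parse pipe-delimited text and strip the padding around every cell."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [[cell.strip() for cell in row] for row in reader]


rows = read_pipe_rows(SAMPLE)
# Skip the header row, then take column 1 (Name) from each data row.
names = [row[1] for row in rows[1:]]
print(names)
```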

#3

You can use numpy.loadtxt(filename). For example, if this is your .csv database:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | Adam | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Carl | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Adolf | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Den | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |

And you want the Name column:

import numpy as np 
b=np.loadtxt(r'filepath\name.csv',dtype=str,delimiter='|',skiprows=1,usecols=(1,))

>>> b
array([' Adam ', ' Carl ', ' Adolf ', ' Den '], 
      dtype='|S7')

More easily, you can use genfromtxt:

b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True,dtype=None)
>>> b['Name']
array([' Adam ', ' Carl ', ' Adolf ', ' Den '], 
      dtype='|S7')

#4

Use pandas:

import pandas as pd
my_csv = pd.read_csv(filename)
column = my_csv.column_name
# you can also use my_csv['column_name']

A bit more memory-friendly solution, if you really need those bytes (throws away unneeded columns at parse time):

my_filtered_csv = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])

P.S. I'm just aggregating what others have said in a simple manner. Actual answers are taken from here and here.

#5

With pandas you can use read_csv with usecols parameter:

df = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])

Example:

import pandas as pd
import io

s = '''
total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
'''

df = pd.read_csv(io.StringIO(s), usecols=['total_bill', 'day', 'size'])
print(df)

   total_bill  day  size
0       16.99  Sun     2
1       10.34  Sun     3
2       21.01  Sun     3

#6

Context: For this type of work you should use the amazing Python petl library. That will save you a lot of work and potential frustration from doing things 'manually' with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.), which is fine, but if you plan to work with a lot of data from various strange sources in your career, learning something like petl is one of the best investments you can make. Getting started should only take 30 minutes after you've done pip install petl. The documentation is excellent.

Answer: Let's say you have the first table in a csv file (you can also load directly from the database using petl). Then you would simply load it and do the following.

from petl import fromcsv, look, cut, tocsv 

# Load the table
table1 = fromcsv('table1.csv')
# Keep only the columns of interest
table2 = cut(table1, 'Song_Name', 'Artist_ID')
# Have a quick look to make sure things are OK; prints a nicely formatted table to the console
print(look(table2))
# Save to a new file
tocsv(table2, 'new.csv')

#7

To fetch the column names, use readline() instead of readlines(). This avoids looping over the file, reading it completely, and storing it all in an array.

with open(csv_file, 'rb') as csvfile:

    # get number of columns

    line = csvfile.readline()

    first_item = line.split(',')
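
A related sketch, assuming Python 3: letting csv.reader parse the first line instead of splitting it by hand also copes with quoted fields that contain commas. The helper name read_header and the sample text are illustrative, not part of the original code.

```python
import csv
import io

# Hypothetical sample; only the first line is actually read.
SAMPLE = (
    "ID,Name,Address,City,State,Zip,Phone,OPEID,IPEDS\n"
    "10,Carl,130 W Main,Mobile,AL,36601,334-555-0100,01023,10063\n"
)


def read_header(fileobj):
    """Return the first row of a csv file object as a list of column names."""
    reader = csv.reader(fileobj)
    return next(reader)


header = read_header(io.StringIO(SAMPLE))
num_columns = len(header)
# Look up a column's 0-based index by name instead of hard-coding it.
name_index = header.index("Name")
print(num_columns, name_index)
```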
