将多个CSV文件读入Python熊猫数据存储器

The general use case behind the question is to read multiple CSV log files from a target directory into a single Python Pandas DataFrame for quick turnaround statistical analysis & charting. The idea for utilizing Pandas vs MySQL is to conduct this data import or append + stat analysis periodically throughout the day.

问题背后的一般用例是将目标目录中的多个CSV日志文件读入单个Python熊猫数据存储器，以便进行快速的循环统计分析和图表处理。使用熊猫vs MySQL的想法是每天定期进行数据导入或追加+ stat分析。

The script below attempts to read all of the CSV (same file layout) files into a single Pandas dataframe & adds a year column associated with each file read.

下面的脚本尝试将所有的CSV(相同的文件布局)文件读入到单个熊猫数据存储器中，并添加一个与每个读取文件相关联的年度列。

The problem with the script is it now only reads the very last file in the directory instead of the desired outcome being all files within the targeted directory.

# Assemble all of the data files into a single DataFrame & add a year field
# 2010 is the last available year
years = range(1880, 2011)

for year in years:
    path ='C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master`\\ch02\\names\\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    pieces.append(frame)

# Concatenates everything into a single Dataframe
names = pd.concat(pieces, ignore_index=True)

# Expected row total should be 1690784
names
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33838 entries, 0 to 33837
Data columns:
name      33838  non-null values
sex       33838  non-null values
births    33838  non-null values
year      33838  non-null values
dtypes: int64(2), object(2)

# Start aggregating the data at the year & gender level using groupby or pivot
total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
# Prints pivot table
total_births.tail()

Out[35]:
sex     F   M
year        
2010    1759010     1898382

3 个解决方案

#1

The append method on an instance of a DataFrame does not function the same as the append method on an instance of a list. Dataframe.append() does not occur in-place and instead returns a new object.

DataFrame实例上的append方法的功能与列表实例上的append方法不同。append()不会就地发生，而是返回一个新对象。

years = range(1880, 2011)

names = pd.DataFrame()
for year in years:
    path ='C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master`\\ch02\\names\\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    names = names.append(frame, ignore_index=True)

or you can use concat:

或者你也可以用concat:

years = range(1880, 2011)

names = pd.DataFrame()
for year in years:
    path ='C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master`\\ch02\\names\\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    names = pd.concat(names, frame, ignore_index=True)

#2

I could not get either one of the above answers to work. The first answer was close, but the line space between the second and third lines after the for weren't right. I used the below code snippet in Canopy. Also, for those who are interested... this problem came from an example in "Python for Data Analysis". (An enjoyable book so far otherwise)

我找不到上面任何一个答案。第一个答案很接近，但是第二行和第三行之间的行间距不对。我在Canopy中使用了下面的代码片段。同时，对于那些感兴趣的人……这个问题来自“用于数据分析的Python”中的一个示例。(到目前为止，这是一本令人愉快的书)

import pandas as pd

years = range(1880,2011)
columns = ['name','sex','births']
names = pd.DataFrame()

for year in years:
    path = 'C:/PythonData/pydata-book-master/pydata-book-master/ch02/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    names = names.append(frame,ignore_index=True)

#3

-3

remove the line space between:

删除线间距:

    frame = pd.read_csv(path, names=columns)

    frame['year'] = year

so it reads

因此,读

    for year in years:
        path ='C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master`\\ch02\\names\\yob%d.txt' % year
        frame = pd.read_csv(path, names=columns)
        frame['year'] = year
        names = pd.append(names, frame, ignore_index=True)

#1