I am trying to use Pandas' df.to_sql and SQLite3 in Python to put about 2GB of data, roughly 16 million rows, into a database. My strategy has been to chunk the original CSV into smaller dataframes, perform some operations on them, and then append them to the SQL database.
As I run this code, it starts out fast but quickly slows down. After about 3 million rows it slows down to the point that it doesn't seem like it will finish in any realistic amount of time. What is causing this, and what can I do about it? My code is below:
import os
import sqlite3
import datetime as dt
import pandas as pd

def chunk_read_CSV_to_db(database, table, filepath, chunksize, delimiter=','):
    start = dt.datetime.now()
    conn = sqlite3.connect(database)
    index_start = 1
    j = 0
    for df in pd.read_csv(filepath, chunksize=chunksize, iterator=True, encoding='utf-8', sep=delimiter):
        j += 1
        print '{} seconds: complete {} rows'.format((dt.datetime.now() - start).seconds, j * chunksize)
        df.to_sql(name=table, con=conn, flavor='sqlite', if_exists='append')
    conn.close()

db_name = 'store_data.db'
f9 = 'xrf_str_geo_ta4_1511.txt'
chunksize = 20000
# fp is the folder containing the data files (defined elsewhere)
chunk_read_CSV_to_db(os.path.join(fp, db_name), os.path.splitext(f9)[0], os.path.join(fp, f9), chunksize=chunksize, delimiter='\t')
2 Answers
#1
I switched over to SQLAlchemy and had no timing problems after that; there is no noticeable slowdown. The code is below.
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, Numeric

def chunk_read_CSV_to_db(database, table, filepath, chunksize, delimiter=',', index=False):
    start = dt.datetime.now()
    index_start = 1
    j = 0
    for df in pd.read_csv(filepath, chunksize=chunksize, iterator=True, encoding='utf-8', sep=delimiter):
        j += 1
        print '{} seconds: complete {} rows'.format((dt.datetime.now() - start).seconds, j * chunksize)
        # write through the SQLAlchemy engine passed in as `database`
        df.to_sql(table, database, flavor='sqlite', if_exists='append', index=index)

db = create_engine('sqlite:///store_data.db')
meta = MetaData(bind=db)
table_pop = Table('xrf_str_geo_ta4_1511', meta,
                  Column('TDLINX', Integer, nullable=True),
                  Column('GEO_ID', Integer, nullable=True),
                  Column('PERCINCL', Numeric, nullable=True)
                  )

chunksize = 20000
# fp and f9 are the directory and filename from the question
chunk_read_CSV_to_db(db, 'xrf_str_geo_ta4_1511', os.path.join(fp, f9), chunksize=chunksize, delimiter='\t')
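Note: the flavor argument was removed in later pandas versions. For anyone running this today, a minimal sketch of the same approach on a current pandas/SQLAlchemy stack might look like the following; the engine setup and file path here are assumptions for illustration, not part of the original answer.

import datetime as dt
import pandas as pd
from sqlalchemy import create_engine

def chunk_read_csv_to_db(engine, table, filepath, chunksize, delimiter=','):
    # Stream the CSV in chunks and append each chunk; index=False keeps pandas
    # from writing the dataframe index (and SQLite from maintaining an index on it).
    start = dt.datetime.now()
    for j, df in enumerate(pd.read_csv(filepath, chunksize=chunksize, sep=delimiter, encoding='utf-8'), start=1):
        print('{} seconds: complete {} rows'.format((dt.datetime.now() - start).seconds, j * chunksize))
        df.to_sql(table, engine, if_exists='append', index=False)

engine = create_engine('sqlite:///store_data.db')
chunk_read_csv_to_db(engine, 'xrf_str_geo_ta4_1511', 'xrf_str_geo_ta4_1511.txt', chunksize=20000, delimiter='\t')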
#2
So I know this answer will no longer be relevant to the author, but I stumbled across it because I had exactly the same problem and wanted to share my answer.
I was trying to load ~900 .csv files into an SQL database one by one, using the append method. The loading started fast, but slowed down exponentially and never finished running. This made me suspect something was going wrong with indexing (i.e. pandas was somehow re-indexing things every time I appended data), because that was the only thing I could think of to explain the slowdown (memory seemed to be fine).
Eventually I started using the sqlite3 .indexes and .dbinfo commands on the command line to compare databases created through pandas with those created through sqlite3 directly. What I found is that the pandas-created databases had 1 index, compared to 0 when the data was loaded through sqlite3. Also, the schema size was way bigger.
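For anyone who wants to do the same check from Python instead of the sqlite3 shell, a small sketch using the built-in sqlite3 module can list a table's indexes; this is not part of the original answer, and the database and table names below are just the ones from the question.

import sqlite3

# List any indexes SQLite knows about for the table: a pandas load with the
# default index=True typically shows one auto-created index here, index=False shows none.
conn = sqlite3.connect('store_data.db')
rows = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = ?",
    ('xrf_str_geo_ta4_1511',)
).fetchall()
print(rows)
conn.close()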
Now, the pandas to_sql method has an index argument. The docs say that this argument simply adds the dataframe index as a column in the database (which sounds innocuous enough). But it turns out that it also uses that column as a database index, and it seems like if you're using the append method then maybe it recalculates this index every time (or something). Regardless, when I set the index argument to False, .dbinfo shows 0 indexes in the resulting database, and my problem disappeared - all the data was processed in a very short time.
So the solution would be:
df.to_sql(name=table, con=conn, flavor='sqlite', if_exists='append', index=False)
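If a database index on some column is actually needed for later queries, a common follow-up (not from the original answer; the column name below is only an assumed example, taken from the table definition in answer #1) is to load everything with index=False and build the index once at the end:

import sqlite3

# Build the index once, after all chunks have been appended, instead of
# having SQLite maintain it on every append.
conn = sqlite3.connect('store_data.db')
conn.execute('CREATE INDEX IF NOT EXISTS idx_geo_id ON xrf_str_geo_ta4_1511 (GEO_ID)')
conn.commit()
conn.close()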