How to speed up bulk insert to MS SQL Server from CSV using pyodbc

Date: 2022-05-02 11:49:45

Below is my code that I'd like some help with. I have to run it over 1,300,000 rows, which means it takes up to 40 minutes to insert ~300,000 rows.


I figure bulk insert is the route to go to speed it up? Or is it slow because I'm iterating over the rows via the for data in reader: portion?


#Opens the prepped csv file
with open (os.path.join(newpath,outfile), 'r') as f:
    #hooks csv reader to file
    reader = csv.reader(f)
    #pulls out the columns (which match the SQL table)
    columns = next(reader)
    #trims any extra spaces
    columns = [x.strip(' ') for x in columns]
    #starts SQL statement
    query = 'bulk insert into SpikeData123({0}) values ({1})'
    #puts column names in SQL query 'query'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))

    print 'Query is: %s' % query
    #starts curser from cnxn (which works)
    cursor = cnxn.cursor()
    #uploads everything by row
    for data in reader:
        cursor.execute(query, data)
        cursor.commit()

I am dynamically picking my column headers on purpose (as I would like to create the most pythonic code possible).


SpikeData123 is the table name.


3 Answers

#1 (score: 20)

BULK INSERT will almost certainly be much faster than reading the source file row-by-row and doing a regular INSERT for each row. However, both BULK INSERT and BCP have a significant limitation regarding CSV files in that they cannot handle text qualifiers (ref: here). That is, if your CSV file does not have qualified text strings in it ...


1,Gord Thompson,2015-04-15
2,Bob Loblaw,2015-04-07

... then you can BULK INSERT it, but if it contains text qualifiers (because some text values contain commas) ...


1,"Thompson, Gord",2015-04-15
2,"Loblaw, Bob",2015-04-07

... then BULK INSERT cannot handle it. Still, it might be faster overall to pre-process such a CSV file into a pipe-delimited file ...


1|Thompson, Gord|2015-04-15
2|Loblaw, Bob|2015-04-07

... or a tab-delimited file (where → represents the tab character) ...


1→Thompson, Gord→2015-04-15
2→Loblaw, Bob→2015-04-07

... and then BULK INSERT that file. For the latter (tab-delimited) file the BULK INSERT code would look something like this:


import pypyodbc
conn_str = "DSN=myDb_SQLEXPRESS;"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()
sql = """
BULK INSERT myDb.dbo.SpikeData123
FROM 'C:\\__tmp\\biTest.txt' WITH (
    FIELDTERMINATOR='\\t',
    ROWTERMINATOR='\\n'
    );
"""
crsr.execute(sql)
cnxn.commit()
crsr.close()
cnxn.close()
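
The answer does not show the pre-processing step itself. As a minimal sketch (with hypothetical file paths, and assuming the field values contain no tabs or newlines), converting a CSV with text qualifiers into a tab-delimited file could look like this:

import csv

# Hypothetical paths for illustration; adjust to your environment.
src = r'C:\__tmp\source.csv'   # CSV with quoted ("qualified") text values
dst = r'C:\__tmp\biTest.txt'   # tab-delimited output for BULK INSERT

with open(src, 'r', newline='') as fin, open(dst, 'w') as fout:
    reader = csv.reader(fin)   # csv.reader understands the text qualifiers
    for row in reader:
        # Assumes the values themselves contain no tab or newline characters;
        # if they might, choose a different delimiter or clean the values first.
        fout.write('\t'.join(row) + '\n')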

Note: As mentioned in a comment, executing a BULK INSERT statement is only applicable if the SQL Server instance can directly read the source file. For cases where the source file is on a remote client, see this answer.


#2 (score: 16)

As noted in a comment to another answer, the T-SQL BULK INSERT command will only work if the file to be imported is on the same machine as the SQL Server instance or is in an SMB/CIFS network location that the SQL Server instance can read. Thus it may not be applicable in the case where the source file is on a remote client.


pyodbc 4.0.19 added a Cursor#fast_executemany feature which may be helpful in that case. fast_executemany is "off" by default, and the following test code ...


import time

import pyodbc

conn_str = "DSN=myDb_SQLEXPRESS;"  # assumed DSN, as in answer #1; adjust as needed
cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")

sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')

... took approximately 22 seconds to execute on my test machine. Simply adding crsr.fast_executemany = True ...


cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")

crsr.fast_executemany = True  # new in pyodbc 4.0.19

sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')

... reduced the execution time to just over 1 second.

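Applied back to the CSV-loading scenario in the question, a hedged sketch of using fast_executemany might look like the following. The table and column handling follows the question's code; the chunk size, and the assumption that the CSV values are compatible with the target column types, are mine:

import csv
import os

import pyodbc

# conn_str, newpath and outfile are assumed to be defined as in the question.
cnxn = pyodbc.connect(conn_str, autocommit=False)
crsr = cnxn.cursor()
crsr.fast_executemany = True  # requires pyodbc 4.0.19 or later

with open(os.path.join(newpath, outfile), 'r', newline='') as f:
    reader = csv.reader(f)
    columns = [x.strip(' ') for x in next(reader)]
    sql = 'INSERT INTO SpikeData123 ({0}) VALUES ({1})'.format(
        ','.join(columns), ','.join('?' * len(columns)))

    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == 10000:      # arbitrary chunk size
            crsr.executemany(sql, batch)
            batch = []
    if batch:
        crsr.executemany(sql, batch)

cnxn.commit()

Committing once at the end, rather than after every row as in the question's loop, also avoids a large amount of per-row transaction overhead.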

#3 (score: 1)

Yes, bulk insert is the right path for loading large files into a DB. At a glance I would say that the reason it takes so long is, as you mentioned, that you are looping over each row of data from the file, which effectively removes the benefits of using a bulk insert and makes it behave like a normal insert. Just remember that, as its name implies, it is used to insert chunks of data. I would remove the loop and try again.


Also, I'd double-check your syntax for bulk insert, as it doesn't look correct to me. Check the SQL that is generated by pyodbc, as I have a feeling that it might only be executing a normal insert.


Alternatively, if it is still slow, I would try using bulk insert directly from SQL: either load the whole file into a temp table with bulk insert and then insert the relevant columns into the right tables, or use a mix of bulk insert and bcp (or OPENROWSET) to get the specific columns inserted. A sketch of the staging-table approach is shown below.

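As an illustration of that staging-table idea, here is a hedged sketch; the connection string, staging table and column names are hypothetical, and (as discussed in answer #1) the file must be readable by the SQL Server instance itself:

import pyodbc

# Hypothetical DSN and object names, for illustration only.
cnxn = pyodbc.connect("DSN=myDb_SQLEXPRESS;", autocommit=True)
crsr = cnxn.cursor()

# 1) Bulk-load the whole file into a staging table with the same layout as the file.
crsr.execute("""
    BULK INSERT dbo.SpikeData123_staging
    FROM 'C:\\__tmp\\biTest.txt'
    WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n');
""")

# 2) Copy only the relevant columns into the real table.
crsr.execute("""
    INSERT INTO dbo.SpikeData123 (col1, col2, col3)
    SELECT col1, col2, col3
    FROM dbo.SpikeData123_staging;
""")

crsr.close()
cnxn.close()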
