MySQL ResultSets are by default retrieved completely from the server before any work can be done. In cases of huge result sets this becomes unusable. I would like instead to actually retrieve the rows one by one from the server.
In Java, following the instructions here (under "ResultSet"), I create a statement like this:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                            java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
This works nicely in Java. My question is: is there a way to do the same in python?
One thing I tried is to limit the query to 1000 rows at a time, like this:
start_row = 0
while True:
    cursor = conn.cursor()
    cursor.execute("SELECT item FROM items LIMIT %d, 1000" % start_row)
    rows = cursor.fetchall()
    if not rows:
        break
    start_row += 1000
    # Do something with rows...
However, this seems to get slower the higher start_row is.

And no, using fetchone() instead of fetchall() doesn't change anything.
Clarification:
The naive code I use to reproduce this problem looks like this:
import MySQLdb

conn = MySQLdb.connect(user="user", passwd="password", db="mydb")
cur = conn.cursor()
print "Executing query"
cur.execute("SELECT * FROM bigtable")
print "Starting loop"
row = cur.fetchone()
while row is not None:
    print ", ".join([str(c) for c in row])
    row = cur.fetchone()
cur.close()
conn.close()
On a ~700,000-row table, this code runs quickly. But on a ~9,000,000-row table it prints "Executing query" and then hangs for a long, long time. That is why it makes no difference whether I use fetchone() or fetchall().
5 Answers
#1
49
I think you have to pass cursorclass=MySQLdb.cursors.SSCursor when connecting:
MySQLdb.connect(user="user",
                passwd="password",
                db="mydb",
                cursorclass=MySQLdb.cursors.SSCursor)
The default cursor fetches all the data at once, even if you don't use fetchall.
Edit: SSCursor, or any other cursor class that supports server-side result sets; check the module docs on MySQLdb.cursors.
#2
17
The limit/offset solution runs in quadratic time because MySQL has to rescan the rows to find the offset. As you suspected, the default cursor stores the entire result set on the client, which may consume a lot of memory.
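(Aside: if a server-side cursor isn't an option, the quadratic cost can also be avoided by seeking on an indexed column instead of using an offset, so each batch resumes where the last one ended. A minimal sketch of that idea, using stdlib sqlite3 as a stand-in for MySQLdb and a hypothetical items table with an integer primary key; on MySQL the placeholders would be %s rather than ?.)

```python
import sqlite3

# In-memory demo database; sqlite3 stands in for MySQL here, but the
# SQL pattern (WHERE id > last_id ... ORDER BY id LIMIT n) is the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO items (id, item) VALUES (?, ?)",
                 [(i, "item-%d" % i) for i in range(1, 26)])

def iter_by_keyset(conn, batch=10):
    """Yield all rows by seeking on the indexed id column.

    Each batch resumes after the last id seen, so the server never has
    to rescan and discard skipped rows the way LIMIT/OFFSET does.
    """
    last_id = 0
    while True:
        cur = conn.execute(
            "SELECT id, item FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch))
        rows = cur.fetchall()
        if not rows:
            return
        for row in rows:
            yield row
        last_id = rows[-1][0]  # resume point for the next batch
```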
Instead you can use a server-side cursor, which keeps the query running and fetches results as necessary. The cursor class can be customized by supplying a default to the connection call itself, or by supplying a class to the cursor method each time.
from MySQLdb import cursors
cursor = conn.cursor(cursors.SSCursor)
But that's not the whole story. In addition to storing the MySQL result, the default client-side cursor actually fetches every row regardless. This behavior is undocumented, and very unfortunate. It means full Python objects are created for all rows, which consumes far more memory than the original MySQL result.
In most cases, a result stored on the client wrapped as an iterator would yield the best speed with reasonable memory usage. But you'll have to roll your own if you want that.
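A roll-your-own wrapper of that kind might look like the following; a minimal sketch using stdlib sqlite3 as a stand-in, since any DB-API cursor (including MySQLdb's) exposes the same fetchmany() interface:

```python
import sqlite3

def iter_rows(cursor, chunk=1000):
    """Wrap any DB-API cursor in an iterator that pulls rows in chunks.

    fetchmany() keeps per-call overhead low, while only `chunk` rows'
    worth of Python objects are alive at any one time.
    """
    while True:
        rows = cursor.fetchmany(chunk)
        if not rows:          # exhaustion is an empty list, not None
            return
        for row in rows:
            yield row

# Demo with an in-memory sqlite3 database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5000)])
cur = conn.execute("SELECT n FROM t ORDER BY n")
total = sum(n for (n,) in iter_rows(cur, chunk=256))
print(total)  # 12497500, i.e. sum of 0..4999
```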
#3
7
Did you try this version of fetchone? Or something different?
row = cursor.fetchone()
while row is not None:
    # process
    row = cursor.fetchone()
Also, did you try this?
rows = cursor.fetchmany(size=1)
while rows:  # fetchmany returns [] when exhausted, not None
    # process
    rows = cursor.fetchmany(size=1)
Not all drivers support these, so you may have gotten errors or found them too slow.
Edit.
When it hangs on execute, you're waiting for the database. That's not a row-by-row Python thing; that's a MySQL thing.
MySQL prefers to fetch all rows as part of its own cache management. This is turned off by providing a fetch_size of Integer.MIN_VALUE (-2147483648).
The question is, what part of the Python DBAPI becomes the equivalent of the JDBC fetch_size?
I think it might be the arraysize attribute of the cursor. Try
cursor.arraysize=-2**31
And see if that forces MySQL to stream the result set instead of caching it.
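(For what it's worth: per DB-API 2.0, arraysize only sets the default number of rows fetchmany() returns per call; whether a driver also treats it as a streaming hint is driver-specific. A quick check of the fetchmany() behavior with stdlib sqlite3:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

cur = conn.cursor()
cur.arraysize = 3          # default batch size for fetchmany()
cur.execute("SELECT n FROM t")
batch = cur.fetchmany()    # no size given -> uses cur.arraysize
print(len(batch))  # 3
```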
#4
2
I found the best results mixing a bit from some of the other answers.
This included setting cursorclass=MySQLdb.cursors.SSDictCursor (for MySQLdb) or pymysql.cursors.SSDictCursor (for PyMySQL) as part of the connection settings. This will let the server hold the query/results (the "SS" stands for server side, as opposed to the default cursor, which brings the results client side) and build a dictionary out of each row (e.g. {'id': 1, 'name': 'Cookie Monster'}).
Then, to loop through the rows, there was an infinite loop in both Python 2.7 and 3.4 caused by while rows is not None, because even when cur.fetchmany(size=10000) was called and there were no results left, the method returned an empty list ([]) instead of None.
Actual example:
import pymysql

query = """SELECT * FROM my_table"""
conn = pymysql.connect(host=MYSQL_CREDENTIALS['host'],
                       user=MYSQL_CREDENTIALS['user'],
                       passwd=MYSQL_CREDENTIALS['passwd'],
                       charset='utf8',
                       cursorclass=pymysql.cursors.SSDictCursor)
cur = conn.cursor()
results = cur.execute(query)
rows = cur.fetchmany(size=100)
while rows:
    for row in rows:
        process(row)
    rows = cur.fetchmany(size=100)
cur.close()
conn.close()
#5
1
Try to use MySQLdb.cursors.SSDictCursor
con = MySQLdb.connect(host=host,
                      user=user,
                      passwd=pwd,
                      charset=charset,
                      port=port,
                      cursorclass=MySQLdb.cursors.SSDictCursor)
cur = con.cursor()
cur.execute("select f1, f2 from table")
for row in cur:
    print row['f1'], row['f2']