I am translating a code from perl to python. Even if it works exactly the same, there is a part of the code that is 5x slower in python than in perl and I cannot figure out why.
我正在将代码从perl转换为python。即使它的工作方式完全相同,但有一部分代码在python中比在perl中慢5倍,我无法弄清楚原因。
Both perl and python are in the same machine, as well as the mysql database.
perl和python都在同一台机器上,以及mysql数据库。
The code queries a db to download all columns of a table and then process each row. There are more than 5 million rows to process and the big issue is in retrieving the data from the database to the python processing.
代码查询数据库以下载表的所有列,然后处理每一行。有超过500万行要处理,最大的问题是从数据库中检索数据到python处理。
Here I attach the two code samples: Python:
这里我附上两个代码示例:Python:
import os
import mysql.connector **<--- import mySqlDb**
import time
outDict = dict()
## DB parameters
db = mysql.connector.connect **<----- mySqlDb.connect( ...**
(host=dbhost,
user=username, # your username
passwd=passw, # your password
db=database) # name of the data base
cur = db.cursor(prepared=True)
sql = "select chr,pos,lengthofrepeat,copyNum,region from db.Table_simpleRepeat;"
cur.execute(sql)
print('\t eDiVa public omics start')
s = time.time()
sz = 1000
rows = cur.fetchall()
for row in rows:
## process out dict
print time.time() - s
cur.close()
db.close()
While here comes the Perl equivalent script:
虽然这里有Perl等效脚本:
use strict;
use Digest::MD5 qw(md5);
use DBI;
use threads;
use threads::shared;
my $dbh = DBI->connect('dbi:mysql:'.$database.';host='.$dbhost.'',$username,$pass)
or die "Connection Error!!\n";
my $sql = "select chr,pos,lengthofrepeat,copyNum,region from db.Table_simpleRepeat\;";
## prepare statement and query
my $stmt = $dbh->prepare($sql);
$stmt->execute or die "SQL Error!!\n";
my $c = 0;
#process query result
while (my @res = $stmt->fetchrow_array)
{
$edivaStr{ $res[0].";".$res[1] } = $res[4].",".$res[2];
$c +=1;
}
print($c."\n");
## close DB connection
$dbh->disconnect();
The runtime for these two scripts is:
这两个脚本的运行时是:
- ~40s for the Perl script
- Perl脚本大约40秒
- ~200s for the Python script
- Python脚本大约200s
I cannot figure out why this happens [I tried using fetchone() or fetchmany() to see if there are memory issues but the runtime at most reduces 10% from the 200s].
我无法弄清楚为什么会发生这种情况[我尝试使用fetchone()或fetchmany()来查看是否存在内存问题,但运行时最多只能从200秒减少10%]。
My main problem is understanding why there is such a relevant performance difference between the two functionally equivalent code blocks.
我的主要问题是理解为什么在两个功能等效的代码块之间存在这样的相关性能差异。
Any idea about how can I verify what is happening would be greatly appreciated.
任何关于如何验证正在发生的事情的想法将不胜感激。
Thanks!
谢谢!
UPDATE ABOUT SOLUTION
Peeyush'comment could be an answer and I'd like him to post it because it allowed me to find a solution.
Peeyush的评论可能是一个答案,我希望他发布它,因为它让我找到了解决方案。
The problem is the python connector. I just changed that for mySqlDb module which is a C compiled module. That made the python code slightly faster than the perl code.
问题是python连接器。我刚刚为mySqlDb模块更改了它,这是一个C编译模块。这使得python代码比perl代码略快。
I added the changes in the python code with a <---- "" to show how easy it has been to gain performance.
我在python代码中添加了一个带有<----“”的更改,以显示获得性能的难易程度。
4 个解决方案
#1
4
the cursor.fetchall
means you load all your data in memory at once, instead of doing it slowly when needed.
cursor.fetchall意味着您一次将所有数据加载到内存中,而不是在需要时慢慢加载。
Replace
更换
row = cur.fetchall()
for row in rows:
by
通过
for row in cur:
#2
2
The problem is the python connector. I just changed that for mySqlDb module which is a C compiled module. That made the python code slightly faster than the perl code.
问题是python连接器。我刚刚为mySqlDb模块更改了它,这是一个C编译模块。这使得python代码比perl代码略快。
I added the changes in the python code with a <---- "" to show how easy it has been to gain performance
我在python代码中添加了一个带有<----“”的更改,以显示获得性能的难易程度
#3
0
I encounter the same problem. With Python cx_Oracle, here's my environment performance stats -- Python takes very long to connect to Oracle DB.
我遇到了同样的问题。使用Python cx_Oracle,这是我的环境性能统计数据 - Python需要很长时间才能连接到Oracle DB。
- connect to DB, elaps:0.38108
- 连接到DB,经过:0.38108
- run query, elaps:0.00092
- 运行查询,elaps:0.00092
- get filename from table, elaps:8e-05
- 从表中获取文件名,elaps:8e-05
- run query to read BLOB, elaps:0.00058
- 运行查询以读取BLOB,elaps:0.00058
- decompress data and write to file, elaps:0.00187
- 解压缩数据并写入文件,经过:0.00187
- close DB connection, elaps:0.00009
- 关闭DB连接,经过:0.00009
- Over all, elaps:0.38476
- 总之,流逝:0.38476
- same function in Perl, elaps:0.00213
- Perl中的相同函数,elaps:0.00213
#4
-1
Python for
loops are quite slow. You should look into an alternative to treat your query.
From python wiki : https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops
Python for循环非常慢。您应该研究一种替代方法来处理您的查询。来自python wiki:https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops
#1
4
the cursor.fetchall
means you load all your data in memory at once, instead of doing it slowly when needed.
cursor.fetchall意味着您一次将所有数据加载到内存中,而不是在需要时慢慢加载。
Replace
更换
row = cur.fetchall()
for row in rows:
by
通过
for row in cur:
#2
2
The problem is the python connector. I just changed that for mySqlDb module which is a C compiled module. That made the python code slightly faster than the perl code.
问题是python连接器。我刚刚为mySqlDb模块更改了它,这是一个C编译模块。这使得python代码比perl代码略快。
I added the changes in the python code with a <---- "" to show how easy it has been to gain performance
我在python代码中添加了一个带有<----“”的更改,以显示获得性能的难易程度
#3
0
I encounter the same problem. With Python cx_Oracle, here's my environment performance stats -- Python takes very long to connect to Oracle DB.
我遇到了同样的问题。使用Python cx_Oracle,这是我的环境性能统计数据 - Python需要很长时间才能连接到Oracle DB。
- connect to DB, elaps:0.38108
- 连接到DB,经过:0.38108
- run query, elaps:0.00092
- 运行查询,elaps:0.00092
- get filename from table, elaps:8e-05
- 从表中获取文件名,elaps:8e-05
- run query to read BLOB, elaps:0.00058
- 运行查询以读取BLOB,elaps:0.00058
- decompress data and write to file, elaps:0.00187
- 解压缩数据并写入文件,经过:0.00187
- close DB connection, elaps:0.00009
- 关闭DB连接,经过:0.00009
- Over all, elaps:0.38476
- 总之,流逝:0.38476
- same function in Perl, elaps:0.00213
- Perl中的相同函数,elaps:0.00213
#4
-1
Python for
loops are quite slow. You should look into an alternative to treat your query.
From python wiki : https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops
Python for循环非常慢。您应该研究一种替代方法来处理您的查询。来自python wiki:https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops