pandas to_sql truncates my data

Time: 2021-05-08 23:51:06

I was using df.to_sql(con=con_mysql, name='testdata', if_exists='replace', flavor='mysql') to export a data frame into MySQL. However, I discovered that columns with long string content (such as url) are truncated to 63 characters. I received the following warning from IPython notebook when I exported:

/usr/local/lib/python2.7/site-packages/pandas/io/sql.py:248: Warning: Data truncated for column 'url' at row 3
  cur.executemany(insert_query, data)

There were other warnings in the same style for different rows.

Is there anything I can tweak to export the full data properly? I could set up the correct schema in MySQL first and then export to it, but I am hoping a tweak can make it work straight from Python.

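For reference, this is roughly what the schema-first workaround would look like (a sketch only; the table, column width, and connection details are illustrative):

import MySQLdb

con_mysql = MySQLdb.connect(host='localhost', user='user', passwd='password', db='testdb')
cur = con_mysql.cursor()
# pre-create the table with a column wide enough for long URLs
cur.execute("CREATE TABLE IF NOT EXISTS testdata (url VARCHAR(2048))")
con_mysql.commit()

# append into the pre-built schema instead of letting pandas recreate the table
df.to_sql(con=con_mysql, name='testdata', if_exists='append', flavor='mysql')
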
2 Answers

#1 (9 votes)

If you are using pandas 0.13.1 or older, this limit of 63 characters is indeed hardcoded, because of this line in the code: https://github.com/pydata/pandas/blob/v0.13.1/pandas/io/sql.py#L278

As a workaround, you could monkeypatch that get_sqltype function:

import numpy as np
from datetime import datetime

from pandas.io import sql

def get_sqltype(pytype, flavor):
    sqltype = {'mysql': 'VARCHAR (63)',    # <-- change this value to something sufficiently higher
               'sqlite': 'TEXT'}

    if issubclass(pytype, np.floating):
        sqltype['mysql'] = 'FLOAT'
        sqltype['sqlite'] = 'REAL'
    if issubclass(pytype, np.integer):
        sqltype['mysql'] = 'BIGINT'
        sqltype['sqlite'] = 'INTEGER'
    if issubclass(pytype, np.datetime64) or pytype is datetime:
        sqltype['mysql'] = 'DATETIME'
        sqltype['sqlite'] = 'TIMESTAMP'
    if pytype is datetime.date:
        sqltype['mysql'] = 'DATE'
        sqltype['sqlite'] = 'TIMESTAMP'
    if issubclass(pytype, np.bool_):
        sqltype['sqlite'] = 'INTEGER'

    return sqltype[flavor]

# replace the module-level function that pandas' legacy writer looks up
sql.get_sqltype = get_sqltype

Then using your original code should just work:

df.to_sql(con=con_mysql, name='testdata', if_exists='replace', flavor='mysql')
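
As a quick sanity check that the monkeypatch took effect, you can call the patched function directly (a sketch; str stands in for the Python type of a string column):

print(sql.get_sqltype(str, 'mysql'))  # should print the widened VARCHAR, not 'VARCHAR (63)'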

Starting from pandas 0.14, the sql module uses sqlalchemy under the hood, and strings are converted to the sqlalchemy TEXT type, which is converted to the mysql TEXT type (and not VARCHAR); this also allows you to store strings longer than 63 characters:

engine = sqlalchemy.create_engine('mysql://scott:tiger@localhost/foo')
df.to_sql('testdata', engine, if_exists='replace')
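
In still later pandas versions (0.15.2 and up, if I recall correctly), to_sql also accepts a dtype argument to override the type per column, in case you prefer an explicit VARCHAR over TEXT on the MySQL side; a sketch using the same engine as above:

import sqlalchemy

# store the url column as VARCHAR(1024) instead of the default TEXT
df.to_sql('testdata', engine, if_exists='replace',
          dtype={'url': sqlalchemy.types.VARCHAR(1024)})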

The issue only remains if you still use a DBAPI connection instead of an sqlalchemy engine, but that option is deprecated, and it is recommended to provide an sqlalchemy engine to to_sql.

#2 (5 votes)

Inspired by @joris's answer, I decided to hardcode the change into the pandas source and recompile.

cd /usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io
sudo pico sql.py

Changed line 871 from

'mysql': 'VARCHAR (63)',

to

'mysql': 'VARCHAR (255)',

Then recompiled just that file:

sudo python -m py_compile sql.py

Restarted my script, and the _to_sql() function wrote a table. (I expected that the recompile would have broken pandas, but it seems not to have.)

Here is my script that writes a dataframe to MySQL, for reference:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sqlalchemy
from sqlalchemy import create_engine

df = pd.read_csv('10k.csv')
## ... dataframe munging

df = df.where(pd.notnull(df), None)  # workaround for NaN bug: NaN becomes None, which MySQL stores as NULL
engine = create_engine('mysql://user:password@localhost:3306/dbname')
con = engine.connect().connection    # extract the raw DBAPI connection from the engine
df.to_sql("issues", con, 'mysql', if_exists='replace', index=True, index_label=None)
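
Note that, per the answer above, with pandas 0.14+ you can also pass the engine itself to to_sql instead of extracting the raw DBAPI connection; that routes through the sqlalchemy code path (strings become TEXT) and avoids the VARCHAR (63) source edit altogether. A sketch with the same engine:

# no raw connection needed; the sqlalchemy-based writer maps strings to TEXT
df.to_sql('issues', engine, if_exists='replace', index=True)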
