Python/Django/MySQL "Incorrect string value" error

Time: 2022-06-08 00:46:13

I'm running a Django 1.4.2/Python 2.7.3/MySQL 5.5.28 site. One of the features of the site is that the admin can send an email to the server, which calls a Python script via procmail that parses the email and inserts it into the DB. I maintain two versions of the site: a development site and a production site. The two sites use separate but identical virtualenvs (I even deleted them both and reinstalled all packages just to make sure).

I'm experiencing a weird issue. The exact same script succeeds on the dev server and fails on the production server. It fails with this error:

...django/db/backends/mysql/base.py:114: Warning: Incorrect string value: '\x92t kno...' for column 'message' at row 1

I'm well aware of the Unicode issues Django has, and I know there are a ton of questions here on SO about this error, but I made sure to set up the database as UTF-8 from the beginning:

mysql> show variables like "character_set_database";
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| character_set_database | utf8  |
+------------------------+-------+
1 row in set (0.00 sec)

mysql> show variables like "collation_database";
+--------------------+-----------------+
| Variable_name      | Value           |
+--------------------+-----------------+
| collation_database | utf8_general_ci |
+--------------------+-----------------+
1 row in set (0.00 sec)

Additionally, I know that each column can have its own charset, but the message column is indeed UTF-8:

mysql> show full columns in listserv_post;
+------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| Field      | Type         | Collation       | Null | Key | Default | Extra          | Privileges                      | Comment |
+------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| id         | int(11)      | NULL            | NO   | PRI | NULL    | auto_increment | select,insert,update,references |         |
| thread_id  | int(11)      | NULL            | NO   | MUL | NULL    |                | select,insert,update,references |         |
| timestamp  | datetime     | NULL            | NO   |     | NULL    |                | select,insert,update,references |         |
| from_name  | varchar(100) | utf8_general_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| from_email | varchar(75)  | utf8_general_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| message    | longtext     | utf8_general_ci | NO   |     | NULL    |                | select,insert,update,references |         |
+------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
6 rows in set (0.00 sec)

Does anyone have any idea why I'm getting this error? Why is it happening under the production config but not the dev config?

Thanks!

[edit 1]
To be clear, the data are the same as well. I send a single email to the server, and procmail sends it off. This is what the .procmailrc looks like:

VERBOSE=off
:0
{
    :0c
    | <path>/dev/ein/scripts/process_new_mail.py dev > outputdev

    :0
    | <path>/prd/ein/scripts/process_new_mail.py prd > outputprd
}

There are 2 copies of process_new_mail.py, but that's just because it's version controlled so that I can maintain two separate environments. If I diff the two output files (which contain the message received), they're identical.

[edit 2] I actually just discovered that both dev and prd configs are failing. The difference is that the dev config fails silently (maybe having to do with the DEBUG setting?). The problem is that there are some unicode characters in one of the messages, and Django is choking on them for some reason. I'm making progress....

I've tried editing the code to explicitly encode the message as ASCII and UTF-8, but it's still not working. I'm getting closer, though.
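
For context, the `\x92` in the warning is consistent with a Windows-1252 right single quotation mark (U+2019) that was stored as a raw byte and never decoded. On its own, the byte 0x92 is not valid UTF-8, so re-encoding the message does not help; the bytes first have to be decoded with the charset the sender actually used. A minimal sketch of that difference follows. It assumes the body was Windows-1252 (the question does not confirm the real charset), the sample string is made up, and it uses Python 3 syntax rather than the Python 2.7 the site runs on:

```python
# Assumption: the body arrived as Windows-1252 bytes; the sample text is hypothetical.
raw = b"I don\x92t know"

# 0x92 is a UTF-8 continuation byte, so it cannot appear on its own.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the sender's charset yields proper text ...
text = raw.decode("cp1252")

# ... which can then be encoded as valid UTF-8 for the database.
utf8_bytes = text.encode("utf-8")
```

The point of the sketch is the order of operations: decode with the source charset first, and only then encode to UTF-8.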

1 Solution

#1


I fixed it! The problem was that I wasn't parsing the email correctly with respect to the charsets. My fixed email parsing code is adapted from the two posts credited in the comments below:

#get the charset of an email
#courtesy http://ginstrom.com/scribbles/2007/11/19/parsing-multilingual-email-with-python/
def get_charset(message, default='ascii'):
    if message.get_content_charset():
        return message.get_content_charset()

    if message.get_charset():
        #get_charset() returns a Charset object; unicode() needs the charset name as a string.
        return str(message.get_charset())

    return default

#courtesy https://*.com/questions/7166922/extracting-the-body-of-an-email-from-mbox-file-decoding-it-to-plain-text-regard
def get_body(message):
    body = None

    #Walk through the parts of the email to find the text body.
    if message.is_multipart():
        for part in message.walk():
            #If part is multipart, walk through the subparts.
            if part.is_multipart():
                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        #Get the subpart payload (i.e., the message body).
                        charset = get_charset(subpart, get_charset(message))
                        body = unicode(subpart.get_payload(decode=True), charset)
            #Part isn't multipart so get the email body.
            elif part.get_content_type() == 'text/plain':
                charset = get_charset(part, get_charset(message))
                body = unicode(part.get_payload(decode=True), charset)
    #If this isn't a multi-part message then get the payload (i.e., the message body).
    elif message.get_content_type() == 'text/plain':
        charset = get_charset(message)
        body = unicode(message.get_payload(decode=True), charset)

    return body
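
On Python 3 the same idea can be sketched more compactly. This is not the code from the answer above, only an illustration under stated assumptions: `extract_text_body` is a hypothetical helper name, the sample message is made up, and `errors="replace"` is a choice made for this sketch. It relies on `walk()` visiting the message itself and every nested part, so no separate multipart branch is needed:

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def extract_text_body(message, default="ascii"):
    """Return the first text/plain part decoded to str, or None (hypothetical helper)."""
    # walk() yields the message itself and then every nested part,
    # so it also covers non-multipart messages.
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        # Bytes after undoing base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or default
        # errors="replace" keeps one bad byte from aborting the whole parse.
        return payload.decode(charset, errors="replace")
    return None


# Build a sample multipart message whose parts are Windows-1252 encoded.
sample = MIMEMultipart("alternative")
sample.attach(MIMEText("I don\u2019t know", "plain", "windows-1252"))
sample.attach(MIMEText("<p>I don\u2019t know</p>", "html", "windows-1252"))

# Round-trip through bytes, as a script reading mail from procmail would see it.
parsed = email.message_from_bytes(sample.as_bytes())
body = extract_text_body(parsed)
```

Unlike `get_body` above, this sketch returns the first `text/plain` part it finds rather than the last one.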

Thanks very much for the help!
