pgSQL "ERROR: invalid byte sequence for encoding "UTF8": 0x86" when parsing e-mail

Date: 2022-11-28 11:46:00

I am migrating my ticket system to pgSQL. I allow e-mail replies: PHP parses each e-mail into its components, and the message is then stored in a pgSQL table named inbox.

The first e-mail was parsed and saved successfully, with no errors. Now I am receiving this error message:

invalid byte sequence for encoding "UTF8": 0x86

I've confirmed that the database is using UTF8 encoding:

  • SHOW SERVER_ENCODING returns UTF8.
  • SHOW CLIENT_ENCODING originally wasn't UTF8, so I set it to UTF8.

The error persists.

email_queue.php contains various PHP classes and functions to receive and send e-mails. The command "file email_queue.php" gives the result:

email_queue.php: PHP script, UTF-8 Unicode text, with very long lines

email_queue_receive.php uses those classes and functions to receive e-mails. It includes email_queue.php for that functionality. The command "file email_queue_receive.php" gives the result:

email_queue_receive.php: PHP script, ASCII text

From the searches I've done, ASCII text is also valid UTF-8.

I haven't yet found a thread specific to this error as a result of parsing e-mail.

2 Answers

#1


(Daniel's right, just elaborating):

0x86 can't be the first byte in a utf-8 sequence.
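To see why: in UTF-8, any byte of the form 10xxxxxx (0x80 to 0xBF) is a continuation byte, which is only valid after a lead byte. Here is a minimal Python sketch of that rule; the helper name `classify_utf8_byte` is made up for illustration and is not part of any library.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a single byte value by the role it can play in UTF-8."""
    if b <= 0x7F:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx: only valid after a lead byte
    if 0xC2 <= b <= 0xF4:
        return "lead"          # starts a 2-, 3- or 4-byte sequence
    return "invalid"           # 0xC0, 0xC1 and 0xF5..0xFF never occur in UTF-8


print(classify_utf8_byte(0x86))

# Python's strict decoder rejects a lone 0x86 for the same reason.
try:
    bytes([0x86]).decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

So a 0x86 at the start of a character means the bytes before it are missing or the data is in some other encoding.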

Possible explanations include:

  • The email is not UTF-8 encoded
  • The email is UTF-8 encoded, but the UTF-8 in the email is malformed
  • A string is being cut at an invalid byte offset within a UTF-8 sequence by substring code that is not UTF-8 aware
  • Your app is mishandling encodings in the MIME parts
  • ...
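The third explanation is easy to reproduce. The sketch below (Python, with made-up sample strings and a hypothetical helper `is_valid_utf8`) shows how slicing by byte offset can split a multi-byte character, and how the tail of a split character can begin with exactly 0x86. In PHP the same distinction exists between `substr`, which counts bytes, and `mb_substr`, which counts characters.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "café"
raw = text.encode("utf-8")               # 5 bytes for 4 characters

cut_by_bytes = raw[:4]                   # splits the 2-byte sequence for "é"
cut_by_chars = text[:3].encode("utf-8")  # cutting on characters is safe

# "Ć" (U+0106) encodes as C4 86, so dropping its first byte leaves a lone 0x86.
tail = "Ć".encode("utf-8")[1:]

print(is_valid_utf8(raw), is_valid_utf8(cut_by_bytes), is_valid_utf8(cut_by_chars))
print(tail, is_valid_utf8(tail))
```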

In general, you're going to have problems inserting email into PostgreSQL because PostgreSQL is very strict about text encoding correctness, whereas mail clients produce and accept all sorts of horrible garbage. You will need to either sanitize the incoming mail (using encoding guessing, stripping suspect parts/chars, etc.) or store it in raw byte sequence form as bytea.
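As a rough illustration of the "encoding guessing" option, here is a minimal Python sketch. The function name `to_utf8_text` and the choice of Windows-1252 as the fallback are assumptions made for this example; a wrong guess still yields valid text, just with the wrong characters. In Windows-1252, byte 0x86 happens to be the dagger sign.

```python
from typing import Optional


def to_utf8_text(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw bytes to str, guessing when the declared charset fails."""
    # Try the declared charset first, then UTF-8, then Windows-1252 as a
    # guess for legacy Western mail (an assumption, not a universal rule).
    for enc in (declared, "utf-8", "cp1252"):
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next candidate
    # Last resort: keep what decodes and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace")


print(to_utf8_text(b"price \x86 10", declared="utf-8"))
```

The returned string can be re-encoded as UTF-8 without error, which is what a text column in a UTF8 database requires.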

I strongly recommend storing as bytea because:

  • One MIME message can contain parts in different encodings.
  • MIME parts such as email attachments can contain null bytes if they are sent without a Content-Transfer-Encoding, though most clients won't do so and will base64-encode them instead. PostgreSQL's text type cannot store null bytes.

Of course, that depends a lot on what you're processing. You might prefer to store as text and discard parts that can't be decoded using their declared text encoding.
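A minimal sketch of that second approach, using Python's standard `email` package: walk the MIME parts, decode each text part with its own declared charset, and drop the parts that fail. The function name `decodable_text_parts` and the sample message are invented for illustration; a real pipeline would probably log the discarded parts rather than silently skipping them.

```python
import base64
from email import message_from_bytes
from typing import List


def decodable_text_parts(raw_message: bytes) -> List[str]:
    """Return the text parts that decode cleanly with their declared charset."""
    msg = message_from_bytes(raw_message)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "us-ascii"
        try:
            text = payload.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # declared charset is wrong or unknown: discard the part
        texts.append(text.replace("\x00", ""))  # text columns cannot hold NUL
    return texts


def b64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")


# Two parts that both claim UTF-8; the second contains a stray 0x86 byte.
lines = [
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="b"',
    "",
    "--b",
    'Content-Type: text/plain; charset="utf-8"',
    "Content-Transfer-Encoding: base64",
    "",
    b64("café".encode("utf-8")),
    "--b",
    'Content-Type: text/plain; charset="utf-8"',
    "Content-Transfer-Encoding: base64",
    "",
    b64(b"bad \x86 byte"),
    "--b--",
    "",
]
sample = "\r\n".join(lines).encode("ascii")

parts = decodable_text_parts(sample)
print(parts)
```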

#2


PostgreSQL is strict on encoding, but the email infrastructure is not. As the doc for PHP's iconv_mime_decode indicates:

ICONV_MIME_DECODE_STRICT: If set, the given header is decoded in full conformance with the standards defined in RFC2047. This option is disabled by default because there are a lot of broken mail user agents that don't follow the specification and don't produce correct MIME headers.

There are also MIME parts in email bodies that violate the character set declared in their Content-Type header. SMTP servers will accept an invalid mail as long as it can be routed to a recipient, so senders are never made aware of the problem; it's the recipient that has to deal with it.

As a consequence, any part of an email message that has to be inserted into a database text field must be sanitized beforehand. See, for example, "Remove non-utf8 characters from string" for how to do it.
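The simplest form of such sanitizing is to drop whatever does not decode. The Python sketch below is only an illustration of the idea, not the method from the linked thread, and the function name `strip_invalid_utf8` is hypothetical. It also removes NUL characters, which are valid UTF-8 but which PostgreSQL text fields cannot store.

```python
def strip_invalid_utf8(raw: bytes) -> str:
    """Drop bytes that are not valid UTF-8, then drop NUL characters."""
    text = raw.decode("utf-8", errors="ignore")  # silently skips bad bytes
    return text.replace("\x00", "")


print(strip_invalid_utf8(b"abc\x86def"))
```

This loses information, so it suits display fields better than an archive of the original message.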
