Invalid byte sequence for encoding "UTF8"

Date: 2022-09-06 11:45:53

I'm trying to import some data into my database. So I've created a temporary table,

create temporary table tmp(pc varchar(10), lat decimal(18,12), lon decimal(18,12), city varchar(100), prov varchar(2));

And now I'm trying to import the data,

 copy tmp from '/home/mark/Desktop/Canada.csv' delimiter ',' csv

But then I get the error,

ERROR:  invalid byte sequence for encoding "UTF8": 0xc92c

How do I fix that? Do I need to change the encoding of my entire database (if so, how?) or can I change just the encoding of my tmp table? Or should I attempt to change the encoding of the file?

15 answers

#1


83  

If you need to store UTF8 data in your database, you need a database that accepts UTF8. You can check the encoding of your database in pgAdmin. Just right-click the database, and select "Properties".

But that error seems to be telling you there's some invalid UTF8 data in your source file. That means the copy command is treating your file as UTF8: it does not inspect the file, it assumes the file is in the client encoding unless told otherwise.
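
To see why the bytes in the error message (0xc9 followed by 0x2c) are rejected, it can help to decode them by hand. This is only a small Python sketch, not something from the original question; it shows that the pair is not valid UTF-8 but reads fine as Latin-1, which hints at the file's real encoding:

```python
# The two bytes reported in the error message: 0xc9 followed by 0x2c.
bad = bytes([0xC9, 0x2C])

# In UTF-8, 0xC9 starts a two-byte sequence, so the next byte must be a
# continuation byte in the range 0x80-0xBF. 0x2C (a comma) is not.
try:
    bad.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError as exc:
    valid_utf8 = False
    print("not valid UTF-8:", exc.reason)

# In Latin-1 every byte maps to exactly one character, so the same bytes
# decode cleanly: an accented capital E followed by the CSV delimiter.
as_latin1 = bad.decode("latin-1")
print(repr(as_latin1))
```

So the likely story is a single-byte encoded accented letter sitting right before a field separator.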

If you're running under some variant of Unix, you can check the encoding (more or less) with the file utility.

$ file yourfilename
yourfilename: UTF-8 Unicode English text

(I think that will work on Macs in the terminal, too.) Not sure how to do that under Windows.

If you use that same utility on a file that came from Windows systems (that is, a file that's not encoded in UTF8), it will probably show something like this:

$ file yourfilename
yourfilename: ASCII text, with CRLF line terminators
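
If the file utility is not available (on Windows, for instance), a rough check can be done in Python. The function below is a hypothetical helper of my own, far cruder than file: it only separates pure ASCII, valid UTF-8, and "something else", and notes CRLF line endings:

```python
def guess_text_kind(data: bytes) -> str:
    """Rough classification of raw bytes; a crude stand-in for file(1)."""
    try:
        data.decode("ascii")
        kind = "ASCII"
    except UnicodeDecodeError:
        try:
            data.decode("utf-8")
            kind = "UTF-8"
        except UnicodeDecodeError:
            # Cannot tell which 8-bit encoding it is, only that it is not UTF-8.
            kind = "not UTF-8 (some other encoding)"
    if b"\r\n" in data:
        kind += ", with CRLF line terminators"
    return kind
```

To use it on a real file, read the file in binary mode first, for example `guess_text_kind(open("yourfilename", "rb").read())`.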

If things stay weird, you might try to convert your input data to a known encoding, to change your client's encoding, or both. (We're really stretching the limits of my knowledge about encodings.)

You can use the iconv utility to change encoding of the input data.

iconv -f original_charset -t utf-8 originalfile > newfile
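
Where iconv is not installed, the same conversion can be sketched in Python. The function name and its parameters are my own, and the source encoding still has to be known or guessed, exactly as with original_charset above:

```python
def convert_to_utf8(src_path: str, dst_path: str, src_encoding: str) -> None:
    """Re-encode a text file to UTF-8, raising if a byte cannot be decoded."""
    # newline="" keeps the line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

For example, `convert_to_utf8("originalfile", "newfile", "latin-1")` would play the role of the iconv command, assuming the input really is Latin-1.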

You can change psql (the client) encoding following the instructions on Character Set Support. On that page, search for the phrase "To enable automatic character set conversion".

#2


36  

psql=# copy tmp from '/path/to/file.csv' with delimiter ',' csv header encoding 'windows-1251';

Adding the encoding option worked in my case.

#3


9  

Apparently I can just set the encoding on the fly,

 set client_encoding to 'latin1'

And then re-run the query. Not sure what encoding I should be using though.

latin1 made the characters legible, but most of the accented characters were in upper-case where they shouldn't have been. I assumed this was due to a bad encoding, but I think it's actually the data that was just bad. I ended up keeping the latin1 encoding, but pre-processed the data to fix the casing issues.

#4


5  

This error means that the encoding of the records in the file differs from the encoding of the connection. In that case iconv may return an error too, sometimes even with the //IGNORE flag:

iconv -f ASCII -t utf-8//IGNORE < b.txt > /a.txt

iconv: illegal input sequence at position (some number)

The trick is to find the incorrect characters and replace them. On Linux you can do this with the vim editor:

Open the file with vim (your text file), press the ESC key, and type ":goto (number returned by iconv)".

To find non-ASCII characters you can use the following command:

grep --color='auto' -P "[\x80-\xFF]"
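
As an alternative to hunting with grep, the offending bytes can be located programmatically. The following is a minimal Python sketch (the function name and the limit parameter are my own) that reports the byte offsets where UTF-8 decoding fails. Offsets are counted from 0, so add 1 before using one with vim's :goto:

```python
def find_invalid_utf8(data: bytes, limit: int = 10) -> list:
    """Return byte offsets at which UTF-8 decoding fails (at most `limit`)."""
    offsets = []
    pos = 0
    while pos < len(data) and len(offsets) < limit:
        try:
            data[pos:].decode("utf-8")
            break  # everything from pos onwards is valid UTF-8
        except UnicodeDecodeError as exc:
            bad_at = pos + exc.start  # exc.start is relative to the slice
            offsets.append(bad_at)
            pos = bad_at + 1  # skip the offending byte and keep scanning
    return offsets
```

Read the file in binary mode and pass the bytes in, for example `find_invalid_utf8(open("b.txt", "rb").read())`.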

Once you have removed the incorrect characters, check whether you still need to convert the file at all: the problem is probably already solved.

#5


4  

It depends on what type of machine/encoding generated your import file.

If you're getting it from an English or Western European version of Windows, your best bet is probably setting it to 'WIN1252'. If you are getting it from a different source, consult the list of character encodings here:

http://www.postgresql.org/docs/8.3/static/multibyte.html

If you're getting it from a Mac, you may have to run it through the "iconv" utility first to convert it from MacRoman to UTF-8.
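
When the origin of the file is unknown, one way to narrow it down is to decode a suspicious byte under a few candidate encodings and see which result looks right. A small Python sketch, using the 0xc9 byte from the question; the candidate list uses Python codec names and is only my assumption about what is worth trying:

```python
sample = bytes([0xC9])  # the byte from the error message in the question

# Candidate source encodings, as Python codec names. Which ones are worth
# trying depends on where the file came from.
candidates = ["latin-1", "cp1252", "mac_roman", "cp850"]

decoded = {name: sample.decode(name) for name in candidates}
for name, text in decoded.items():
    print(f"{name:10} -> {text!r}")
```

For this byte, latin-1 and cp1252 both give an accented capital E while the other two give unrelated symbols, which fits a file exported from Western European Windows.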

#6


4  

Well I was facing the same problem. And what solved my problem is this:

In Excel, click Save As. Under "Save as type", choose .csv. Click Tools, then choose Web Options from the drop-down list. Under the Encoding tab, save the document as Unicode (UTF-8). Click OK and save the file. Done!

#7


2  

Follow the steps below to solve this issue in pgAdmin:

  1. SET client_encoding = 'ISO_8859_5';

  2. COPY tablename(column names) FROM 'D:/DB_BAK/csvfilename.csv' WITH DELIMITER ',' CSV ;

#8


2  

I had the same problem, and found a nice solution here: http://blog.e-shell.org/134

This is caused by a mismatch in your database encodings, surely because the database from where you got the SQL dump was encoded as SQL_ASCII while the new one is encoded as UTF8. [...] Recode is a small tool from the GNU project that lets you change the encoding of a given file on the fly.

So I just recoded the dumpfile before playing it back:

postgres> gunzip -c /var/backups/pgall_b1.zip | recode iso-8859-1..u8 | psql test

On Debian or Ubuntu systems, recode can be installed as a package.

#9


1  

You can use sed to replace the backslash character with another character, for example a pipe.

sed -i -- 's/\\/|/g' filename.txt

#10


1  

copy tablename from 'filepath\filename' DELIMITERS '=' ENCODING 'WIN1252';

You can try this to get past the UTF8 encoding error.

#11


1  

If you are OK with discarding non-convertible characters, you can use iconv's -c flag:

iconv -c -t utf8 filename.csv > filename.utf8.csv

and then copy the converted file into your table.
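
To make the cost of discarding characters concrete, here is a short Python sketch of the same idea. It is an illustration with made-up sample data, not a drop-in replacement for iconv -c. Note how the accented letter simply disappears:

```python
raw = b"MONTR\xc9AL,QC\n"  # sample Latin-1 bytes that are not valid UTF-8

# errors="ignore" silently drops the bytes that cannot be decoded.
dropped = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD instead, so the loss stays visible.
marked = raw.decode("utf-8", errors="replace")

print(repr(dropped))
print(repr(marked))
```

If the dropped characters matter, converting from the correct source encoding is the safer route.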

#12


0  

This error may occur if the input data contains the escape character itself. By default the escape character is the "\" symbol, so if your input text contains a "\" character, try changing the default with the ESCAPE option.

#13


0  

With this error it is also quite possible that the field is encrypted in place. Be sure you are looking at the right table; in some cases administrators will create an unencrypted view that you can use instead. I recently encountered a very similar issue.

#14


0  

I got the same error when I was trying to copy a csv generated by Excel to a Postgres table (all on a Mac). This is how I resolved it:

1) Open the file in Atom (the IDE that I use)

2) Make an insignificant change in the file. Save the file. Undo the change. Save again.

Presto! The COPY command worked now.

(I think Atom saved it in a format which worked)

#15


0  

For Python, you need to use:

class pg8000.types.Bytea(str): Bytea is a str-derived class that is mapped to a PostgreSQL byte array.

or

pg8000.Binary(value): Construct an object holding binary data.
