Reading from a file that contains Unicode characters

Date: 2021-12-01 07:01:31

I have a huge file that contains Unicode strings at the beginning (the first ~10,000 characters or so). I don't care about the Unicode part; the parts I'm interested in aren't Unicode, but whenever I try to read those parts I get '='. And if I load the entire file into a char array and write it to a temporary file (without altering the data) with ofstream, I get incorrect data; all I get is a text file filled with Í. If I remove the Unicode part manually, everything works fine. So it seems ifstream cannot deal with streams that contain Unicode data. If this assumption is true, is there any way to work with this file without introducing a new library to my project?


Thanks,

EDIT: Here's some sample code. The program reads from a file which contains characters (some, not all) that can't be represented in ASCII.


#include <fstream>
#include <iostream>
using namespace std;

int main() {
  ifstream inFile("somefile");
  inFile.seekg(0, ios_base::end);
  size_t size = inFile.tellg();
  inFile.seekg(0, ios_base::beg);
  char *book = new char[size];
  inFile.read(book, size);
  for (size_t i = 0; i < size; i++) {
    cout << book[i] << " " << i << endl; // book[i] will always be '='
  }
  ofstream outFile("TEST.txt");
  outFile.write(book, size);
  outFile.close();
  delete[] book;
}

2 Answers

#1


4  

Keith Thompson's question is very important. Depending on the Unicode encoding, writing a small C routine that reads (and discards) the Unicode characters can be trivial, or slightly more complex.


Supposing the encoding is UTF-8, you will have a problem determining when to stop discarding, because ASCII is a subset of UTF-8: any time you encounter an ASCII char, you might be tempted to say "this is it, we're back in ASCII land," and yet the next char might still be outside the ASCII range.


So you need to read the file and determine where the last character > 127 is. Anything after that is plain ASCII -- hopefully.


#2


0  

A text file is generally in just one encoding: UTF-8, UTF-16 (big- or little-endian), UTF-32 (big- or little-endian), ASCII, or another ANSI code page. Mixing encodings is only possible in some custom way.


That said, you will have to read both the data you need and the data you don't in the same encoding. If you know the format is UTF-8 you could, depending on what you are going to do with the data, read the file as a binary file into a char buffer piece by piece. Then you could use an API like strnextc (on Windows; an equivalent should be available on other platforms) to move character by character through the buffer. Once you reach the end, you could move the remainder to the front of the buffer and load the rest of the buffer from the file.


In fact, you could use the above approach for any encoding. But for UTF-16 you could try using wifstream, provided the endianness of the file matches that of the platform you are running on. You would also need to check whether your implementation of wifstream handles a change in endianness and takes care of the BOM (byte order mark), the 2-byte sequence ("FE FF" or "FF FE") generally present at the beginning of a file, to say nothing of surrogate pairs.

