How do I open a file in Linux using a wchar_t* that contains a non-ASCII string?

Time: 2022-03-08 20:13:34

Environment: GCC/G++, Linux

I have a non-ascii file in file system and I'm going to open it.

Now I have a wchar_t*, but I don't know how to open it. (My trusty fopen only takes a char* file name.)

Please help. Thanks a lot.

6 solutions

#1


12  

There are two possible answers:

If you want to make sure all Unicode filenames are representable, you can hard-code the assumption that the filesystem uses UTF-8 filenames. This is the "modern" Linux desktop-app approach. Just convert your strings from wchar_t (UTF-32) to UTF-8 with library functions (iconv would work well) or your own implementation (but look up the specs so you don't get it horribly wrong like Shelwien did), then use fopen.
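
A rough sketch of the iconv route, not a definitive implementation. It assumes glibc, where "WCHAR_T" is accepted as an encoding name by iconv_open; the helper names wide_to_utf8_iconv and fopen_wide_iconv are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <cwchar>
#include <string>
#include <iconv.h>

// Convert a wchar_t string to UTF-8 with iconv.
// Assumption: "WCHAR_T" is a valid source encoding name (true for glibc).
// Returns false if the converter is unavailable or the input cannot be converted.
bool wide_to_utf8_iconv(const wchar_t* wide, std::string& out) {
    assert(wide != nullptr);
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1) return false;

    size_t in_left = std::wcslen(wide) * sizeof(wchar_t);
    // UTF-8 needs at most 4 bytes per code point.
    std::string buf(std::wcslen(wide) * 4 + 1, '\0');
    char* in = reinterpret_cast<char*>(const_cast<wchar_t*>(wide));
    char* dst = &buf[0];
    size_t out_left = buf.size();

    size_t rc = iconv(cd, &in, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) return false;

    out.assign(buf.data(), buf.size() - out_left);
    return true;
}

// Hypothetical helper: open a file whose name is held as wchar_t.
FILE* fopen_wide_iconv(const wchar_t* name, const char* mode) {
    std::string utf8;
    if (!wide_to_utf8_iconv(name, utf8)) return nullptr;
    return std::fopen(utf8.c_str(), mode);
}
```

Something like fopen_wide_iconv(L"caf\u00e9.txt", "rb") would then look for the file named by the UTF-8 bytes, whatever the current locale is.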

If you want to do things the more standards-oriented way, you should use wcsrtombs to convert the wchar_t string to a multibyte char string in the locale's encoding (which hopefully is UTF-8 anyway on any modern system) and use fopen. Note that this requires that you previously set the locale with setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "").
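
A minimal sketch of the locale-based route, assuming the program has already called setlocale(LC_CTYPE, "") at startup. The helper names wide_to_locale and fopen_wide are hypothetical.

```cpp
#include <cassert>
#include <clocale>
#include <cstdio>
#include <cwchar>
#include <string>
#include <vector>

// Convert a wchar_t string to the multibyte encoding of the current LC_CTYPE locale.
// Returns false if some character has no representation in that encoding.
bool wide_to_locale(const wchar_t* wide, std::string& out) {
    assert(wide != nullptr);
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide;
    // With a null destination, wcsrtombs only reports the required length.
    std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) return false;

    std::vector<char> buf(len + 1);
    src = wide;
    state = std::mbstate_t();
    std::wcsrtombs(buf.data(), &src, buf.size(), &state);
    out.assign(buf.data(), len);
    return true;
}

// Hypothetical helper: fopen for a wchar_t file name.
// The caller is expected to have run setlocale(LC_CTYPE, "") once at startup.
FILE* fopen_wide(const wchar_t* name, const char* mode) {
    std::string bytes;
    if (!wide_to_locale(name, bytes)) return nullptr;
    return std::fopen(bytes.c_str(), mode);
}
```

If the locale's encoding cannot represent a character in the name, fopen_wide returns a null pointer without trying to open anything, which is the price of this approach compared with hard-coding UTF-8.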

And finally, not exactly an answer but a recommendation:

Storing filenames as wchar_t strings is probably a horrible mistake. You should instead store filenames as abstract byte strings, and only convert those to wchar_t just-in-time for displaying them in the user interface (if it's even necessary for that; many UI toolkits use plain byte strings themselves and do the interpretation as characters for you). This way you eliminate a lot of possible nasty corner cases, and you never encounter a situation where some files are inaccessible due to their names.
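
To illustrate the recommendation, here is a sketch that keeps the name as a std::string of raw bytes and only builds a wide string for display. It assumes a 32-bit wchar_t and that names are usually UTF-8; display_name is a made-up helper, and its decoder is deliberately simplified.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode bytes as UTF-8 for display only. Bytes that do not form a sequence
// this decoder accepts become U+FFFD. The byte string itself stays the real
// file name and is what gets passed to fopen.
// Simplified: overlong three- and four-byte forms and surrogates are not rejected.
std::wstring display_name(const std::string& bytes) {
    static_assert(sizeof(wchar_t) >= 4, "sketch assumes 32-bit wchar_t as on Linux");
    const wchar_t kReplacement = static_cast<wchar_t>(0xFFFD);
    std::wstring out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;
        unsigned long cp;
        if (b < 0x80)                    { cp = b;        extra = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; extra = 1; }
        else if (b >= 0xE0 && b <= 0xEF) { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; extra = 3; }
        else { out += kReplacement; ++i; continue; }  // stray continuation or invalid lead byte

        // All expected continuation bytes (10xxxxxx) must be present.
        bool ok = i + extra < bytes.size();
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) { out += static_cast<wchar_t>(cp); i += extra + 1; }
        else    { out += kReplacement; ++i; }
    }
    return out;
}
```

A name that is not valid UTF-8 still opens fine with fopen(name.c_str(), ...); only its on-screen rendering shows replacement characters.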

#2


3  

Linux is not UTF-8, but it's your only choice for filenames anyway

(Files can have anything you want inside them.)


With respect to filenames, Linux does not really have a string encoding to worry about. Filenames are byte strings that need to be NUL-terminated.

This doesn't precisely mean that Linux is UTF-8, but it does mean that it's not compatible with wide characters as they could have a zero in a byte that's not the end byte.

But UTF-8 preserves the no-nulls-except-at-the-end model, so I have to believe that the practical approach is "convert to UTF-8" for filenames.
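
A small sketch of why raw wide strings don't fit the byte-string model while UTF-8 does. The function names are hypothetical; the point is only to count zero bytes in each in-memory form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Count zero bytes in the first n bytes at p.
std::size_t count_zero_bytes(const void* p, std::size_t n) {
    assert(p != nullptr);
    const unsigned char* b = static_cast<const unsigned char*>(p);
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (b[i] == 0) ++zeros;
    return zeros;
}

// How many bytes a char*-based API would see if handed a wide string's memory:
// it stops at the first zero byte, long before the real end of the string.
std::size_t visible_prefix_length(const wchar_t* w) {
    assert(w != nullptr);
    return std::strlen(reinterpret_cast<const char*>(w));
}
```

For an ASCII-only wide string every character carries zero padding bytes, so a byte-oriented API sees at most the first character, whereas the UTF-8 form of the same text has no zero byte before its terminator.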

The content of files is a matter for standards above the Linux kernel level, so here there isn't anything Linux-y that you can or want to do. The content of files will be solely the concern of the programs that read and write them. Linux just stores and returns the byte stream, and it can have all the embedded nuls you want.
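
As a quick illustration of the content side, this sketch writes bytes with embedded NULs and reads them back. It uses tmpfile() so no file name is involved, and assumes a writable temporary directory; round_trip is a made-up name.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Write data to a temporary file and return what is read back.
// Returns an empty string if the temporary file cannot be created.
std::string round_trip(const std::string& data) {
    FILE* f = std::tmpfile();
    if (!f) return std::string();
    std::fwrite(data.data(), 1, data.size(), f);
    std::rewind(f);
    std::string back(data.size(), '\0');
    std::size_t n = std::fread(&back[0], 1, back.size(), f);
    assert(n <= back.size());
    back.resize(n);
    std::fclose(f);
    return back;
}
```

The length has to travel with the data (std::string here, or an explicit count) because strlen-style functions would stop at the first NUL.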

#3


1  

Convert wchar string to utf8 char string, then use fopen.

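typedef unsigned int   uint;
typedef unsigned char  byte;

// Encodes the NUL-terminated wide string w as UTF-8 into s; returns the byte count.
// On Linux wchar_t is 32 bits wide, so each element is read as a whole code point
// rather than as a 16-bit word. s needs room for 4 bytes per character plus the terminator.
int UTF16to8( const wchar_t* w, char* s ) {
  uint  c;
  byte* q = (byte*)s; byte* q0 = q;
  while( 1 ) {
    c = (uint)*w++;
    if( c==0 ) break;
    if( c<0x080 ) *q++ = c; else 
      if( c<0x800 ) *q++ = 0xC0+(c>>6), *q++ = 0x80+(c&63); else 
        if( c<0x10000 ) *q++ = 0xE0+(c>>12), *q++ = 0x80+((c>>6)&63), *q++ = 0x80+(c&63); else
          *q++ = 0xF0+(c>>18), *q++ = 0x80+((c>>12)&63), *q++ = 0x80+((c>>6)&63), *q++ = 0x80+(c&63);
  }
  *q = 0;
  return q-q0;
}

// Decodes the NUL-terminated UTF-8 string s into w; returns the number of wide characters.
// Malformed input is not validated. w needs room for one element per input byte plus the terminator.
int UTF8to16( const char* s, wchar_t* w ) {
  uint  cache=0,wait=0,c;
  const byte* p = (const byte*)s;
  wchar_t* q = w; wchar_t* q0 = q;
  while(1) {
    c = *p++;
    if( c==0 ) break;
    if( c<0x80 ) cache=c,wait=0; else                          // single byte
      if( (c>=0xC0) && (c<0xE0) ) cache=c&31,wait=1; else      // lead byte of 2
        if( (c>=0xE0) && (c<0xF0) ) cache=c&15,wait=2; else    // lead byte of 3
          if( c>=0xF0 ) cache=c&7,wait=3; else                 // lead byte of 4
            if( wait ) cache=(cache<<6)+(c&63),wait--;         // continuation byte
    if( wait==0 ) *q++=cache;
  }
  *q = 0;
  return q-q0;
}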

#4


0  

Check out this document

http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm

I think Linux follows the POSIX standard, which treats all file names as UTF-8.

#5


0  

I take it it's the name of the file that contains non-ascii characters, not the file itself, when you say "non-ascii file in file system". It doesn't really matter what the file contains.

You can do this with normal fopen, but you'll have to match the encoding the filesystem uses.

It depends on what version of Linux and what filesystem you're using and how you've set it up, but likely, if you're lucky, the filesystem uses UTF-8. So take your wchar_t (which is probably a UTF-16 encoded string?), convert it to a char string encoded in UTF-8, and pass that to fopen.
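
Whether a wchar_t string holds UTF-16 or UTF-32 depends on the platform, so it may be worth checking before writing a converter. The sketch below is only a heuristic based on the width of wchar_t; the language standards leave the encoding implementation-defined, and both function names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Guess the encoding of wchar_t strings from the width of wchar_t.
// Heuristic only: 4 bytes is typical for glibc on Linux, 2 bytes for Windows.
std::string guess_wchar_encoding() {
    if (sizeof(wchar_t) == 4) return "UTF-32";
    if (sizeof(wchar_t) == 2) return "UTF-16";
    return "unknown";
}

// Number of wchar_t units the compiler uses for one code point above U+FFFF.
std::size_t units_per_astral() {
    const std::wstring s = L"\U0001F600";
    assert(s.size() == 1 || s.size() == 2);  // one UTF-32 unit or a surrogate pair
    return s.size();
}
```

With GCC on Linux this normally reports UTF-32 and one unit per code point, in which case a UTF-16 to UTF-8 routine that reads 16-bit words is the wrong tool.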

#6


0  

#include <cstdio>
#include <string>

// locals
std::string file_to_read;      // path of the file to read (any file)
std::wstring file;             // receives the file's bytes, one wchar_t per byte
FILE *stream;
char buffer = 0;

// plain fopen replaces the Windows-only fopen_s so this also builds on Linux
stream = fopen( file_to_read.c_str(), "rb" );   // in binary mode
if( stream != NULL )
  {
      // loop on fread's result; testing feof() first processes the last character twice
      while( fread( &buffer, sizeof( char ), 1, stream ) == 1 )
      {
        file.append( 1, (wchar_t)(unsigned char)buffer );
      }
      fclose( stream );
  }

// and the file is in wstring format. You can use it in any C++ wstring operation
// (each byte is widened as-is; a multibyte encoding such as UTF-8 is not decoded)
// this code is fast enough I think, at least in my practice
