Explanation of the difference between std::string and std::wstring, 807 upvotes, with examples

Time: 2022-08-20 18:14:03

string? wstring?

std::string is a basic_string templated on a char, and std::wstring on a wchar_t.

char vs. wchar_t

char is supposed to hold a character, usually a 1-byte character. wchar_t is supposed to hold a wide character, and then things get tricky: on Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.

what about Unicode, then?

The problem is that neither char nor wchar_t is directly tied to unicode.

On Linux?

Let's take a Linux OS: My Ubuntu system is already unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. Unicode string of chars). The following code:

    #include <cstring>
    #include <cwchar>   // wcslen
    #include <iostream>

    int main(int argc, char* argv[])
    {
        const char text[] = "olé" ;

        std::cout << "sizeof(char) : " << sizeof(char) << std::endl ;
        std::cout << "text : " << text << std::endl ;
        std::cout << "sizeof(text) : " << sizeof(text) << std::endl ;
        std::cout << "strlen(text) : " << strlen(text) << std::endl ;

        // print every char of the narrow string as an unsigned number
        std::cout << "text(bytes) :" ;
        for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
        {
            std::cout << " " << static_cast<unsigned int>(
                                    static_cast<unsigned char>(text[i])
                                );
        }
        std::cout << std::endl << std::endl ;

        // - - -

        const wchar_t wtext[] = L"olé" ;

        std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl ;
        //std::cout << "wtext : " << wtext << std::endl ; <- error
        std::cout << "wtext : UNABLE TO CONVERT NATIVELY." << std::endl ;
        std::wcout << L"wtext : " << wtext << std::endl;
        std::cout << "sizeof(wtext) : " << sizeof(wtext) << std::endl ;
        std::cout << "wcslen(wtext) : " << wcslen(wtext) << std::endl ;

        // print every wchar_t of the wide string as an unsigned number
        std::cout << "wtext(bytes) :" ;
        for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
        {
            std::cout << " " << static_cast<unsigned int>(
                                    static_cast<unsigned short>(wtext[i])
                                );
        }
        std::cout << std::endl << std::endl ;

        return 0;
    }

outputs the following text:

sizeof(char)    : 1
text : olé
sizeof(text) : 5
strlen(text) : 4
text(bytes) : 111 108 195 169

sizeof(wchar_t) : 4
wtext : UNABLE TO CONVERT NATIVELY.
wtext : ol�
sizeof(wtext) : 16
wcslen(wtext) : 3
wtext(bytes) : 111 108 233

You'll see the "olé" text in char is really constructed by four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise)

So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And since std::string works with char, std::string is already Unicode-ready.

Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating/playing with unicode chars because some combination of chars is forbidden in UTF-8.

On Windows?

On Windows, this is a bit different. Win32 had to support a lot of applications working with char and with the different charsets/codepages produced all over the world before the advent of Unicode.

So their solution was an interesting one: If an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine. For example, "olé" would be "olé" in a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.

For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide, and is encoded in UTF-16, which is Unicode encoded in 2-byte units (or at the very least, the mostly compatible UCS-2, which is almost the same thing IIRC).

Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.

Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK+ or Qt...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText (a low-level API function to set the label on a Win32 GUI).

Memory issues?

UTF-32 is 4 bytes per character, so there is not much to add, other than that a UTF-8 text and a UTF-16 text will always use less memory than, or the same amount as, a UTF-32 text (and usually less).

If there is a memory issue, then you should know that for most Western languages, UTF-8 text will use less memory than the same UTF-16 text.

Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same, or larger for UTF-8 than for UTF-16.

All in all, UTF-16 will mostly use 2 bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.

See http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.

Conclusion

1. When should I use std::wstring over std::string?

On Linux? Almost never (§).
On Windows? Almost always (§).
On cross-platform code? Depends on your toolkit...

(§) : unless you use a toolkit/framework saying otherwise

2. Can std::string hold all the ASCII character set including special characters?

Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!

On Linux? Yes.
On Windows? Only special characters available for the current locale of the Windows user.

Edit (After a comment from Johann Gerell): a std::string will be enough to handle all char based strings (each char being a number from 0 to 255). But:

  1. ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
  2. a char from 0 to 127 will be held correctly
  3. a char from 128 to 255 will have a meaning that depends on your encoding (unicode, non-unicode, etc.), but a std::string will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.

3. Is std::wstring supported by almost all popular C++ compilers?

Mostly, with the exception of GCC-based compilers that are ported to Windows.
It works on my g++ 4.3.2 (under Linux), and I have used the Unicode API on Win32 since Visual C++ 6.

4. What is exactly a wide character?

In C/C++, it's a character type written wchar_t which is larger than the simple char character type. It is supposed to be used to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).

answered Dec 31 '08 at 12:47
paercebal
 
4  
@Sorin Sbarnea: UTF-8 could take 1-6 bytes, but apparently the standard limits it to 1-4. See en.wikipedia.org/wiki/UTF8#Description for more information. – paercebal Jan 13 '10 at 13:10
8  
While this example produces different results on Linux and Windows, the C++ program contains implementation-defined behavior as to whether olé is encoded as UTF-8 or not. Furthermore, the reason you cannot natively stream wchar_t * to std::cout is that the types are incompatible, resulting in an ill-formed program, and it has nothing to do with the use of encodings. It's worth pointing out that whether you use std::string or std::wstring depends on your own encoding preference rather than the platform, especially if you want your code to be portable. – John Leidegren Aug 9 '12 at 9:37
4  
@paercebal Whatever the platform supports is entirely arbitrary and besides the point. If you store all strings internally as UTF-8 on Windows you'll have to convert them to either ANSI or UTF-16 and call the corresponding Win32 function but if you know your UTF-8 strings are just plain ASCII strings you don't have to do anything. The platform doesn't dictate how you use strings as much as the circumstances. – John Leidegren Aug 9 '12 at 16:35
13  
Windows actually uses UTF-16 and has for quite some time; older versions of Windows did use UCS-2, but this is not the case any longer. My only issue here is the conclusion that std::wstring should be used on Windows because it's a better fit for the Unicode Windows API, which I think is fallacious. If your only concern was calling into the Unicode Windows API and not marshalling strings then sure, but I don't buy this as the general case. – John Leidegren Aug 9 '12 at 18:15
11  
@ John Leidegren : If your only concern was calling into the Unicode Windows API and not marshalling strings then sure : Then, we agree. I'm coding in C++, not JavaScript. Avoiding useless marshalling or any other potentially costly processing at runtime when it can be done at compile time is at the heart of that language. Coding against WinAPI and using std::string is just an unjustified waste of runtime resources. You find it fallacious, and it's OK, as it is your viewpoint. My own is that I won't write code with pessimization on Windows just because it looks better from the Linux side. – paercebal Aug 9 '12 at 19:48

So, every reader here now should have a clear understanding about the facts, the situation. If not, then you must read paercebal's outstandingly comprehensive answer [btw: thanks!].

My pragmatic conclusion is shockingly simple: all that C++ (and STL) "character encoding" stuff is substantially broken and useless. Blame it on Microsoft or not, that will not help anyway.

My solution, after in-depth investigation, much frustration and the consequential experiences is the following:

  1. accept, that you have to be responsible on your own for the encoding and conversion stuff (and you will see that much of it is rather trivial)

  2. use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String)

  3. accept that such an UTF8String object is just a dumb, but cheap container. Never access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really, really do not want to waste your time writing text manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense... just use the ICU library for those).

  4. use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String) - this is a compromise, and a concession to the mess that the WIN32 API introduced. UCS-2 is sufficient for most of us (more on that later...).

  5. use UCS2String instances whenever a character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte-representation. It is simple, fast, easy.

  6. add two utility functions to convert back & forth between UTF-8 and UCS-2:

    UCS2String ConvertToUCS2( const UTF8String &str );
    UTF8String ConvertToUTF8( const UCS2String &str );

The conversions are straightforward, google should help here ...

That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.

Alternatives & Improvements

  • conversions from & to single-byte character encodings (e.g. ISO-8859-1) can be realized with help of plain translation tables, e.g. const wchar_t tt_iso88951[256] = {0,1,2,...}; and appropriate code for conversion to & from UCS2.

  • if UCS-2 is not sufficient, then switch to UCS-4 (typedef std::basic_string<uint32_t> UCS4String)

ICU or other unicode libraries?

For advanced stuff.

answered Nov 7 '11 at 6:07
Frunsi
 
    
Dang, it's not good to know that native Unicode support isn't there. – Mihai Danila Dec 15 '13 at 16:59
    
@Frunsi, I'm curious to know if you've tried Glib::ustring and if so, what are your thoughts? – Caroline Beltran Sep 19 '14 at 19:44
    
@CarolineBeltran: I know Glib, but I never used it, and I probably will never even use it, because it is rather limited to a rather unspecific target platform (unixoid systems...). Its windows port is based on external win2unix-layer, and there IMHO is no OSX-compatibility-layer at all. All this stuff is directing clearly into a wrong direction, at least for my code (on this arch level...) ;-) So, Glib is not an option – Frunsi Sep 20 '14 at 5:01
4  
Search, replace, and so on works just fine on UTF-8 strings (a part of the byte sequence representing a character can never be misinterpreted as another character). In fact, UTF-16 and UTF-32 don't make this any easier at all: all three encodings are multibyte encodings in practice, because a user-perceived character (grapheme cluster) can be any number of unicode codepoints long! The pragmatic solution is to use UTF-8 for everything, and convert to UTF-16 only when dealing with the Windows API. – Daniel Oct 17 '14 at 10:49
1  
@Frunsi: Search and replace works just as fine with UTF-8 as with UTF-32. It's precisely because proper Unicode-aware text processing needs to deal with multi-codepoint 'characters' anyways, that using a variable length encoding like UTF-8 doesn't make string processing any more complicated. So just use UTF-8 everywhere. Normal C string functions will work fine on UTF-8 (and correspond to ordinal comparisons on the Unicode string), and if you need anything more language-aware, you'll have to call into a Unicode library anyways, UTF-16/32 can't save you from that. – Daniel Oct 23 '14 at 10:16

I recommend avoiding std::wstring on Windows or elsewhere, except when the interface requires it, or right next to Windows API calls and their encoding conversions, as syntactic sugar.

My view is summarized in http://utf8everywhere.org of which I am a co-author.

Unless your application is API-call-centric, e.g. mainly UI application, the suggestion is to store Unicode strings in std::string and encoded in UTF-8, performing conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.

And now, answering your questions:

  1. A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface APIs that prefer UTF-16 strings. I use them only in direct vicinity of such API calls.
  2. This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how you treat its content. My recommendation is UTF-8, so it will be able to hold all unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
  3. No.
  4. Wide character is a confusing name. In the early days of Unicode, there was a belief that a character can be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka wide characters). A character in UTF-16 takes either one or two pairs.
answered Dec 29 '09 at 16:14
Pavel Radzivilovsky
 
  1. When you want to have wide characters stored in your string. How wide depends on the implementation. Visual C++ defaults to 16 bits if I remember correctly, while GCC defaults depending on the target. It's 32 bits long here. Please note wchar_t (wide character type) has nothing to do with unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and that it is at least as long as char. You can store unicode strings fine into std::string using the utf-8 encoding too. But it won't understand the meaning of unicode code points. So str.size() won't give you the amount of logical characters in your string, but merely the amount of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle utf-8.

    If your wchar_t is 32 bits long, then you can use utf-32 as a unicode encoding, and you can store and handle unicode strings using a fixed (utf-32 is fixed length) encoding. This means your wstring's s.size() function will then return the right amount of wchar_t elements and logical characters.

  2. Yes, char is always at least 8 bit long, which means it can store all ASCII values.
  3. Yes, all major compilers support it.
answered Dec 31 '08 at 11:48
Johannes Schaub - litb
 
    
I'm curious about #2. I thought 7 bits would be technically valid too? Or is it required to be able to store anything past 7-bit ASCII chars? – jalf Dec 31 '08 at 12:11
1  
yes, jalf. c89 specifies minimal ranges for basic types in its documentation of limits.h (for unsigned char, that's 0..255 min), and a pure binary system for integer types. it follows char, unsigned char and signed char have minimum bit lengths of 8. c++ inherits those rules. – Johannes Schaub - litb Dec 31 '08 at 12:26
    
Ah cool, thanks. :) – jalf Dec 31 '08 at 12:32
14  
"This means your wstring's s.size() function will then return the right amount of wchar_t elements and logical characters." This is not entirely accurate, even for Unicode. It would be more accurate to say codepoint than "logical character", even in UTF-32 a given character may be composed of multiple codepoints. – Logan Capaldo May 16 '10 at 17:26 
    
Are you guys in essence saying that C++ doesn't have native support for the Unicode character set? – Mihai Danila Dec 15 '13 at 16:56

I frequently use std::string to hold utf-8 characters without any problems at all. I heartily recommend doing this when interfacing with APIs which use utf-8 as the native string type as well.

For example, I use utf-8 when interfacing my code with the Tcl interpreter.

The major caveat is that the length of the std::string is no longer the number of characters in the string.

answered Dec 31 '08 at 4:33
 
Juan
 
 
1  
Juan : Do you mean that std::string can hold all unicode characters but the length will report incorrectly? Is there a reason that it is reporting incorrect length? – Appu Dec 31 '08 at 4:35
3  
When using the utf-8 encoding, a single unicode character may be made up of multiple bytes. This is why utf-8 encoding is smaller when using mostly characters from the standard ascii set. You need to use special functions (or roll your own) to measure the number of unicode characters. – Juan Dec 31 '08 at 4:39
2  
(Windows specific) Most functions will expect that a string using bytes is ASCII and 2 bytes is Unicode, older versions MBCS. Which means if you are storing 8 bit unicode that you will have to convert to 16 bit unicode to call a standard windows function (unless you are only using ASCII portion). – Greg Domjan Dec 31 '08 at 4:58
2  
Not only will a std::string report the length incorrectly, but it will also output the wrong string. If some Unicode character is represented in UTF-8 as multiple bytes, which std::string thinks of as its own characters, then your typical std::string manipulation routines will probably output the several strange characters that result from the misinterpretation of the one correct character. – Mihai Danila Dec 15 '13 at 17:01
2  
I suggest changing the answer to indicate that strings should be thought of as only containers of bytes, and, if the bytes are some Unicode encoding (UTF-8, UTF-16, ...), then you should use specific libraries that understand that. The standard string-based APIs (length, substr, etc.) will all fail miserably with multibyte characters. If this update is made, I will remove my downvote. – Mihai Danila Oct 7 '14 at 14:19 

  1. When you want to store 'wide' (Unicode) characters.
  2. Yes: 255 of them (excluding 0).
  3. Yes.
  4. Here's an introductory article: http://www.joelonsoftware.com/articles/Unicode.html
answered Dec 31 '08 at 4:16
ChrisW
 
9  
std::string can hold 0 just fine (just be careful if you call the c_str() method) – Mr Fooz Dec 31 '08 at 4:40
3  
And strictly speaking, a char isn't guaranteed to be 8 bits. :) Your link in #4 is a must-read, but I don't think it answers the question. A wide character is strictly nothing to do with unicode. It is simply a wider character. (How much wider depends on OS, but typically 16 or 32 bit) – jalf Dec 31 '08 at 12:08
9  
wide != unicode! (especially on windows) – Pavel Radzivilovsky Jan 5 '11 at 12:43

Applications that are not satisfied with only 256 different characters have the options of either using wide characters (more than 8 bits) or a variable-length encoding (a multibyte encoding in C++ terminology) such as UTF-8. Wide characters generally require more space than a variable-length encoding, but are faster to process. Multi-language applications that process large amounts of text usually use wide characters when processing the text, but convert it to UTF-8 when storing it to disk.

The only difference between a string and a wstring is the data type of the characters they store. A string stores chars whose size is guaranteed to be at least 8 bits, so you can use strings for processing e.g. ASCII, ISO-8859-15, or UTF-8 text. The standard says nothing about the character set or encoding.

Practically every compiler uses a character set whose first 128 characters correspond with ASCII. This is also the case with compilers that use UTF-8 encoding. The important thing to be aware of when using strings in UTF-8 or some other variable-length encoding, is that the indices and lengths are measured in bytes, not characters.

The data type of a wstring is wchar_t, whose size is not defined in the standard, except that it has to be at least as large as a char, usually 16 bits or 32 bits. wstring can be used for processing text in the implementation defined wide-character encoding. Because the encoding is not defined in the standard, it is not straightforward to convert between strings and wstrings. One cannot assume wstrings to have a fixed-length encoding either.

If you don't need multi-language support, you might be fine with using only regular strings. On the other hand, if you're writing a graphical application, it is often the case that the API supports only wide characters. Then you probably want to use the same wide characters when processing the text. Keep in mind that UTF-16 is a variable-length encoding, meaning that you cannot assume length() to return the number of characters. If the API uses a fixed-length encoding, such as UCS-2, processing becomes easy. Converting between wide characters and UTF-8 is difficult to do in a portable way, but then again, your user interface API probably supports the conversion.

answered Sep 11 '11 at 9:28
Seppo Enarvi
 
    
So, paraphrasing the first paragraph: Applications needing more than 256 characters need to use a multibyte-encoding or a maybe_multibyte-encoding. – Deduplicator Oct 10 '15 at 12:44
    
Generally 16 and 32 bit encodings such as UCS-2 and UCS-4 are not called multibyte encodings, though. The C++ standard distinguishes between multibyte encodings and wide characters. A wide character representation uses a fixed number (generally more than 8) bits per character. Encodings that use a single byte to encode the most common characters, and multiple bytes to encode the rest of the character set, are called multibyte encodings. – Seppo Enarvi Oct 12 '15 at 21:16
    
Sorry, sloppy comment. Should have said variable-length encoding. UTF-16 is a variable-length-encoding, just like UTF-8. Pretending it isn't is a bad idea. – Deduplicator Oct 12 '15 at 21:23 
    
That's a good point. There's no reason why wstrings couldn't be used to store UTF-16 (instead of UCS-2), but then the convenience of a fixed-length encoding is lost. – Seppo Enarvi Oct 12 '15 at 22:13

1) As mentioned by Greg, wstring is helpful for internationalization, that is, when you will be releasing your product in languages other than English.

4) Check this out for wide character http://en.wikipedia.org/wiki/Wide_character

answered Dec 31 '08 at 4:24
Raghu
 
  1. when you want to use Unicode strings and not just ascii, helpful for internationalisation
  2. yes, but it doesn't play well with 0
  3. not aware of any that don't
  4. wide character is the compiler specific way of handling the fixed length representation of a unicode character, for MSVC it is a 2 byte character, for gcc I understand it is 4 bytes. and a +1 for http://www.joelonsoftware.com/articles/Unicode.html
answered Dec 31 '08 at 4:16
Greg Domjan
 
1  
2. An std::string can hold a NULL character just fine. It can also hold utf-8 and wide characters as well. – Juan Dec 31 '08 at 4:29
    
@Juan : That put me into confusion again. If std::string can keep unicode characters, what is special with std::wstring? – Appu Dec 31 '08 at 4:33
1  
@Appu: std::string can hold UTF-8 unicode characters. There are a number of unicode standards targeted at different character widths. UTF-8 is 8 bits wide. There's also UTF-16 and UTF-32 at 16 and 32 bits wide respectively – Greg D Dec 31 '08 at 4:40
    
With a std::wstring, each unicode character can be one wchar_t when using a fixed-length encoding, for example if you choose to use the Joel on Software approach that Greg links to. Then the length of the wstring is exactly the number of unicode characters in the string. But it takes up more space – Juan Dec 31 '08 at 4:43
    
I didn't say it could not hold a 0 '\0', and what I meant by doesn't play well is that some methods may not give you an expected result containing all the data of the wstring. So harsh on the down votes. – Greg Domjan Dec 31 '08 at 4:53

A good question! I think DATA ENCODING (sometimes a CHARSET is also involved) is a MEMORY EXPRESSION MECHANISM used to save data to a file or transfer data via a network, so I answer this question as follows:

1. When should I use std::wstring over std::string?

If the programming platform or API function is a single-byte one, and we want to process or parse some Unicode data, e.g. read from a Windows .REG file or a 2-byte network stream, we should declare a std::wstring variable to process it easily. E.g.: wstring ws=L"中国a" (6 octets of memory with a 2-byte wchar_t: 0x4E2D 0x56FD 0x0061); we can use ws[0] to get the character '中', ws[1] to get the character '国', ws[2] to get the character 'a', etc.

2. Can std::string hold the entire ASCII character set, including the special characters?

Yes. But note: in American ASCII and its single-byte extensions, each octet from 0x00 to 0xFF stands for one character (strictly, ASCII itself only defines 0x00 to 0x7F). That includes printable text such as "123abc&*_&" and the special characters you mention, which are mostly printed as a '.' to avoid confusing editors or terminals. Some other countries extended their own "ASCII" charsets; Chinese, for example, uses 2 octets to stand for one character.

3. Is std::wstring supported by all popular C++ compilers?

Maybe, or mostly. I have used VC++6 and GCC 3.3, and both support it.

4. What is exactly a "wide character"?

A wide character mostly means using 2 octets or 4 octets to hold the characters of all countries. The 2-octet UCS2 is a representative example: e.g. the English 'a' takes 2 octets in memory, 0x0061 (vs. ASCII, where 'a' takes 1 octet, 0x61).

https://*.com/questions/402283/stdwstring-vs-stdstring