I am trying to replace every non-alphabetic character in a string with " " using Boost:
#include <iostream>
#include <locale>
#include <string>
#include <boost/regex.hpp>

std::string sanitize(std::string &str)
{
    boost::regex re;
    re.imbue(std::locale("fr_FR.UTF-8"));
    re.assign("[^[:alpha:]]");
    str = boost::regex_replace(str, re, " ");
    return str;
}

int main()
{
    std::string test = "(ça) /.2424,@ va très bien ?";
    std::cout << sanitize(test) << std::endl;
    return 0;
}
The result is "a va tr s bien", but I would like to get "ça va très bien".
What am I missing?
1 solution
#1
boost::regex::imbue doesn't do what you are hoping for here. In particular, it won't make boost::regex work with UTF-8. (You could probably make it work this way with ISO 8859-1 or a similar single-byte character encoding, but that doesn't seem to be what you want here.)
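To see why the accented letters disappear, it may help to look at the string one byte at a time. The sketch below is not Boost code: sanitize_bytewise is a hypothetical helper that uses std::isalpha in the default "C" locale to approximate what a byte-oriented [^[:alpha:]] replacement does. In UTF-8, "ç" and "è" each occupy two bytes, and neither byte counts as alphabetic on its own, so both get replaced.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (an assumption for illustration, not part of Boost):
// replaces every byte that is not alphabetic in the "C" locale with a space.
std::string sanitize_bytewise(const std::string& input)
{
    std::string out = input;
    for (char& c : out) {
        // Cast to unsigned char first: passing a negative char value
        // to std::isalpha is undefined behaviour.
        if (!std::isalpha(static_cast<unsigned char>(c))) {
            c = ' ';
        }
    }
    return out;
}
```

Calling sanitize_bytewise on the UTF-8 bytes of "ça" turns both bytes of "ç" into spaces and keeps only the "a", which matches the shape of the output reported in the question.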
For UTF-8 support, you will need to use one of the boost::regex classes which will deal with Unicode - see http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/unicode.html.
Here is some code which I think does what you want:
#include <iostream>
#include <string>
#include <boost/regex/icu.hpp>  // requires Boost.Regex built with ICU support

std::string sanitize(std::string &str)
{
    // u32regex matches on Unicode code points rather than on bytes
    boost::u32regex re = boost::make_u32regex("[^[:alpha:]]");
    str = boost::u32regex_replace(str, re, " ");
    return str;
}

int main()
{
    std::string test = "(ça) /.2424,@ va très bien ?";
    std::cout << test << "\n" << sanitize(test) << std::endl;
    return 0;
}
See http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/ref/non_std_strings/icu/unicode_algo.html for more examples.
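Note that replacing each non-alphabetic character with a single space leaves runs of spaces wherever the input had several such characters in a row (for example "/.2424,@"). If the goal is exactly "ça va très bien", one option is to post-process the result. The sketch below assumes a hypothetical helper named collapse_spaces, written with the standard library only. It is safe on UTF-8 input because the space byte 0x20 never occurs inside a multi-byte sequence.

```cpp
#include <string>

// Hypothetical helper (an assumption for illustration, not part of Boost):
// collapses runs of ' ' into one space and drops leading/trailing spaces.
std::string collapse_spaces(const std::string& input)
{
    std::string out;
    out.reserve(input.size());
    bool pending_space = false;
    for (char c : input) {
        if (c == ' ') {
            // Remember the gap, but only if some text has been emitted already.
            pending_space = !out.empty();
        } else {
            // Emit the remembered gap only now that more text follows it.
            if (pending_space) {
                out.push_back(' ');
            }
            pending_space = false;
            out.push_back(c);
        }
    }
    return out;
}
```

It could be applied to the value returned by sanitize, e.g. collapse_spaces(sanitize(test)).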