How to match Unicode characters with boost::spirit?

Date: 2022-10-14 20:21:00

How can I match UTF-8 encoded Unicode characters using boost::spirit?

For example, I want to recognize all characters in this string:

$ echo "На берегу пустынных волн" | ./a.out
Н а б е р е г у п у с т ы н н ы х в о л н

When I try this simple boost::spirit program, it does not match the Unicode characters correctly:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <iostream>  // std::cin, std::cout
#include <vector>    // std::vector
namespace qi = boost::spirit::qi;

int main() {
  std::cin.unsetf(std::ios::skipws);
  boost::spirit::istream_iterator begin(std::cin);
  boost::spirit::istream_iterator end;

  std::vector<char> letters;
  bool result = qi::phrase_parse(
      begin, end,  // input     
      +qi::char_,  // match every character
      qi::space,   // skip whitespace 
      letters);    // result    

  BOOST_FOREACH(char letter, letters) {
    std::cout << letter << " ";
  }
  std::cout << std::endl;
}

It behaves like this:

$ echo "На берегу пустынных волн" | ./a.out | less
<D0> <9D> <D0> <B0> <D0> <B1> <D0> <B5> <D1> <80> <D0> <B5> <D0> <B3> <D1> <83> <D0> <BF> <D1> <83> <D1> <81> <D1> <82> <D1> <8B> <D0> <BD> <D0> <BD> <D1> <8B> <D1> <85> <D0> 
<B2> <D0> <BE> <D0> <BB> <D0> <BD> 

UPDATE:

Okay, I worked on this a bit more, and the following code sort of works. It first wraps the input in an iterator that yields 32-bit Unicode code points (as recommended here):

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <boost/regex/pending/unicode_iterator.hpp>
#include <iostream>  // std::cout
#include <string>    // std::string
#include <vector>    // std::vector
namespace qi = boost::spirit::qi;

int main() {
  std::string str = "На берегу пустынных волн";
  boost::u8_to_u32_iterator<std::string::const_iterator>
      begin(str.begin()), end(str.end());
  typedef boost::uint32_t uchar; // a unicode code point
  std::vector<uchar> letters;
  bool result = qi::phrase_parse(
      begin, end,             // input
      +qi::standard_wide::char_,  // match every character
      qi::space,              // skip whitespace
      letters);               // result
  BOOST_FOREACH(uchar letter, letters) {
    std::cout << letter << " ";
  }
  std::cout << std::endl;
}

The code prints the Unicode code points:

$ ./a.out 
1053 1072 1073 1077 1088 1077 1075 1091 1087 1091 1089 1090 1099 1085 1085 1099 1093 1074 1086 1083 1085 

which seems to be correct, according to the official Unicode table.

Now, can anyone tell me how to print the actual characters instead, given this vector of Unicode code points?

3 answers

#1


5  

I haven't got much experience with it, but apparently Spirit (SVN trunk version) supports Unicode.

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

See, e.g., the sexpr parser sample in the scheme demo:

BOOST_ROOT/libs/spirit/example/scheme

I believe this is based on the demo from a presentation by Bryce Lelbach [1], which specifically showcases:

  • wchar support
  • utree attributes (still experimental)
  • s-expressions

There is an online article about S-expressions and variant.

[1] If it is indeed based on that presentation, here are the video and the slides (pdf), as found here (odp).

#2


1  

You can't. The problem is not in boost::spirit but in the fact that Unicode is complicated. char does not mean a character; it means a byte. Even if you work at the code point level, a user-perceived character may still be represented by more than one code point (e.g. пусты́нных is 9 characters but 10 code points). This may not be obvious in Russian, which does not use diacritics extensively, but other languages do.

To actually iterate over user-perceived characters (grapheme clusters, in Unicode terminology), you'll need to use a specialized Unicode library, namely ICU.

However, what is the real-world use of iterating over the characters?

#3


0  

In Boost 1.58 I can match any Unicode symbol with this:

*boost::spirit::qi::unicode::char_

I don't know how to define a specific range of unicode symbols.
