How can I make flex (the lexical scanner generator) read UTF-8 character input?

Posted: 2021-11-06 09:40:27

It seems that flex doesn't support UTF-8 input. Whenever the scanner encounters a non-ASCII character, it stops scanning as if it had reached EOF.

Is there a way to force flex to eat my UTF-8 chars? I don't need it to actually match UTF-8 characters, just to consume them when I use the '.' pattern.

Any suggestions?

EDIT

The simplest solution would be:

ANY [\x00-\xff]

and then to use '{ANY}' instead of '.' in my rules.

2 answers

#1


I have been looking into this myself and reading the flex mailing list to see whether anyone has thought about it. Getting flex to read Unicode is a complex affair...

UTF-8 can be handled, but most other encodings (the 16-bit ones) would lead to massive tables driving the automaton.

A common method so far is:


What I did was simply write patterns that match single UTF-8 characters. They look something like the following, but you might want to re-read the UTF-8 specification, because I wrote this a long time ago. You will of course need to combine these, since you want Unicode strings, not just single characters.

UB      [\200-\277]
%%
[\300-\337]{UB}                   { do something }
[\340-\357]{UB}{2}                { do something }
[\360-\367]{UB}{3}                { do something }
[\370-\373]{UB}{4}                { do something }
[\374-\375]{UB}{5}                { do something }

Taken from the mailing list.
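
To make the lead-byte ranges above easier to check against the specification, here is a minimal C sketch of the same idea: inspect the first byte, decide how many continuation bytes should follow, and verify them. The function names `is_cont` and `utf8_seq_len` are made up for illustration and are not part of flex. Unlike the quoted patterns, the sketch follows RFC 3629, which limits UTF-8 to four bytes and excludes the lead bytes 0xC0, 0xC1 and 0xF5 to 0xFF. It only checks structure, so it does not reject overlong three- or four-byte forms or encoded surrogates.

```c
#include <stddef.h>

/* Continuation byte test: mirrors the UB [\200-\277] definition. */
static int is_cont(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/*
 * Return the length in bytes (1..4) of the UTF-8 sequence starting at s,
 * or 0 if the bytes do not form a structurally valid sequence.
 * n is the number of bytes available in s.
 * (Hypothetical helper for illustration; not a flex API.)
 */
size_t utf8_seq_len(const unsigned char *s, size_t n) {
    size_t len, i;

    if (n == 0) return 0;

    if (s[0] <= 0x7F)                       len = 1;  /* ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF)  len = 2;
    else if (s[0] >= 0xE0 && s[0] <= 0xEF)  len = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4)  len = 4;
    else return 0;  /* stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF */

    if (len > n) return 0;  /* sequence is truncated */

    for (i = 1; i < len; i++)
        if (!is_cont(s[i])) return 0;

    return len;
}
```

Translated back into flex patterns, this would correspond to lead-byte classes of roughly [\302-\337], [\340-\357] and [\360-\364], with the five- and six-byte rules dropped.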


After looking into it further, I may try to create a proper patch for UTF-8 support. The solution above seems unmaintainable for large .l files, and it is really ugly! You could use similar ranges to create a substitute for the '.' rule that matches all ASCII and UTF-8 characters, but that is still rather ugly.

Hope this helps!

#2


Writing a negated character class might also help:

[\n \t]     return WHITESPACE;
[^\n \t]    return NON_WHITESPACE;
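
One thing to keep in mind with this approach: flex patterns work on bytes, so a negated class like the one above matches a single byte, not a whole UTF-8 character. The following C sketch models the two rules byte by byte to show the effect. It assumes an 8-bit scanner, and the names `classify_byte` and `count_non_whitespace` are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>

enum token { WHITESPACE, NON_WHITESPACE };

/* One byte in, one token out, as the two single-byte rules would do. */
enum token classify_byte(unsigned char b) {
    return (b == '\n' || b == ' ' || b == '\t') ? WHITESPACE : NON_WHITESPACE;
}

/* Count how many NON_WHITESPACE tokens a NUL-terminated byte string yields. */
size_t count_non_whitespace(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (classify_byte((unsigned char)*s) == NON_WHITESPACE)
            count++;
    return count;
}
```

A two-byte character such as "é" (0xC3 0xA9) therefore produces two NON_WHITESPACE tokens. That is fine if the goal is only to consume the input, but a rule like `[^\n \t]+` would be needed to get one token per run of non-whitespace.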

