It seems that flex doesn't support UTF-8 input. Whenever the scanner encounters a non-ASCII character, it stops scanning as if it had reached EOF.
Is there a way to force flex to eat my UTF-8 chars? I don't want it to actually match UTF-8 chars, just eat them when using the '.' pattern.
Any suggestions?
EDIT
The simplest solution would be:
ANY [\x00-\xff]
and use 'ANY' instead of '.' in my rules.
2 Answers
#1
I have been looking into this myself and reading the flex mailing list to see if anyone had thought about it. Getting flex to read Unicode is a complex affair...
UTF-8 can be handled, but most other encodings (the 16-bit ones) would lead to massive tables driving the automaton.
A common method so far is:
What I did was simply write patterns that match single UTF-8 characters. They look something like the following, but you might want to re-read the UTF-8 specification because I wrote this so long ago. You will of course need to combine these, since you want Unicode strings, not just single characters.
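UB      [\200-\277]
%%
[\300-\337]{UB}      { /* 2-byte sequence: do something */ }
[\340-\357]{UB}{2}   { /* 3-byte sequence: do something */ }
[\360-\367]{UB}{3}   { /* 4-byte sequence: do something */ }
[\370-\373]{UB}{4}   { /* 5-byte form, no longer valid UTF-8 under RFC 3629: do something */ }
[\374-\375]{UB}{5}   { /* 6-byte form, no longer valid UTF-8 under RFC 3629: do something */ }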
Taken from the mailing list.
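To make the byte ranges in those patterns easier to check, here is a minimal C sketch of the same classification done by hand. The names utf8_seq_len and is_cont are hypothetical helpers, not part of flex. The sketch assumes input limited to the 1- to 4-byte forms, and like the patterns it only checks lead and continuation byte ranges, so it does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Continuation byte, the UB definition: [\200-\277] == 0x80..0xBF. */
static int is_cont(unsigned char b)
{
    return b >= 0x80 && b <= 0xBF;
}

/* Hypothetical helper (not part of flex): returns the length in bytes of
 * the sequence starting at s, or 0 if the bytes do not fit the ranges.
 * n is the number of bytes available at s. */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t need, i;

    assert(s != NULL);
    if (n == 0)
        return 0;

    if (s[0] <= 0x7F)
        need = 1;                               /* plain ASCII */
    else if (s[0] >= 0xC0 && s[0] <= 0xDF)
        need = 2;                               /* [\300-\337]{UB} */
    else if (s[0] >= 0xE0 && s[0] <= 0xEF)
        need = 3;                               /* [\340-\357]{UB}{2} */
    else if (s[0] >= 0xF0 && s[0] <= 0xF7)
        need = 4;                               /* [\360-\367]{UB}{3} */
    else
        return 0;       /* stray continuation byte or a 5/6-byte lead */

    if (need > n)
        return 0;                               /* truncated sequence */
    for (i = 1; i < need; i++)
        if (!is_cont(s[i]))
            return 0;
    return need;
}
```

Note that every comparison is done on unsigned char; with a plain char that happens to be signed, bytes above 0x7F would compare as negative values.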
I may look at creating a proper patch for UTF-8 support after looking into it further. The above solution seems unmaintainable for large .l files, and it is really ugly! You could use similar ranges to create a substitute for '.' that matches all ASCII and UTF-8 characters, but that is still rather ugly.
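Whichever substitute rule is used, the action still receives raw bytes, so a matched string's byte length is not its character count. A minimal C sketch of counting characters follows; utf8_count_chars is a hypothetical helper, and it assumes the buffer already holds well-formed UTF-8, since it simply counts every byte that is not a continuation byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of flex): counts characters in a byte
 * buffer by counting every byte outside the continuation range
 * 0x80..0xBF. Assumes buf holds well-formed UTF-8. */
size_t utf8_count_chars(const char *buf, size_t len)
{
    size_t i, count = 0;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        /* Cast first: plain char may be signed on this platform. */
        unsigned char b = (unsigned char)buf[i];
        if (b < 0x80 || b > 0xBF)
            count++;
    }
    return count;
}
```

Inside a flex action this could be called as utf8_count_chars(yytext, (size_t)yyleng), for instance to keep column numbers correct in error messages.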
Hope this helps!
#2
Writing a negated character class might also help:
[\n \t]    return WHITESPACE;
[^\n \t]   return NON_WHITESPACE;