Unix Flex Regex for Multi-Line Comments

Date: 2022-08-23 09:39:56

I am writing a lexical analyzer using Flex on Unix. If you have used it before, you know that you mainly just define the regexes for the tokens of whatever language the lexical analyzer is for. I am stuck on the final part: I need the correct regex for multi-line comments, one that allows something like

/* This is a comment */

but also allows

/* This **** //// is another type of comment */

Can anyone help with this?

4 answers

#1


14  

You don't match C style comments with a simple regular expression in Flex; they require a more complex matching method based on start states. The Flex FAQ says how (well, they do for the /*...*/ form; handling the other form in just the <INITIAL> state should be simple).
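To make the start-state idea a bit more concrete, here is a minimal plain-C sketch of a scanner that flips between two states, the way a Flex scanner would flip between INITIAL and a comment start condition. This is not Flex code and not the FAQ's solution; the names `strip_comments` and `enum state` are made up for illustration. In a real Flex file the extra state would be declared as a start condition and switched with BEGIN.

```c
#include <assert.h>
#include <stddef.h>

/* Two scanner states, named after the Flex start conditions they imitate. */
enum state { INITIAL, COMMENT };

/*
 * Copy src to dst with every block comment removed.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns 0 on success, -1 if the input ends inside a comment.
 */
int strip_comments(const char *src, char *dst)
{
    enum state st = INITIAL;
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (src[i] != '\0') {
        if (st == INITIAL && src[i] == '/' && src[i + 1] == '*') {
            st = COMMENT;            /* comparable to BEGIN(COMMENT) */
            i += 2;
        } else if (st == COMMENT && src[i] == '*' && src[i + 1] == '/') {
            st = INITIAL;            /* comparable to BEGIN(INITIAL) */
            i += 2;
        } else if (st == COMMENT) {
            i++;                     /* discard comment text */
        } else {
            dst[out++] = src[i++];   /* ordinary text passes through */
        }
    }
    dst[out] = '\0';
    return st == INITIAL ? 0 : -1;
}
```

The point of the separate state is that the closing delimiter is only looked for while inside a comment, so runs of stars or slashes in the comment body need no special regex tricks.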

#2


8  

If you're required to make do with just regex, however, there is indeed a not-too-complex solution:


"/*"([^*]|(\*+[^*/]))*\*+\/
The full explanation and derivation of that regex is excellently elaborated upon here.
In short:
  • "/*" marks the start of the comment.
  • ([^*]|(\*+[^*/]))* says: accept any character that is not a * (the [^*]), or accept a run of one or more * as long as the run is followed by a character that is neither a * nor a / (the (\*+[^*/])). This means that every ******... run is accepted except one that ends the comment, such as *****/, since you can't find a run of * there that isn't followed by a * or a /.
  • The *******/ case is then handled by the last part of the regex, \*+\/, which matches one or more * followed by a / to mark the end of the comment.

#3


0

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html does:

    "/*"            { comment(); }
    
    comment() {
        char c, c1;
    
    loop:
        while ((c = input()) != '*' && c != 0)
            putchar(c);
    
        if ((c1 = input()) != '/' && c != 0) {
            unput(c1);
            goto loop;
        }
    
        if (c != 0)
            putchar(c1);
    }
    

A related question whose answers would also solve this is "How do I write a non-greedy match in LEX / FLEX?"
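For comparison, here is a small standalone sketch of the same hand-written scanning idea, working on a C string instead of calling input(). It is not the code from the linked grammar: the name `skip_comment_body` is made up, nothing is echoed with putchar, and the string's terminating NUL plays the role of the 0 that the quoted routine treats as end of input. It tracks whether the previous character was a star instead of pushing characters back with unput().

```c
#include <assert.h>
#include <stddef.h>

/*
 * p points just past the opening delimiter of a block comment.
 * Returns a pointer to the first character after the closing
 * delimiter, or NULL if the text ends before the comment does.
 */
const char *skip_comment_body(const char *p)
{
    int prev_star = 0;               /* was the previous character a '*'? */

    assert(p != NULL);
    for (; *p != '\0'; p++) {
        if (*p == '/' && prev_star)
            return p + 1;            /* star followed by slash closes it */
        prev_star = (*p == '*');
    }
    return NULL;                     /* unterminated comment */
}
```

One thing worth noting if you adapt the quoted routine: input() returns an int, so holding its result in an int rather than a char is the safer choice when testing for end of input.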

#4


-2

I don't know Flex, but I do know regexes. /\/\*.*?\*\//s should match both types (in PCRE), but if you need to differentiate them in your analyser, you may then want to iterate over the list of matches to see whether they're the second type with /\*\*\s+\/{4}/
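As far as I know, Flex patterns have no non-greedy quantifier such as *?, so the PCRE pattern cannot be pasted into a Flex rule as is. What the non-greedy match amounts to, though, is "stop at the first closing delimiter after the opener", and that can be sketched in plain C with strstr. The function name `shortest_comment_end` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * If a block comment starts at s[0], return a pointer just past the
 * first closing delimiter that follows the opener; otherwise NULL.
 */
const char *shortest_comment_end(const char *s)
{
    const char *close;

    assert(s != NULL);
    if (strncmp(s, "/*", 2) != 0)
        return NULL;                 /* no comment starts here */
    close = strstr(s + 2, "*/");     /* earliest possible terminator */
    return close != NULL ? close + 2 : NULL;
}
```

Searching from s + 2 rather than s + 1 keeps the opener's star from being reused as part of the terminator, so "/*/" on its own is not taken for a complete comment.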
