How would you implement the off-side rule?

Time: 2021-07-05 18:10:10

I've already written a generator that does the trick, but I'd like to know the best possible way to implement the off-side rule.

In short: in this context, the off-side rule means that indentation is recognized as a syntactic element.

Here is the off-side rule in pseudocode, for making tokenizers that capture indentation in a usable form. I use pseudocode because I don't want to limit answers to one language:

token NEWLINE
    matches r"\n\ *"
    increase line count
    pick up and store the indentation level
    remember to also record the current level of parenthesis nesting

procedure layout tokens
    level = stack of indentation levels
    push 0 to level
    last_newline = none
    for each token
        if it is NEWLINE, put it in last_newline and get the next token
        if last_newline contains something
            extract new_level and parenthesis_count from last_newline
            - if the newline was inside parentheses
                clear last_newline and emit nothing for it
            - if new_level > level.top
                push new_level to level
                emit last_newline as INDENT token and clear last_newline
            - if new_level == level.top
                emit last_newline and clear last_newline
            - otherwise
                while new_level < level.top
                    pop from level
                    if new_level > level.top
                        freak out, indentation is broken.
                    emit last_newline as DEDENT token
                clear last_newline
        emit token
    while level.top != 0
        emit a DEDENT token
        pop from level

Comments are dropped before the tokens reach the layouter.
The layouter sits between the lexer and the parser.
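
The pseudocode above can be sketched in Python roughly as follows. This is only one possible rendering, under assumptions that are not in the original post: the `Token` dataclass, its `indent` and `paren_depth` fields, and the function name `layout` are made up for illustration, and the lexer is assumed to have filled in those two fields on every NEWLINE token.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class Token:
    kind: str              # e.g. "NAME", "NEWLINE", "INDENT", "DEDENT"
    text: str = ""
    indent: int = 0        # NEWLINE only: spaces that follow the line break
    paren_depth: int = 0   # NEWLINE only: open parentheses at that point


def layout(tokens: Iterable[Token]) -> Iterator[Token]:
    """Turn raw NEWLINE tokens into NEWLINE / INDENT / DEDENT tokens."""
    levels: List[int] = [0]
    pending: Optional[Token] = None   # most recent NEWLINE not yet acted on

    for tok in tokens:
        if tok.kind == "NEWLINE":
            pending = tok             # consecutive NEWLINEs collapse into one
            continue
        if pending is not None:
            nl, pending = pending, None
            if nl.paren_depth > 0:
                pass                  # line break inside parentheses: ignore
            elif nl.indent > levels[-1]:
                levels.append(nl.indent)
                yield Token("INDENT", indent=nl.indent)
            elif nl.indent == levels[-1]:
                yield Token("NEWLINE", indent=nl.indent)
            else:
                while nl.indent < levels[-1]:
                    levels.pop()
                    if nl.indent > levels[-1]:
                        raise SyntaxError("dedent does not match any outer level")
                    yield Token("DEDENT", indent=nl.indent)
        yield tok

    # Close every block that is still open at end of input.
    while levels[-1] != 0:
        levels.pop()
        yield Token("DEDENT")
```

Because `layout` is a generator, a parser can pull tokens from it lazily, for example with `for tok in layout(lexer_output): ...`.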

This layouter never generates more than one NEWLINE at a time, and it doesn't generate a NEWLINE when a change in indentation is coming up. The parsing rules therefore remain quite simple. I think it's pretty good, but let me know if there's a better way of accomplishing this.

After using this for a while, I've noticed that it may be nice to emit a NEWLINE after DEDENTs anyway. That way you can separate expressions with NEWLINE while keeping INDENT ... DEDENT as a trailer of an expression.
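
A minimal sketch of that idea as a separate pass, in Python. It works on plain token-kind strings, the name `newline_after_dedents` is hypothetical, and it assumes that a run of DEDENTs at the very end of the input needs no NEWLINE after it.

```python
from typing import Iterable, Iterator


def newline_after_dedents(kinds: Iterable[str]) -> Iterator[str]:
    """Insert a NEWLINE after every run of DEDENTs that is followed by more input."""
    after_dedent = False
    for kind in kinds:
        if kind == "DEDENT":
            after_dedent = True
        elif after_dedent:
            # First token after the DEDENT run: separate it from the
            # statement that the DEDENTs just closed.
            yield "NEWLINE"
            after_dedent = False
        yield kind
```

With this pass in place, a grammar can treat NEWLINE as the statement separator everywhere and an INDENT ... DEDENT block as an optional tail of a statement.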

3 answers

#1


8  

I've written tokenizers and parsers for a couple of little indentation-centric domain-specific languages in the past couple of years, and what you have there looks pretty reasonable to me, for whatever that's worth. If I'm not mistaken, your method is quite similar to what Python does, for example, which seems like it ought to carry some weight.

Converting NEWLINE NEWLINE INDENT to just INDENT before it hits the parser definitely seems like the right way to do things -- it's a pain (IME) to always be peeking ahead for that in the parser! I've actually done that step as a separate layer in what ended up being a three-step process: the first layer combined what your lexer and layouter do, minus all the NEWLINE lookahead stuff (which made it very simple); the second (also very simple) layer folded consecutive NEWLINEs and converted NEWLINE INDENT to just INDENT (or, actually, COLON NEWLINE INDENT to INDENT, since in this case all indented blocks were always preceded by colons); and the parser was the third stage on top of that. But it also makes a lot of sense to me to do things the way you've described them, especially if you want to separate the lexer from the layouter, which presumably you'd want to do if you were using a code-generation tool to make your lexer, for instance, as is common practice.
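
The middle layer of such a pipeline might look something like this in Python. This is an illustrative sketch rather than the implementation described above: tokens are reduced to kind strings, `fold_newlines` is a made-up name, and the COLON NEWLINE INDENT variant is left out.

```python
from typing import Iterable, Iterator


def fold_newlines(kinds: Iterable[str]) -> Iterator[str]:
    """Collapse NEWLINE runs and drop a NEWLINE that directly precedes INDENT."""
    pending_newline = False
    for kind in kinds:
        if kind == "NEWLINE":
            pending_newline = True     # remember it, decide when the next token arrives
            continue
        if pending_newline and kind != "INDENT":
            yield "NEWLINE"
        pending_newline = False
        yield kind
    if pending_newline:
        yield "NEWLINE"                # keep a trailing NEWLINE at end of input
```

The pass needs only one token of lookahead, and that lookahead lives here instead of in the parser.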

I did have one application that needed to be a bit more flexible about indentation rules, essentially leaving the parser to enforce them when needed -- the following needed to be valid in certain contexts, for instance:

this line introduces an indented block of literal text:
    this line of the block is indented four spaces
  but this line is only indented two spaces

which doesn't work terribly well with INDENT/DEDENT tokens, since you end up needing to generate one INDENT for each column of indentation and an equal number of DEDENTs on the way back, unless you look way ahead to figure out where the indent levels are going to end up being, which it doesn't seem like you'd want a tokenizer to do. In that case I tried a few different things and ended up just storing a counter in each NEWLINE token that gave the change in indentation (positive or negative) for the following logical line. (Each token also stored all trailing whitespace, in case it needed preserving; for NEWLINE, the stored whitespace included the EOL itself, any intervening blank lines, and the indentation on the following logical line.) No separate INDENT or DEDENT tokens at all. Getting the parser to deal with that was a bit more work than just nesting INDENTs and DEDENTs, and might well have been hell with a complicated grammar that needed a fancy parser generator, but it wasn't nearly as bad as I'd feared, either. Again, no need for the parser to look ahead from NEWLINE to see if there's an INDENT coming up in this scheme.
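
To make the counter idea concrete, here is a small line-based sketch in Python. It is not the code from that application: the tuple shapes `("TEXT", content)` and `("NEWLINE", delta, whitespace)` and the name `tokenize_with_deltas` are assumptions for illustration, only spaces count as indentation, whitespace before the first text line is discarded, and a final NEWLINE is assumed to return to column 0.

```python
from typing import List, Tuple


def tokenize_with_deltas(source: str) -> List[Tuple]:
    """Emit ("TEXT", content) and ("NEWLINE", delta, whitespace) tuples."""
    tokens: List[Tuple] = []
    prev_indent = 0
    pending_ws = ""      # EOL, blank lines and indentation since the last TEXT
    seen_text = False

    for line in source.splitlines(keepends=True):
        if not line.strip():
            pending_ws += line              # blank line: fold into the next NEWLINE
            continue
        body = line.rstrip("\r\n")
        eol = line[len(body):]
        indent = len(body) - len(body.lstrip(" "))
        if seen_text:
            # delta is the change in indentation for the logical line that follows
            tokens.append(("NEWLINE", indent - prev_indent, pending_ws + body[:indent]))
        tokens.append(("TEXT", body[indent:]))
        prev_indent, pending_ws, seen_text = indent, eol, True

    if seen_text and pending_ws:
        tokens.append(("NEWLINE", -prev_indent, pending_ws))   # back to column 0
    return tokens
```

For the three-line example above, the NEWLINE deltas come out as +4 and then -2, and it is up to the parser to decide whether a -2 inside a literal block is acceptable.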

Still, I think you'd agree that allowing and preserving all manner of crazy-looking whitespace in the tokenizer/layouter and letting the parser decide what's a literal and what's code is a bit of an unusual requirement! You certainly wouldn't want your parser to be saddled with that indentation counter if you just wanted to be able to parse Python code, for example. The way you're doing things is almost certainly the right approach for your application and many others besides. Though if anyone else has thoughts on how best to do this sort of thing, I'd obviously love to hear them....

#2


3  

I've been experimenting with this recently, and I came to the conclusion that, for my needs at least, I wanted the NEWLINEs to mark the end of each "statement", whether it was the last statement in an indented block or not, i.e. I need the NEWLINEs even before a DEDENT.

My solution was to turn it on its head: instead of NEWLINE tokens marking the end of lines, I use a LINE token to mark the start of each line.

I have a lexer that collapses empty lines (including comment-only lines) and emits a single LINE token with information about the indentation of the last line. Then my preprocessing function takes this token stream and adds INDENT or DEDENT "in between" any lines where the indentation changes. So

line1
    line2
    line3
line4

would give the token stream

LINE "line1" INDENT LINE "line2" LINE "line3" DEDENT LINE "line4" EOF

This allows me to write clear grammar productions for statements without worrying about detecting the end of statements even when they end with nested, indented, subblocks, something that can be hard if you are matching NEWLINES (and DEDENTS) instead.

Here is the core of the preprocessor, written in OCaml:

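  (* Peek at the next token without consuming it. *)
  match next_token () with
      LINE indentation ->
        if indentation > !current_indentation then
          (
            (* Deeper than before: remember the enclosing level and open a
               block. The LINE token stays in the stream for the next call. *)
            Stack.push !current_indentation indentation_stack;
            current_indentation := indentation;
            INDENT
          )
        else if indentation < !current_indentation then
          (
            (* Shallower: close one block per call until the levels match. *)
            let prev = Stack.pop indentation_stack in
              if indentation > prev then
                (
                  (* The line stops between two known levels. Put prev back
                     so that a later dedent still finds it on the stack. *)
                  Stack.push prev indentation_stack;
                  current_indentation := indentation;
                  BAD_DEDENT
                )
              else
                (
                  current_indentation := prev;
                  DEDENT
                )
          )
        else (* indentation = !current_indentation *)
          (* Same level: consume the LINE, but drop it if EOF follows. *)
          let token = remove_next_token () in
            if next_token () = EOF then
              remove_next_token ()
            else
              token
    | _ ->
        remove_next_token ()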

I haven't added support for parentheses yet, but that should be a simple extension. It does, however, avoid emitting a stray LINE at the end of the file.
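
For readers who don't use OCaml, the same loop can be sketched in Python. This is a rough rendering under assumptions of my own: tokens are tuples such as `("LINE", 4)`, `("WORD", "line2")` and `("EOF",)`, the function name `add_layout` is hypothetical, and the lexer is assumed to emit a final LINE before EOF. After a BAD_DEDENT the sketch keeps the enclosing level on the stack so that the loop can carry on.

```python
from typing import Iterator, List, Tuple

Token = Tuple  # ("LINE", indent), ("WORD", text), ("INDENT",), ("DEDENT",), ("EOF",)


def add_layout(tokens: List[Token]) -> Iterator[Token]:
    """Insert INDENT / DEDENT in front of LINE tokens whose indentation changes."""
    stack: List[int] = []      # enclosing indentation levels
    current = 0
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok[0] != "LINE":
            yield tok
            i += 1
            continue
        indent = tok[1]
        if indent > current:
            stack.append(current)
            current = indent
            yield ("INDENT",)           # LINE is looked at again on the next pass
        elif indent < current:
            prev = stack.pop()
            if indent > prev:
                stack.append(prev)      # keep the enclosing level for later dedents
                current = indent
                yield ("BAD_DEDENT",)
            else:
                current = prev
                yield ("DEDENT",)
        else:
            i += 1                      # same level: consume the LINE
            if i < len(tokens) and tokens[i][0] == "EOF":
                continue                # drop a LINE that is directly followed by EOF
            yield tok
```

Fed the four-line example from earlier, the sketch yields the kinds LINE WORD INDENT LINE WORD LINE WORD DEDENT LINE WORD EOF.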

#3


1  

A tokenizer in Ruby, for fun:

def tokenize(input)
  result, prev_indent, curr_indent, line = [""], 0, 0, ""
  line_started = false

  input.each_char do |char|

    case char
    when ' '
      if line_started
        # Content already started, add it.
        line << char
      else
        # No content yet, just count.
        curr_indent += 1
      end
    when "\n"
      result.last << line + "\n"
      curr_indent, line = 0, ""
      line_started = false
    else
      # Check if we are at the first non-space character.
      unless line_started
        # Insert indent and dedent tokens if indentation changed.
        if prev_indent > curr_indent
          # 2 spaces dedentation
          ((prev_indent - curr_indent) / 2).times do
            result << :DEDENT
          end
          result << ""
        elsif prev_indent < curr_indent
          result << :INDENT
          result << ""
        end

        prev_indent = curr_indent
      end

      # Mark line as started and add char to line.
      line_started = true; line << char
    end

  end

  # Flush a final line that has no trailing newline.
  result.last << line unless line.empty?
  result
end

This only works for two-space indentation. The result is something like ["Hello there from level 0\n", :INDENT, "This\nis level\ntwo\n", :DEDENT, "This is level0 again\n"].
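
The two-space restriction goes away if the indentation widths are kept on a stack. A line-based sketch in Python that produces the same kind of result, with plain "INDENT" / "DEDENT" strings standing in for the Ruby symbols (text chunks always end in a newline, so the two cannot be confused). The stack, the error on an inconsistent dedent, and the skipping of blank lines are additions of this sketch, not part of the Ruby version.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Group lines into chunks separated by "INDENT" / "DEDENT" markers."""
    result: List[str] = [""]
    levels = [0]                       # stack of open indentation widths
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:
            continue                   # skip blank lines (a simplification)
        indent = len(line) - len(stripped)
        if indent > levels[-1]:
            levels.append(indent)
            result += ["INDENT", ""]
        elif indent < levels[-1]:
            while indent < levels[-1]:
                levels.pop()
                result.append("DEDENT")
            if indent != levels[-1]:
                raise ValueError(f"inconsistent dedent to column {indent}")
            result.append("")
        result[-1] += stripped + "\n"
    return result
```

Any indentation width works here, as long as each dedent returns to a width that was opened earlier.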
