Matching a regular expression against a string starting from an offset

Time: 2022-01-19 21:36:58

I'm learning Rust and trying to write a simple tokenizer. I want to walk through a string, running each regular expression against the current position, create a token from whichever one matches, then skip ahead and repeat until the whole string has been processed. I know I could combine them into one larger regex and loop through the captures, but I need to process them individually for domain reasons.

However, I can't find anything in the regex crate that accepts an offset, so that I can begin matching again at a specific point.

extern crate regex;
use regex::Regex;

fn main() {

    let input = "3 + foo/4";

    let ident_re = Regex::new("[a-zA-Z][a-zA-Z0-9]*").unwrap();
    let number_re = Regex::new("[1-9][0-9]*").unwrap();
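    // '-' comes first in the class so it is a literal, not an (invalid) range.
    let ops_re = Regex::new(r"[-+*/]").unwrap();
    // '+' rather than '*', so whitespace never produces an empty match.
    let ws_re = Regex::new(r"[ \t\n\r]+").unwrap();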

    let mut i: usize = 0;

    while i < input.len() {

        // Here, check each regex for a match starting at input[i];
        // if one matches, copy the match and advance i by its length.
    }
}

The regexes I'm scanning for will also vary at runtime. Sometimes I may only be looking for a few of them, while at other times (at the top level) I might be looking for almost all of them.

1 solution

#1

The regex crate works on string slices. You can always take a sub-slice of another slice and operate on that instead. Rather than advancing an index, rebind the variable that holds your slice so that it points at the sub-slice.

fn main() {
    let mut s = "hello";
    while !s.is_empty() {
        println!("{}", s);
        s = &s[1..];
    }
}
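
As a rough sketch of how the same re-slicing idea could drive the tokenizer from the question: the code below uses only the standard library, and every name in it (`Kind`, `Matcher`, `ident`, `number`, `op`, `default_matchers`, `tokenize`) is hypothetical. Each matcher is a stand-in for one anchored pattern and reports how many bytes it matched at the very start of the slice. With the regex crate, a pattern beginning with `^` run through `Regex::find` on the remaining slice could plausibly fill the same role, with the match's `end()` giving the length to skip.

```rust
// Hypothetical token kinds for this sketch; not part of the original question.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Ident,
    Number,
    Op,
}

// A matcher returns the byte length of a match at the start of `s`, if any.
type Matcher = fn(&str) -> Option<usize>;

// Stand-in for ^[a-zA-Z][a-zA-Z0-9]*
fn ident(s: &str) -> Option<usize> {
    if !s.bytes().next()?.is_ascii_alphabetic() {
        return None;
    }
    Some(s.bytes().take_while(|b| b.is_ascii_alphanumeric()).count())
}

// Stand-in for ^[1-9][0-9]*
fn number(s: &str) -> Option<usize> {
    match s.bytes().next()? {
        b'1'..=b'9' => Some(s.bytes().take_while(|b| b.is_ascii_digit()).count()),
        _ => None,
    }
}

// Stand-in for ^[-+*/]
fn op(s: &str) -> Option<usize> {
    match s.bytes().next()? {
        b'+' | b'-' | b'*' | b'/' => Some(1),
        _ => None,
    }
}

// One possible full set; a caller can pass any subset chosen at runtime.
fn default_matchers() -> Vec<(Kind, Matcher)> {
    vec![
        (Kind::Ident, ident as Matcher),
        (Kind::Number, number as Matcher),
        (Kind::Op, op as Matcher),
    ]
}

// Walk the input by repeatedly re-pointing `rest` at a shorter sub-slice.
fn tokenize<'a>(input: &'a str, matchers: &[(Kind, Matcher)]) -> Vec<(Kind, &'a str)> {
    let mut tokens = Vec::new();
    let mut rest = input.trim_start();
    while !rest.is_empty() {
        // First matcher that succeeds at the start of `rest` wins.
        let hit = matchers
            .iter()
            .find_map(|&(kind, m)| m(rest).map(|len| (kind, len)));
        match hit {
            Some((kind, len)) if len > 0 => {
                tokens.push((kind, &rest[..len]));
                rest = rest[len..].trim_start();
            }
            // No match (or an empty one): stop instead of looping forever.
            _ => break,
        }
    }
    tokens
}
```

Under these assumptions, `tokenize("3 + foo/4", &default_matchers())` yields `Number "3"`, `Op "+"`, `Ident "foo"`, `Op "/"`, `Number "4"`. Because the matchers are passed in as a slice, the set that is tried can differ from call to call. All matchers here only consume ASCII bytes, so the computed lengths always land on character boundaries.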

Note that the slice operation slices at byte positions, not UTF-8 character positions. This allows slicing to be done in O(1) instead of O(n), but it also causes the program to panic if either index you slice from or to falls in the middle of a multi-byte UTF-8 character.
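
To make the boundary issue concrete, here is a small sketch using only standard-library methods (`char::len_utf8`, `str::get`); the helper names `skip_first_char` and `tail_at` are made up for illustration. The first advances by one whole character regardless of how many bytes it occupies, and the second returns `None` instead of panicking when the offset is not on a character boundary.

```rust
// Drop the first character of `s`, whatever its encoded width in bytes.
fn skip_first_char(s: &str) -> &str {
    match s.chars().next() {
        Some(c) => &s[c.len_utf8()..],
        None => s,
    }
}

// Non-panicking alternative to `&s[i..]`: `str::get` returns None when
// `i` is out of range or falls inside a multi-byte character.
fn tail_at(s: &str, i: usize) -> Option<&str> {
    s.get(i..)
}
```

For example, with `"\u{e9}a"` (an "é" encoded as two bytes followed by "a"), `tail_at(s, 1)` is `None` while `tail_at(s, 2)` is `Some("a")`; writing `&s[1..]` on the same string would panic.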
