I am creating a syntax highlighter, and I am using String.split to create tokens from an input string. The first issue is that String.split produces a large number of empty strings, which makes everything considerably slower than it could otherwise be.
For example, "***".split(/(\*)/) returns ["", "*", "", "*", "", "*", ""]. Is there a way to avoid this?
Another issue is the precedence of the alternatives within the regular expression itself. Say I am trying to parse a C-style multi-line comment, that is, /* comment */, and the input string is "/****/". The following regular expression works, but it produces a lot of extra tokens (and all those empty strings!):
/(\/\*|\*\/|\*)/
A better way is to read the /* and */ delimiters and then read all the remaining *'s as a single token. That is, the better result for the above string is ["/*", "**", "*/"]. However, the regular expression that should do this gives bad results. It looks like this: /(\/\*|\*\/|\*+)/.
The result of this expression, however, is ["/*", "***", "/"]. I am guessing this is because the last part is greedy, so it steals the match from the other part.
The only solution I found was to add a negative lookahead, like this:
/(\/\*|\*\/|\*+(?!\/))/
This gives the expected result, but it is much slower than the other expression, and the difference is noticeable on big strings.
Is there a solution for either of these problems?
2 solutions
#1
14
Use a lookahead to avoid empty matches:
arr = "***".split(/(?=\*)/);
//=> ["*", "*", "*"]
Or use filter(Boolean) to discard empty matches:
arr = "***".split(/(\*)/).filter(Boolean);
//=> ["*", "*", "*"]
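One caveat worth checking against your own input: the two forms above only agree when the string consists solely of delimiters. A quick sketch with a mixed string (the sample "a*b*c" is an illustration, not taken from the question) suggests that the lookahead version keeps each * attached to the text that follows it, while the capture-group version emits it as its own token:

```javascript
// Illustrative input only; not taken from the question.
const sample = "a*b*c";

// Zero-width lookahead: splits *before* each "*", so the "*" stays
// attached to the text that follows it.
const byLookahead = sample.split(/(?=\*)/);

// Capture group: each "*" separator is returned as its own item;
// filter(Boolean) then drops any empty strings.
const byCapture = sample.split(/(\*)/).filter(Boolean);

console.log(byLookahead); // [ 'a', '*b', '*c' ]
console.log(byCapture);   // [ 'a', '*', 'b', '*', 'c' ]
```

Note that filter(Boolean) removes the empty strings only after split has already created them, so it tidies the result rather than avoiding the extra allocations.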
#2
0
Generally, for tokenizing you use match, not split:
> str = "/****/"
"/****/"
> str.match(/(\/\*)(.*?)(\*\/)/)
["/****/", "/*", "**", "*/"]
Also note how the non-greedy modifier ? solves the second problem.
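To extend this to a whole input, one option is match with the g flag, which returns only the matched pieces and so never produces empty strings. The following is a minimal sketch, not a complete highlighter: the tokenize name and the token classes (comment, whitespace, word, any other single character) are assumptions chosen for illustration. It uses [\s\S]*? rather than .*? because . does not match newlines by default, and C-style comments may span several lines.

```javascript
// Hypothetical token pattern. Alternatives are tried left to right,
// so the comment rule comes first and wins over the single-character rule.
const TOKEN_RE = /\/\*[\s\S]*?\*\/|\s+|\w+|[^\s\w]/g;

function tokenize(input) {
  // match() with the g flag returns an array of all matches,
  // or null when nothing matches (e.g. for an empty string).
  return input.match(TOKEN_RE) || [];
}

console.log(tokenize("/****/"));         // [ '/****/' ]
console.log(tokenize("a /* x**y */ b")); // [ 'a', ' ', '/* x**y */', ' ', 'b' ]
```

Here the whole comment comes back as a single token, which is usually what a highlighter wants. If the delimiters are needed separately, the capture groups shown above can be applied to each comment token afterwards.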