I have to tokenize a conditional string expression.
Arithmetic operators: +, -, *, /, %
Boolean operators: &&, ||
Conditional operators: ==, >=, >, <, <=, !=
An example expression: (x+3>5*y)&&(z>=3 || k!=x)
What I want is to split this string into tokens (operators and operands).
Because some operators share characters (">" and ">=", "=" and "!="), I have problems with tokenizing.
PS1: I do not want to do complex lexical analysis. I just want to parse it simply, with regular expressions if possible.
PS2: In other words, I am looking for a regular expression that, given the sample expression without whitespace between its tokens:
(x+3>5*y)&&(z>=3 || k!=x)
will produce a string in which each token is separated by whitespace, like:
( x + 3 > 5 * y ) && ( z >= 3 || k != x )
2 Answers
#1
4
Not a regex, but a basic tokenizer that might just work (note that you don't need to do the string.Join; you can consume the IEnumerable<string> via foreach):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class Program
{
    static void Main()
    {
        // produce the tokens separated by whitespace
        string recombined = string.Join(" ", Tokenize("(x+3>5*y)&&(z>=3 || k!=x)"));
        Console.WriteLine(recombined);
        // output: ( x + 3 > 5 * y ) && ( z >= 3 || k != x )
    }

    public static IEnumerable<string> Tokenize(string input)
    {
        var buffer = new StringBuilder();
        foreach (char c in input)
        {
            if (char.IsWhiteSpace(c))
            {
                if (buffer.Length > 0)
                {
                    yield return Flush(buffer);
                }
                continue; // just skip whitespace
            }

            if (IsOperatorChar(c))
            {
                if (buffer.Length > 0)
                {
                    // we have back-buffer; could be a>b, but could be >=
                    // need to check if there is a combined operator candidate
                    if (!CanCombine(buffer, c))
                    {
                        yield return Flush(buffer);
                    }
                }
                buffer.Append(c);
                continue;
            }

            // so here, the new character is *not* an operator; if we have
            // a back-buffer that *is* operators, yield that
            if (buffer.Length > 0 && IsOperatorChar(buffer[0]))
            {
                yield return Flush(buffer);
            }

            // append
            buffer.Append(c);
        }

        // out of chars... anything left?
        if (buffer.Length != 0)
            yield return Flush(buffer);
    }

    // returns the buffered token and empties the buffer
    static string Flush(StringBuilder buffer)
    {
        string s = buffer.ToString();
        buffer.Clear();
        return s;
    }

    static readonly string[] operators = { "+", "-", "*", "/", "%", "=", "&&", "||", "==", ">=", ">", "<", "<=", "!=", "(", ")" };
    static readonly char[] opChars = operators.SelectMany(x => x.ToCharArray()).Distinct().ToArray();

    static bool IsOperatorChar(char newChar)
    {
        return Array.IndexOf(opChars, newChar) >= 0;
    }

    // true if buffer + c is a prefix of some known operator
    static bool CanCombine(StringBuilder buffer, char c)
    {
        foreach (var op in operators)
        {
            if (op.Length <= buffer.Length) continue;
            // check starts with same plus this one
            bool startsWith = true;
            for (int i = 0; i < buffer.Length; i++)
            {
                if (op[i] != buffer[i])
                {
                    startsWith = false;
                    break;
                }
            }
            if (startsWith && op[buffer.Length] == c) return true;
        }
        return false;
    }
}
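The remark about consuming the IEnumerable<string> via foreach can be illustrated with a shorter variation. The following is only a sketch, not the answer's code: it swaps the buffer-based approach for a "try the longest operator first" scan, and the names GreedyTokenizer and MatchOperator are made up for illustration. It assumes that anything which is neither whitespace nor a known operator belongs to an operand.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GreedyTokenizer
{
    // Sorted longest first, so ">=" is tried before ">" (OrderByDescending is a stable sort).
    static readonly string[] Operators =
        new[] { "+", "-", "*", "/", "%", "=", "&&", "||", "==", ">=", ">", "<", "<=", "!=", "(", ")" }
        .OrderByDescending(op => op.Length)
        .ToArray();

    // Returns the operator that starts at position i, or null if there is none.
    static string MatchOperator(string input, int i)
    {
        return Operators.FirstOrDefault(
            op => string.CompareOrdinal(input, i, op, 0, op.Length) == 0);
    }

    public static IEnumerable<string> Tokenize(string input)
    {
        int i = 0;
        while (i < input.Length)
        {
            if (char.IsWhiteSpace(input[i])) { i++; continue; }

            string op = MatchOperator(input, i);
            if (op != null)
            {
                yield return op;
                i += op.Length;
                continue;
            }

            // Operand: read up to the next whitespace or operator.
            int start = i;
            while (i < input.Length
                   && !char.IsWhiteSpace(input[i])
                   && MatchOperator(input, i) == null)
            {
                i++;
            }
            yield return input.Substring(start, i - start);
        }
    }

    public static void Main()
    {
        // Tokens are produced lazily, one per loop iteration; no string.Join needed.
        foreach (string token in Tokenize("(x+3>5*y)&&(z>=3 || k!=x)"))
            Console.WriteLine(token);
    }
}
```

Because the operator list is scanned longest first, there is no need to track whether a partially read operator can still grow; the trade-off is that a stray "&" or "!" ends up inside an operand instead of becoming its own token.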
#2
1
If you can predefine all the operators that you're going to use, something like this might work for you.
Be sure to put the double-character operators earlier in the regex, so that it tries to match '<=' before it tries to match '<'.
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        // two-character operators first, then numbers, then single characters
        string pattern = "==|!=|<=|>=|\\|\\||&&|\\d+|[a-z()+\\-*/%<>]";
        string sentence = "(x+35>5*y)&&(z>=3 || k!=x)";
        foreach (Match match in Regex.Matches(sentence, pattern))
            Console.WriteLine("Found '{0}' at position {1}",
                              match.Value, match.Index);
    }
}
Output:
Found '(' at position 0
Found 'x' at position 1
Found '+' at position 2
Found '35' at position 3
Found '>' at position 5
Found '5' at position 6
Found '*' at position 7
Found 'y' at position 8
Found ')' at position 9
Found '&&' at position 10
Found '(' at position 12
Found 'z' at position 13
Found '>=' at position 14
Found '3' at position 16
Found '||' at position 18
Found 'k' at position 21
Found '!=' at position 22
Found 'x' at position 24
Found ')' at position 25
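To get the single whitespace-separated string the question asks for, the matches can be joined instead of printed. This is a minimal sketch along the same lines as the answer, with a few assumptions that are not in the original pattern: "==", "%", a lone "=" or "!", and multi-letter identifiers are accepted, and the names SpacedTokens and Separate are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SpacedTokens
{
    // Two-character operators come first so they win over their one-character prefixes.
    // Then numbers, identifiers, and finally single-character operators and parentheses.
    const string Pattern = @"==|!=|<=|>=|\|\||&&|\d+|[A-Za-z_]\w*|[()+\-*/%<>=!]";

    public static string Separate(string expression)
    {
        var tokens = Regex.Matches(expression, Pattern)
                          .Cast<Match>()
                          .Select(m => m.Value);
        return string.Join(" ", tokens);
    }

    public static void Main()
    {
        Console.WriteLine(Separate("(x+3>5*y)&&(z>=3 || k!=x)"));
        // prints: ( x + 3 > 5 * y ) && ( z >= 3 || k != x )
    }
}
```

Note that Regex.Matches silently skips any character the pattern does not cover, so unexpected input is dropped rather than reported.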