For example, given the string "2009/11/12" I want to get the regex ("\d{4}/\d{2}/\d{2}"), so I'll be able to match "2001/01/02" too.
Is there something that does that? Something similar? Any ideas as to how to do it?
11 solutions
#1
25
There is text2re, a free web-based "regex by example" generator.
I don't think this is available in source code, though. I dare to say there is no automatic regex generator that gets it right without user intervention, since this would require the machine knowing what you want.
Note that text2re uses a template-based, modularized and very generalized approach to regular expression generation. The expressions it generates work, but they are much more complex than the equivalent hand-crafted expression. It is not a good tool to learn regular expressions because it does a pretty lousy job at setting examples.
For instance, the string "2009/11/12" would be recognized as a yyyymmdd pattern, which is helpful. The tool transforms it into this 125-character monster:
((?:(?:[1]{1}\d{1}\d{1}\d{1})|(?:[2]{1}\d{3}))[-:\/.](?:[0]?[1-9]|[1][012])[-:\/.](?:(?:[0-2]?\d{1})|(?:[3][01]{1})))(?![\d])
A hand-written pattern for the same job takes up little more than two fifths of that (54 characters):
([12]\d{3})[-:/.](0?\d|1[0-2])[-:/.]([0-2]?\d|3[01])\b
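To sanity-check the two expressions above against each other, a small Java sketch along these lines may help. The class name `DatePatternCheck`, its helper methods, and the sample strings are made up for illustration; only the two patterns are taken from the answer. Running it suggests the shorter pattern is slightly more permissive: it accepts a month of "00", which the generated one rejects, so the two are close but not strictly identical.

```java
import java.util.regex.Pattern;

public class DatePatternCheck {
    // The pattern produced by text2re, copied from the answer.
    static final Pattern GENERATED = Pattern.compile(
        "((?:(?:[1]{1}\\d{1}\\d{1}\\d{1})|(?:[2]{1}\\d{3}))[-:\\/.]"
        + "(?:[0]?[1-9]|[1][012])[-:\\/.]"
        + "(?:(?:[0-2]?\\d{1})|(?:[3][01]{1})))(?![\\d])");

    // The hand-written pattern, copied from the answer.
    static final Pattern HAND_MADE = Pattern.compile(
        "([12]\\d{3})[-:/.](0?\\d|1[0-2])[-:/.]([0-2]?\\d|3[01])\\b");

    // True if the whole string matches the generated pattern.
    static boolean generatedMatches(String s) {
        return GENERATED.matcher(s).matches();
    }

    // True if the whole string matches the hand-written pattern.
    static boolean handMadeMatches(String s) {
        return HAND_MADE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Illustrative samples only; the last one exposes the difference.
        String[] samples = {
            "2009/11/12", "2001/01/02", "2009.11.12", "2009/13/12", "2009/00/12"
        };
        for (String s : samples) {
            System.out.printf("%-12s generated=%b hand-made=%b%n",
                s, generatedMatches(s), handMadeMatches(s));
        }
    }
}
```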
#2
11
It's not possible to write a general solution for your problem. The trouble is that any generator probably wouldn't know what you want to check for, e.g. should "2312/45/67" be allowed too? What about "2009.11.12"?
What you could do is write such a generator yourself that is suited for your exact problem, but a general solution won't be possible.
#3
3
I've tried a very naive approach:
import java.util.regex.Pattern;

class RegexpGenerator {
    // Characters with a special meaning in a regex; they must be escaped to match literally.
    private static final String METACHARACTERS = "\\.[]{}()*+-?^$|";

    public static Pattern generateRegexp(String prototype) {
        return Pattern.compile(generateRegexpFrom(prototype));
    }

    private static String generateRegexpFrom(String prototype) {
        StringBuilder stringBuilder = new StringBuilder();
        for (int i = 0; i < prototype.length(); i++) {
            char c = prototype.charAt(i);
            if (Character.isDigit(c)) {
                stringBuilder.append("\\d");
            } else if (Character.isLetter(c)) {
                stringBuilder.append("\\w");
            } else { // fall-through: literal
                if (METACHARACTERS.indexOf(c) >= 0) {
                    stringBuilder.append('\\'); // otherwise '.' would match any character
                }
                stringBuilder.append(c);
            }
        }
        return stringBuilder.toString();
    }

    // Prints the generated pattern and checks that it matches its own prototype.
    private static void test(String prototype) {
        Pattern pattern = generateRegexp(prototype);
        System.out.println(String.format("%s -> %s", prototype, pattern));
        if (!pattern.matcher(prototype).matches()) {
            throw new AssertionError();
        }
    }

    public static void main(String[] args) {
        String[] prototypes = {
            "2009/11/12",
            "I'm a test",
            "me too!!!",
            "124.323.232.112",
            "ISBN 332212"
        };
        for (String prototype : prototypes) {
            test(prototype);
        }
    }
}

output:

2009/11/12 -> \d\d\d\d/\d\d/\d\d
I'm a test -> \w'\w \w \w\w\w\w
me too!!! -> \w\w \w\w\w!!!
124.323.232.112 -> \d\d\d\.\d\d\d\.\d\d\d\.\d\d\d
ISBN 332212 -> \w\w\w\w \d\d\d\d\d\d
As already outlined by others, a general solution to this problem is impossible. This class is applicable only in a few contexts.
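The naive per-character idea can be pushed a little further to get the compact form the question asks for, by collapsing runs of the same character class into a counted quantifier. The following is only a sketch under that assumption; the class name `CompactRegexpGenerator` and its methods are hypothetical and not part of the answer above.

```java
import java.util.regex.Pattern;

public class CompactRegexpGenerator {
    // Characters with a special meaning in a regex.
    private static final String METACHARACTERS = "\\.[]{}()*+-?^$|";

    // Maps one character to the regex token that stands in for it.
    private static String tokenFor(char c) {
        if (c >= '0' && c <= '9') {
            return "\\d";
        }
        if (Character.isLetter(c)) {
            return "\\p{L}"; // any Unicode letter
        }
        // Anything else is kept as a literal, escaped when necessary.
        return METACHARACTERS.indexOf(c) >= 0 ? "\\" + c : String.valueOf(c);
    }

    // Builds a regex in which runs of identical tokens become token{n}.
    static String generate(String prototype) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < prototype.length()) {
            String token = tokenFor(prototype.charAt(i));
            int run = 1;
            while (i + run < prototype.length()
                    && tokenFor(prototype.charAt(i + run)).equals(token)) {
                run++;
            }
            regex.append(token);
            if (run > 1) {
                regex.append('{').append(run).append('}');
            }
            i += run;
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String regex = generate("2009/11/12");
        System.out.println(regex); // prints \d{4}/\d{2}/\d{2}
        System.out.println(Pattern.matches(regex, "2001/01/02")); // prints true
    }
}
```

Like the original class, this still guesses blindly: it cannot know whether four digits mean "a year" or "any four digits".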
#4
3
Excuse me, but what you all call impossible is clearly an achievable task. It will not be able to give results for ALL examples, and maybe not the best results, but you can give it various hints, and it will make life easy. A few examples will follow.
Also, a readable output translating the result would be very useful. Something like:
- Search for: a word starting with a non-numeric letter and ending with the string "ing".
- or: Search for: text that has bbb in it, followed somewhere by zzz
- or: Search for: a pattern which looks like "aa/bbbb/cccc" where "/" is a separator, "aa" is two digits, "bbbb" is a word of any length and "cccc" is a four-digit number between 1900 and 2020
Maybe we could make a "back translator" with an SQL type of language to create regex, instead of creating it in geekish.
Here are a few examples that are doable:
class Hint:
    Properties: HintType, HintString

enum HintType { Separator, ParamDescription, NumberOfParameters }
enum SampleType { FreeText, DateOrTime, Formatted, ... }

public string RegexBySamples( List<T> samples,
        List<SampleType> sampleTypes,
        List<Hint> hints,
        out string GeneralRegExp, out string description,
        out string generalDescription)...

regex = RegexBySamples( {"11/November/1999", "2/January/2003"},
        SampleType.DateOrTime,
        new HintList( HintType.NumberOfParameters, 3 ));

regex = RegexBySamples( "123-aaaaJ-1444",
        SampleType.Formatted, HintType.Separator, "-" );
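To make the hint idea a bit more concrete, here is a minimal Java sketch of just one hint, the separator. It is not an implementation of the API outlined above: the class `SeparatorHintGenerator` and the method `regexBySamples` are hypothetical, the separator is assumed to be a single character, and every sample is assumed to have the same number of fields. Each field is described as digits (with the observed length range), letters of any length, or "anything but the separator".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SeparatorHintGenerator {
    // Builds {n} or {min,max} from the field lengths seen in the samples.
    private static String quantifier(int min, int max) {
        return min == max ? "{" + min + "}" : "{" + min + "," + max + "}";
    }

    // Generalizes the samples field by field, using the separator hint.
    static String regexBySamples(List<String> samples, char separator) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        // Escaped form of the separator for use inside the generated regex.
        String sep = Character.isLetterOrDigit(separator)
            ? String.valueOf(separator) : "\\" + separator;
        List<String[]> rows = new ArrayList<>();
        for (String sample : samples) {
            rows.add(sample.split(Pattern.quote(String.valueOf(separator)), -1));
        }
        int fields = rows.get(0).length;
        StringBuilder regex = new StringBuilder();
        for (int f = 0; f < fields; f++) {
            boolean allDigits = true;
            boolean allLetters = true;
            int min = Integer.MAX_VALUE;
            int max = 0;
            for (String[] row : rows) {
                if (row.length != fields) {
                    throw new IllegalArgumentException("samples disagree on field count");
                }
                allDigits &= row[f].matches("[0-9]+");
                allLetters &= row[f].matches("\\p{L}+");
                min = Math.min(min, row[f].length());
                max = Math.max(max, row[f].length());
            }
            if (f > 0) {
                regex.append(sep);
            }
            if (allDigits) {
                regex.append("\\d").append(quantifier(min, max));
            } else if (allLetters) {
                regex.append("\\p{L}+"); // a word of any length
            } else {
                regex.append("[^").append(sep).append("]+");
            }
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        System.out.println(regexBySamples(
            Arrays.asList("11/November/1999", "2/January/2003"), '/'));
        System.out.println(regexBySamples(
            Arrays.asList("123-aaaaJ-1444"), '-'));
    }
}
```

Note how the second date sample is what turns the day field into `\d{1,2}`; with a single sample the generator could only assume a fixed width.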
A GUI in which you mark or enter sample text and add it to the regex would be possible too. First you mark a date (the "sample") and choose whether this text is already formatted or whether you are building a format, and also what the format type is: free text, formatted text, date, GUID, or "Choose..." from existing formats (which you can store in a library).
Let's design a spec for this and make it open source... Anybody want to join?
#5
1
No, you cannot get a regex that reliably matches what you want, since the regex would not contain semantic information about the input (i.e. the generator would need to know it's generating a regex for dates). If the issue is with dates only, I would recommend trying multiple regular expressions and seeing whether one of them matches all of your samples.
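The "try several expressions" suggestion could look roughly like the sketch below. The candidate list, the class name `CandidatePicker`, and the method `firstMatchingAll` are assumptions made up for illustration; a real list of date formats would be longer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

public class CandidatePicker {
    // A small, hypothetical set of hand-written date patterns to try in order.
    static final List<String> DATE_CANDIDATES = Arrays.asList(
        "\\d{2}/\\d{2}/\\d{4}",
        "\\d{4}/\\d{2}/\\d{2}",
        "\\d{4}-\\d{2}-\\d{2}",
        "\\d{1,2}\\.\\d{1,2}\\.\\d{4}");

    // Returns the first candidate that fully matches every sample, if any.
    static Optional<String> firstMatchingAll(List<String> candidates, List<String> samples) {
        return candidates.stream()
            .filter(regex -> samples.stream().allMatch(s -> Pattern.matches(regex, s)))
            .findFirst();
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("2009/11/12", "2001/01/02");
        // prints \d{4}/\d{2}/\d{2}
        System.out.println(firstMatchingAll(DATE_CANDIDATES, samples).orElse("no candidate fits"));
    }
}
```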
#6
1
I'm not sure if this is possible, at least not without many sample strings and some learning algorithm.
There are many regexes that would match, and it's not possible for a simple algorithm to pick the 'right' one. You'd need to give it some delimiters or other things to look for, so you might as well just write the regex yourself.
#7
1
Sounds like a machine learning problem. You'll need more than one example on hand (many more) and an indication of whether each example is considered a match.
#8
1
Loreto pretty much does this. It's an open source implementation that uses the longest common substring(s) to generate the regular expressions. It needs multiple examples, of course.
#9
0
I don't remember the name, but if my theory of computation cells serve me right, it's impossible in theory :)
#10
0
I haven't found anything that does it, but since the problem domain is relatively small (you'd be surprised how many people use the weirdest date formats), I've been able to write some kind of a "date regular expression generator". Once I'm satisfied with the unit tests, I'll publish it, just in case someone ever needs something of the kind.
Thanks to everyone who answered (the guy with the (.*) excluded; jokes are great, but this one was sssssssssoooo lame :) )
#11
0
In addition to feeding the learning algorithm examples of "good" input, you could feed it "bad" input so it would know what not to look for. No letters in a phone number, for example.
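A very small stand-in for that idea, without any actual learning, is to keep only those candidate expressions that accept every "good" example and reject every "bad" one. The sketch below assumes a hand-supplied candidate list; the class `ExampleFilter`, the method `consistentWith`, and the phone-number samples are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ExampleFilter {
    // Keeps the candidates that match all good examples and none of the bad ones.
    static List<String> consistentWith(List<String> candidates,
                                       List<String> good, List<String> bad) {
        List<String> result = new ArrayList<>();
        for (String regex : candidates) {
            Pattern pattern = Pattern.compile(regex);
            boolean acceptsGood = good.stream().allMatch(s -> pattern.matcher(s).matches());
            boolean rejectsBad = bad.stream().noneMatch(s -> pattern.matcher(s).matches());
            if (acceptsGood && rejectsBad) {
                result.add(regex);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Candidates ordered from loosest to strictest (illustrative only).
        List<String> candidates = Arrays.asList("[\\w-]+", "[\\d-]+", "\\d{3}-\\d{4}");
        List<String> good = Arrays.asList("555-1234", "555-9876");
        List<String> bad = Arrays.asList("555-CALL", "5551234");
        // "555-CALL" rules out the first candidate, "5551234" the second.
        System.out.println(consistentWith(candidates, good, bad));
    }
}
```

Each bad example prunes the candidates that are too loose, which is exactly the information a generator cannot get from good examples alone.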