正则表达式对给定有限代表字符串列表的语法推断?

时间:2022-03-11 11:40:24

I'm working on analyzing a large public dataset with lots of verbose human-readable strings that were clearly generated by some regular (in the formal language theory sense) grammar.

我正在分析一个大型公共数据集,其中包含大量冗长的人类可读字符串,这些字符串显然是由一些常规(在正式语言理论意义上)语法生成的。

It's not too hard to look at sets of these strings one by one to see the patterns; unfortunately, there's about 24,000 of these unique strings broken up into 33 categories and 1714 subcategories, so it's somewhat painful to do this manually.

看一组一组的这些弦就不难看出规律了;不幸的是,大约有24000个这种独特的字符串被分成了33个类别和1714个子类别,所以手工操作有点痛苦。

Basically, I'm looking for an existing algorithm (preferably with an existing reference implementation) to take an arbitrary list of strings and try to infer some minimal (for some reasonable definition of minimal) spanning set of regular expressions that can be used to generate them (i.e. infer a regular grammar from a finite set of strings from the language generated by that grammar).

基本上,我在找现有的算法与现有的参考实现(最好)任意的字符串列表,试图推断出一些最小(最小的一些合理的定义)生成的正则表达式,可用于生成它们(即推断从有限的一组字符串正则文法生成的语言语法)。

I've considered doing repeated greedy longest common substring elimination, but that only goes so far because it won't collapse anything but exact matches, so won't detect, say, a common pattern of varying numerical strings at a particular position in the grammar.

我已经考虑过重复贪婪的最长公共子串消除,但这只是到目前为止,因为它不会崩溃任何东西,而是精确的匹配,所以不会检测,比如说,在语法中某个特定位置上的不同数字字符串的常见模式。

Brute forcing anything that doesn't fall out of common substring elimination is possible, but probably computationally unfeasible. (Furthermore, I've thought about it and there might be a "phase ordering" and/or "local minimum" issue with substring elimination, since you might make a greedy substring match that ends up forcing the final grammar to be less compressed/minimal even though it appears to be the best reduction initially).

强制执行任何没有从常见子字符串排除中掉出来的东西是可能的,但在计算上可能是不可行的。(而且,我已经考虑过了,可能会有一个“阶段排序”和/或“局部最小”问题,并带有子字符串消除,因为您可能会生成一种令人讨厌的子字符串匹配,最终导致最终的语法被压缩/最小化,即使它最初看起来是最好的还原)。

2 个解决方案

#1


19  

Yes, it turns out this does exist; what is required is what is known academically as a DFA Learning algorithm, examples of which include:

是的,这确实存在;所需要的是学术上所知的DFA学习算法,其示例包括:

  • Angluin's L*
  • Angluin的L *
  • L* (adding counter-examples to columns)
  • L*(在列中添加反例)
  • Kearns / Vazirani
  • 卡恩斯/ Vazirani
  • Rivest / Schapire
  • 里维斯特/ Schapire
  • NL*
  • 问*
  • Regular positive negative inference (RPNI)
  • 常规正否定推理(RPNI)
  • DeLeTe2
  • DeLeTe2
  • Biermann & Feldman's algorithm
  • 比尔曼和费尔德曼的算法
  • Biermann & Feldman's algorithm (using SAT-solving)
  • Biermann & Feldman的算法(使用sat -求解)

Source for the above is libalf, an open-source automata learning algorithm framework in C++; descriptions of at least some of these algorithms can be found in this textbook, among others. There are also implementations of grammatical inference algorithms (including DFA learning) in gitoolbox for MATLAB.

以上内容的来源是libalf,一个c++的开源自动机学习算法框架;至少这些算法的描述可以在这本教科书中找到,还有其他的。在MATLAB的gitoolbox中还实现了语法推理算法(包括DFA学习)。

Since this question has come up before and has not been satisfactorily answered in the past, I am in the process of evaluating these algorithms and will update will more information about how useful they are, unless someone with more expertise in the area does first (which is preferable).

由于这个问题以前曾出现过,而且过去也没有得到满意的回答,所以我正在评估这些算法,并将更新更多关于它们有多有用的信息,除非有更精通该领域的人先这么做(这更好)。

NOTE: I am accepting my own answer for now but will gladly accept a better one if someone can provide one.

注:我现在接受自己的答案,但如果有人能提供一个更好的答案,我会欣然接受。

FURTHER NOTE: I've decided to go with the route of using custom code, since using a generic algorithm turns out to be a bit overkill for the data I'm working with. I'm leaving this answer here in case someone else needs it, and will update if I ever do evaluate these.

进一步注意:我决定采用使用自定义代码的方式,因为使用通用算法对我正在处理的数据来说有点过头了。我在这里留下这个答案,以防有人需要它,如果我计算这些,我会更新。

#2


0  

The only thing I can suggest is to play around with Nltk (Natural Language Toolkit for Python) a bit and see if it can at least recognize recurring patterns.

我唯一能建议的就是使用Nltk (Python的自然语言工具包),看看它是否至少能识别重复的模式。

Another thing you may look into is MALLET (Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction etc.)

您还可以查看MALLET(基于java的包,用于统计自然语言处理、文档分类、集群、主题建模、信息提取等)。

Perl has something called LinkParser but it seems to require you to provide a representation of the actual grammar (on the other hand, it comes with a large set of different models so maybe it could be shoehorned to help you sorting your samples).

Perl有一种叫做LinkParser的东西,但它似乎要求您提供实际语法的表示(另一方面,它提供了大量不同的模型,因此它可能可以帮助您对示例进行排序)。

Gate may allow you to create examples from a subset of records in your corpus and possibly reverse engineer the grammar from those.

Gate允许您从语料库中的记录子集创建示例,并可能对语法进行反向工程。

Finally, have a look at the CRAN repository for text-specific packages.

最后,看看针对文本的包的CRAN存储库。

#1


19  

Yes, it turns out this does exist; what is required is what is known academically as a DFA Learning algorithm, examples of which include:

是的,这确实存在;所需要的是学术上所知的DFA学习算法,其示例包括:

  • Angluin's L*
  • Angluin的L *
  • L* (adding counter-examples to columns)
  • L*(在列中添加反例)
  • Kearns / Vazirani
  • 卡恩斯/ Vazirani
  • Rivest / Schapire
  • 里维斯特/ Schapire
  • NL*
  • 问*
  • Regular positive negative inference (RPNI)
  • 常规正否定推理(RPNI)
  • DeLeTe2
  • DeLeTe2
  • Biermann & Feldman's algorithm
  • 比尔曼和费尔德曼的算法
  • Biermann & Feldman's algorithm (using SAT-solving)
  • Biermann & Feldman的算法(使用sat -求解)

Source for the above is libalf, an open-source automata learning algorithm framework in C++; descriptions of at least some of these algorithms can be found in this textbook, among others. There are also implementations of grammatical inference algorithms (including DFA learning) in gitoolbox for MATLAB.

以上内容的来源是libalf,一个c++的开源自动机学习算法框架;至少这些算法的描述可以在这本教科书中找到,还有其他的。在MATLAB的gitoolbox中还实现了语法推理算法(包括DFA学习)。

Since this question has come up before and has not been satisfactorily answered in the past, I am in the process of evaluating these algorithms and will update will more information about how useful they are, unless someone with more expertise in the area does first (which is preferable).

由于这个问题以前曾出现过,而且过去也没有得到满意的回答,所以我正在评估这些算法,并将更新更多关于它们有多有用的信息,除非有更精通该领域的人先这么做(这更好)。

NOTE: I am accepting my own answer for now but will gladly accept a better one if someone can provide one.

注:我现在接受自己的答案,但如果有人能提供一个更好的答案,我会欣然接受。

FURTHER NOTE: I've decided to go with the route of using custom code, since using a generic algorithm turns out to be a bit overkill for the data I'm working with. I'm leaving this answer here in case someone else needs it, and will update if I ever do evaluate these.

进一步注意:我决定采用使用自定义代码的方式,因为使用通用算法对我正在处理的数据来说有点过头了。我在这里留下这个答案,以防有人需要它,如果我计算这些,我会更新。

#2


0  

The only thing I can suggest is to play around with Nltk (Natural Language Toolkit for Python) a bit and see if it can at least recognize recurring patterns.

我唯一能建议的就是使用Nltk (Python的自然语言工具包),看看它是否至少能识别重复的模式。

Another thing you may look into is MALLET (Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction etc.)

您还可以查看MALLET(基于java的包,用于统计自然语言处理、文档分类、集群、主题建模、信息提取等)。

Perl has something called LinkParser but it seems to require you to provide a representation of the actual grammar (on the other hand, it comes with a large set of different models so maybe it could be shoehorned to help you sorting your samples).

Perl有一种叫做LinkParser的东西,但它似乎要求您提供实际语法的表示(另一方面,它提供了大量不同的模型,因此它可能可以帮助您对示例进行排序)。

Gate may allow you to create examples from a subset of records in your corpus and possibly reverse engineer the grammar from those.

Gate允许您从语料库中的记录子集创建示例,并可能对语法进行反向工程。

Finally, have a look at the CRAN repository for text-specific packages.

最后,看看针对文本的包的CRAN存储库。