Which regular expression features does Solr edismax support?

Date: 2021-11-02 19:56:39

Regular expressions allow the pattern matching syntax shown below. I'm trying to build a powerful search tool that supports as many of these patterns as possible. I'm told that edismax is the most flexible tool for the job. Which of the pattern matching expressions below can be accomplished with edismax? Can I do better than edismax? Can you suggest which filters and parser patches I might use to work towards this functionality? Am I dreaming if I think Solr can achieve acceptable performance (i.e. server-side processing time) for these kinds of searches?

Regular expression syntax and examples, from MySQL:

  1. ^ matches the beginning of the string. 'fofo' REGEXP '^fo' => true
  2. $ matches the end of the string. 'fo\no' REGEXP '^fo\no$' => true
  3. * 0-unlimited wildcard. 'Baaaan' REGEXP 'Ba*n' => true
  4. ? 0-1 wildcard. 'Baan' REGEXP '^Ba?n' => false
  5. + 1-unlimited wildcard. 'Bn' REGEXP 'Ba+n' => false
  6. | or. 'pi' REGEXP 'pi|apa' => true
  7. ()* sequence match. 'pipi' REGEXP '^(pi)*$' => true
  8. [a-dX], [^a-dX] character range/set. 'aXbc' REGEXP '[a-dXYZ]' => true
  9. {n} or {m,n} cardinality notation. 'abcde' REGEXP 'a[bcd]{3}e' => true
  10. [:character_class:] 'justalnums' REGEXP '[[:alnum:]]+' => true
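
As a quick sanity check of the expected results in the list, here is a minimal sketch in Python. It is not MySQL: Python's `re` module is a different dialect, so `re.search` stands in for `REGEXP` (both look for a match anywhere in the string), and `[A-Za-z0-9]` is only an approximate substitute for the POSIX class `[[:alnum:]]`, which `re` does not support. The `CASES` table and the `check` helper are names made up for this illustration.

```python
import re

# (subject, pattern, expected result) taken from the MySQL examples.
CASES = [
    ("fofo", r"^fo", True),
    ("fo\no", r"^fo\no$", True),
    ("Baaaan", r"Ba*n", True),
    ("Baan", r"^Ba?n", False),
    ("Bn", r"Ba+n", False),
    ("pi", r"pi|apa", True),
    ("pipi", r"^(pi)*$", True),
    ("aXbc", r"[a-dXYZ]", True),
    ("abcde", r"a[bcd]{3}e", True),
    ("justalnums", r"[A-Za-z0-9]+", True),  # stand-in for [[:alnum:]]+
]


def check(subject, pattern):
    """Return True if the pattern matches anywhere in the subject."""
    return re.search(pattern, subject) is not None


results = [check(subject, pattern) for subject, pattern, _ in CASES]
for (subject, pattern, expected), got in zip(CASES, results):
    print(f"{subject!r} ~ {pattern!r}: {got} (list says {expected})")
```

Any engine you evaluate (edismax or otherwise) can be checked against the same table, which makes it easier to see which of the ten features it actually covers.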

2 个解决方案

#1


15  

Version 4.0 of Lucene will support regex queries directly in the standard query parser using special syntax. I verified that it works on an instance of Solr I am running, built from the subversion trunk in February.

Jira ticket 2604 describes the extension of the standard query parser with a special regex syntax that uses forward slashes to delimit the regex, similar to regex literals in JavaScript. It seems to use RegexpQuery under the hood.

So a brief example:

body:/[0-9]{5}/

will match on a five-digit zip code in the textual corpus I have indexed. But, oddly, body:/\d{5}/ did not work for me, and ^ failed as well.

The regex dialect would have to be Java's, but I'm not sure if everything in it works, since I have only done a cursory examination. One would probably have to look carefully at the RegexpQuery code to understand what works and what doesn't.
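
One possible explanation for the `^` and `\d` failures, offered as my understanding rather than something verified against the trunk build mentioned above: Lucene's RegexpQuery uses its own regex syntax rather than `java.util.regex`, and it matches the pattern against each whole indexed term, so anchors are implicit. The sketch below is a toy model of that behavior in Python, not Lucene code. Whitespace splitting stands in for a real analyzer, `matching_terms` is a hypothetical helper, and the field name `body` is taken from the example. It also shows how the slash-delimited query might be URL-encoded for an HTTP request.

```python
import re
from urllib.parse import urlencode

# URL-encode the slash-delimited regex query for use in a request string.
params = urlencode({"q": "body:/[0-9]{5}/"})
print(params)


def matching_terms(text, pattern):
    """Toy whole-term matching: the pattern must cover an entire term."""
    return [term for term in text.split() if re.fullmatch(pattern, term)]


doc = "Ship to Springfield 62704 by 2021"
print(matching_terms(doc, r"[0-9]{5}"))  # only the five-digit term qualifies
print(matching_terms(doc, r"[0-9]{4}"))  # 62704 is not a four-digit term
```

Under this model a pattern such as `[0-9]{4}` does not match inside `62704`, which is the behavior you would otherwise ask for with `^` and `$`.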

#2


4  

Regular expressions and (e)dismax are not really comparable. Dismax is meant to work directly with common end-user input, while regular expressions are not typical end-user input.

Also, matching regular-expression-like things with dismax depends largely on text analysis settings and schema design, not on dismax itself. With Solr you typically tailor the schema and text analysis to the concrete search need, possibly doing much of the work at index-time. Regular expressions are at odds with this and even with the basic structure of Lucene inverted indices.
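
To illustrate why regular expressions sit awkwardly with an inverted index, here is a toy sketch in Python. It is not how Lucene is implemented: the documents, the whitespace tokenizer, and the `term_query` and `regex_query` helpers are all made up for this example. The point is only that an exact term is a single dictionary lookup, while a naive regex has to be tested against every distinct term before any postings can be merged.

```python
import re
from collections import defaultdict

# Hypothetical corpus: doc id -> text.
docs = {
    1: "red apple pie",
    2: "green apple tart",
    3: "apricot jam",
}

# Build the inverted index: term -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)


def term_query(term):
    """Exact term: one lookup, independent of vocabulary size."""
    return sorted(index.get(term, set()))


def regex_query(pattern):
    """Naive regex: test every distinct term, then merge the postings."""
    compiled = re.compile(pattern)
    hits = set()
    for term, postings in index.items():
        if compiled.fullmatch(term):
            hits |= postings
    return sorted(hits)


print(term_query("apple"))
print(regex_query(r"ap.*"))
```

As far as I know, Lucene's RegexpQuery is smarter than this scan (it compiles the pattern to an automaton and walks the term dictionary with it), but the cost still grows with the number of terms the pattern could match, which is why work done at index time tends to pay off.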

Still, Lucene provides RegexQuery and the newer RegexpQuery. As far as I know, these are not integrated with Solr, but they could be. Start a new item in the Solr issue tracker and happy coding! :)

Keep in mind that regex queries will probably always be slow... but they could have acceptable performance in your case.
