Parsing Lucene query syntax and escaping it for CloudSearch

Basically, I have an application that needs to support both Lucene.NET and Amazon CloudSearch.

So I can't rewrite the queries; I need to use the standard Lucene queries and call .ToString() on the query to get the syntax.

The issue is that in Lucene.NET (I don't know if this is the same in the Java version), the .ToString() method returns the raw string without the escape characters.

Therefore, things like:

(title:blah:blah summary:"lala:la")

should be

(title:blah\:blah summary:"lala\:la")

What I need is a regex that will add the escapes.

Is this possible? If so, what would it look like?

Some additional possible variations:

(title:"this is a search:term")
(field5:"this is a title:term")

2 Answers

#1

Based on the comments and edits, it seems you want the regex to correctly escape any query string, and the resulting string to accurately represent any given Lucene query.

That ain't gonna happen.

Lucene query syntax is not capable of expressing all Lucene queries. In fact, the string you get from Query.toString() often can't even be parsed by the QueryParser, never mind being an accurate reconstruction of the query.

The long and short of it: you are going about this the wrong way. Query.ToString() is not designed to serialize the query, and its goal is not to produce a parsable query string. It's mainly for debugging and such. If you keep trying to use it this way, this tomfoolery of using a regex to escape ambiguous query syntax will likely be just the start of your troubles.

This question provides another example of this.
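
To make the ambiguity concrete, here is a minimal Python sketch. It is not Lucene code: `render_raw` is a hypothetical stand-in that assumes a term renders as field:text with no escaping and that clauses are joined by spaces, which is roughly what the raw output in the question looks like. Under that assumption, two structurally different queries collapse to the same string, so nothing applied to the string afterwards can tell them apart.

```python
def render_raw(clauses):
    """Hypothetical model of an unescaped ToString(): join field:text pairs as-is."""
    return " ".join(f"{field}:{text}" for field, text in clauses)

# One clause whose term text itself contains a space and a colon...
one_clause = [("title", "a summary:b")]
# ...versus two separate clauses on two different fields.
two_clauses = [("title", "a"), ("summary", "b")]

print(render_raw(one_clause))   # title:a summary:b
print(render_raw(two_clauses))  # title:a summary:b
```

Both calls print the same text, so a regex that only sees the string has no way of knowing whether the second colon is a field separator or part of a term.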

#2

You can use this regex to match the colons that need escaping, while leaving the ones that follow a field name alone:

(?<!title|summary):

Then replace each matched colon with \: to escape it.

Explanation

The negative lookbehind (?<!...) rejects any colon that is immediately preceded by title or summary; every other colon is matched.

See Demo

Input

(title:blah:blah summary:"lala:la")

Output

(title:blah\:blah summary:"lala\:la")
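
For reference, the same idea can be sketched in Python. Python's re module only accepts fixed-width lookbehinds, so the alternation is split into two separate lookbehinds here; the behavior is otherwise the same. The helper name `escape_colons` is made up for this example, and the list of field names is hard-coded, which is the main limitation.

```python
import re

# Match any colon that is NOT immediately preceded by a known field name.
# Equivalent to (?<!title|summary): but written with fixed-width lookbehinds.
COLON = re.compile(r"(?<!title)(?<!summary):")

def escape_colons(query):
    """Prefix every matched colon with a backslash."""
    return COLON.sub(lambda m: "\\:", query)

print(escape_colons('(title:blah:blah summary:"lala:la")'))
# (title:blah\:blah summary:"lala\:la")

print(escape_colons('(title:"this is a search:term")'))
# (title:"this is a search\:term")

# A field that is not in the list, plus a term ending in a field name,
# comes out wrong in both places:
print(escape_colons('(field5:"this is a title:term")'))
# (field5\:"this is a title:term")
```

The last case shows where this breaks down: every field name has to be listed in the pattern, and any term text that happens to end in a field name is left unescaped.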
