Finding matches against an array of regular expressions in a MongoDB collection

Time: 2022-05-06 16:48:19

Say I have a collection with these fields:

{
    "category" : "ONE",
    "data": [
        {
            "regex": "/^[0-9]{2}$/",
            "type" : "TYPE1"
        },
        {
            "regex": "/^[a-z]{3}$/",
            "type" : "TYPE2"
        }
        // etc
    ]
}

My input is "abc", and I'd like to obtain the corresponding type (or the best match, although initially I'm assuming the RegExes are exclusive). Is there any way to achieve this with decent performance, that is, without iterating over each item of the RegEx array?

Please note the schema could be re-arranged if possible, as this project is still in the design phase. So alternatives would be welcomed.

Each category can have around 100 - 150 RegExes. I plan to have around 300 categories. But I do know that types are mutually exclusive.

Real world example for one category:

type1=^34[0-9]{4}$, 
type2=^54[0-9]{4}$, 
type3=^39[0-9]{4}$, 
type4=^1[5-9]{2}$, 
type5=^2[4-9]{2,3}$

2 solutions

#1


2  

Describing the RegEx (Divide et Impera) would greatly help in limiting the number of Documents needed to be processed.

Some ideas in this direction:

  • RegEx accepting length (fixed, min, max)
  • POSIX style character classes ([:alpha:], [:digit:], [:alnum:], etc.)
  • Tree-like Document structure (umm)

Implementing each of these would add complexity (code and/or manual input) at insertion time, plus some overhead for describing the search term before each query.
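
To give a feel for that per-query overhead, here is a minimal sketch of describing a search term before querying. The helper name describeTerm and its field names are hypothetical, and the character classes are plain ASCII approximations of the POSIX classes listed above, not something the original answer specifies.

```javascript
// Hypothetical helper (not part of the original answer): summarise a search
// term so the query can filter on cheap, pre-computed properties.
function describeTerm(term) {
  return {
    length: term.length,
    // ASCII-only approximations of the POSIX classes named above.
    alpha: /^[A-Za-z]+$/.test(term),    // roughly [:alpha:]
    digit: /^[0-9]+$/.test(term),       // roughly [:digit:]
    alnum: /^[A-Za-z0-9]+$/.test(term)  // roughly [:alnum:]
  };
}
```

The result is a small plain object, so computing it costs a few regex tests per lookup, which should be negligible next to the database round trip.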

Having mutually exclusive types in a category simplifies things, but what about between categories?

300 categories @ 100-150 RegExps/category => 30k to 45k RegExps

... some of them, if not most, would surely be exact duplicates.

In this approach I'll try to minimise the total number of Documents to be stored/queried by reversing your initially proposed 'schema'.
Note: this demo uses only string lengths for narrowing. That may come naturally with manual input, since it reinforces a visual check of the RegEx.

Consider rewriting the regexes Collection with Documents as follows:

{
   "max_length": NumberLong(2),
   "min_length": NumberLong(2),
   "regex": "^[0-9]{2}$",
   "types": [
     "ONE/TYPE1",
     "NINE/TYPE6"
   ]
},
{
   "max_length": NumberLong(4),
   "min_length": NumberLong(3),
   "regex": "^2[4-9]{2,3}$",
   "types": [
     "ONE/TYPE5",
     "TWO/TYPE2",
     "SIX/TYPE8"
   ]
},
{
   "max_length": NumberLong(6),
   "min_length": NumberLong(6),
   "regex": "^39[0-9]{4}$",
   "types": [
     "ONE/TYPE3",
     "SIX/TYPE2"
   ]
},
{
   "max_length": NumberLong(3),
   "min_length": NumberLong(3),
   "regex": "^[a-z]{3}$",
   "types": [
     "ONE/TYPE2"
   ]
}

... each unique RegEx is its own document, listing the Categories it belongs to (extensible to multiple types per category).
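
As a rough illustration of that reversal, the sketch below groups the category documents from the question by regex string. The function name invertSchema is hypothetical, and it assumes the surrounding slashes have already been stripped from each stored regex; min_length and max_length are left out because the answer suggests entering them by hand.

```javascript
// Hypothetical helper: turn category documents (the question's schema) into
// one document per unique regex, each listing "CATEGORY/TYPE" entries.
// Assumption: regex strings are stored without the surrounding slashes.
function invertSchema(categories) {
  const byRegex = new Map();
  for (const cat of categories) {
    for (const item of cat.data) {
      if (!byRegex.has(item.regex)) {
        byRegex.set(item.regex, { regex: item.regex, types: [] });
      }
      byRegex.get(item.regex).types.push(cat.category + '/' + item.type);
    }
  }
  // min_length / max_length would still have to be added to each document.
  return Array.from(byRegex.values());
}
```

Exact string comparison only catches regexes written identically; two equivalent patterns spelled differently would still end up as separate documents.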

Demo Aggregation code:

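function findTypes(query) {
   // Narrow the candidates server-side by accepted length and category,
   // then test the few remaining regexes client-side.
   var match = null;

   // aggregate() takes a pipeline array and returns a cursor in MongoDB 2.6+
   // (older shells returned a document with a .result array instead).
   db.regexes.aggregate([
    {$match: {
        max_length: {$gte: query.length},
        min_length: {$lte: query.length},
        types: /^ONE\//
        }
    },
    {$project: {
        regex: 1,
        types: 1,
        _id: 0
        }
    }
   ]).toArray().some(function (re) {
       // Stop at the first regex that matches the search term.
       if (new RegExp(re.regex).test(query)) {
           match = re.types;
           return true;
       }
       return false;
   });
   return match;
}

findTypes('abc');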

Return for 'abc' query:

[
   "ONE/TYPE2"
] 

This will run against only these two Documents:

{
   "regex": "^2[4-9]{2,3}$",
   "types": [
     "ONE/TYPE5",
     "TWO/TYPE2",
     "SIX/TYPE8"
   ]
},
{
   "regex": "^[a-z]{3}$",
   "types": [
     "ONE/TYPE2"
   ]
}

narrowed by the length 3 and by having the category ONE.

This could be narrowed even further by implementing POSIX descriptors (easy to test against the search term, but it means entering 2 RegExps in the DB).
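
One possible reading of that idea, sketched under assumptions: instead of a second RegExp, each regex document carries a hypothetical char_class label ('digit', 'alpha', 'alnum' or 'any') naming the narrowest class of strings it can accept, and the filter keeps only documents whose label is compatible with the search term. Neither the field nor the buildFilter helper exists in the original answer, and category names are assumed to contain no regex metacharacters.

```javascript
// Hypothetical filter builder: extends the length/category narrowing with an
// assumed `char_class` field stored on every regex document.
function buildFilter(term, category) {
  let classes;
  if (/^[0-9]+$/.test(term)) {
    classes = ['digit', 'alnum', 'any'];   // digits-only term
  } else if (/^[A-Za-z]+$/.test(term)) {
    classes = ['alpha', 'alnum', 'any'];   // letters-only term
  } else if (/^[A-Za-z0-9]+$/.test(term)) {
    classes = ['alnum', 'any'];            // mixed letters and digits
  } else {
    classes = ['any'];                     // contains other characters
  }
  return {
    max_length: { $gte: term.length },
    min_length: { $lte: term.length },
    types: new RegExp('^' + category + '/'),
    char_class: { $in: classes }
  };
}
```

The returned object could be passed as the $match stage of the demo aggregation. The label only has to be conservative: a document that slips through is still checked against its real RegExp afterwards.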

#2


0  

Breadth-first search. If your input starts with a letter, you can throw away type 1; if it also contains a number, you can throw away the exclusive (numbers-only or letters-only) categories; and if it also contains a symbol, keep only the handful of types that contain all three. Then follow the advice above for the remaining categories. In a sense, set up cases for the kinds of input, and within each case keep a select number of 'regex types' to search down to the right one.

Or you can create a regex model based on the input and compare it to the list of regex models existing as a string to get the type. That way you just have to spend resources analyzing the input to build the regex for it.
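
A minimal sketch of what such an input-derived model might look like, assuming a coarse "shape" made of character-class runs is enough. The names shapeOf and classOf and the d/a/s alphabet are invented for illustration. Several stored regexes can share one shape (every six-digit pattern in the question's example would), so a shape lookup would narrow the candidates rather than pick the type outright.

```javascript
// Hypothetical: classify one character as digit (d), letter (a) or symbol (s).
function classOf(ch) {
  if (/[0-9]/.test(ch)) return 'd';
  if (/[A-Za-z]/.test(ch)) return 'a';
  return 's';
}

// Hypothetical: build a run-length "shape" string for the input, which can be
// compared as a plain string against shapes stored next to each regex.
function shapeOf(term) {
  let shape = '';
  let i = 0;
  while (i < term.length) {
    const cls = classOf(term[i]);
    let run = 1;
    while (i + run < term.length && classOf(term[i + run]) === cls) {
      run++;
    }
    shape += cls + '{' + run + '}';
    i += run;
  }
  return shape;
}
```

With this encoding, 'abc' yields a{3} and '340000' yields d{6}, so an exact-match (indexable) lookup on a stored shape field could replace part of the regex testing.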
