Using string matching to reduce query results

Time: 2022-09-19 18:58:29

I have a table with the format:

Id     | Loc |
-------|-----|
789-A  | 4   |
123    | 1   |
123-BZ | 1   |
123-CG | 2   |
456    | 2   |
456    | 3   |
789    | 4   |

I want to exclude certain rows from the result of a query based on whether a duplicate exists. In this case, though, the definition of a duplicate row is fairly complex:

If any row returned by the query (let's refer to this hypothetical row as ThisRow) has a counterpart row also contained within the query results where Loc is identical to ThisRow.Loc AND Id is of the form <ThisRow.Id>-<an alphanumeric suffix> then ThisRow should be considered a duplicate and excluded from the query results.

For example, using the table above, SELECT * FROM table should return the result set below:

Id     | Loc |
-------|-----|
789-A  | 4   |
123-BZ | 1   |
123-CG | 2   |
456    | 2   |
456    | 3   |

I understand how to write the string matching conditional:

PossibleDuplicateRow.Id REGEXP CONCAT('^', ThisRow.Id, '-[A-Za-z0-9]+$')

and the straight comparison of Loc:

ThisRow.Loc = PossibleDuplicateRow.Loc

What I don't understand is how to form these conditionals into a (self-referential?) query.

2 solutions

#1


3  

Here's one way:

SELECT *
FROM myTable t1
WHERE NOT EXISTS
(
    SELECT 1
    FROM myTable t2
    WHERE t2.Loc = t1.Loc
    AND t2.Id LIKE CONCAT(t1.Id, '-%')
)
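
For anyone who wants to try the NOT EXISTS form without a MySQL server, here is a minimal sketch that runs it against the sample rows from the question using Python's built-in sqlite3 module. The function name `rows_without_suffixed_twin` and the explicit ORDER BY are assumptions added for illustration, and SQLite's `||` operator stands in for MySQL's CONCAT.

```python
import sqlite3

# Sample rows from the question. Id is stored as text so that
# '123' and '123-BZ' are compared as strings.
ROWS = [("789-A", 4), ("123", 1), ("123-BZ", 1), ("123-CG", 2),
        ("456", 2), ("456", 3), ("789", 4)]


def rows_without_suffixed_twin(rows):
    """Return rows with no same-Loc counterpart whose Id is '<Id>-<suffix>'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE myTable (Id TEXT, Loc INTEGER)")
    conn.executemany("INSERT INTO myTable VALUES (?, ?)", rows)
    query = """
        SELECT t1.Id, t1.Loc
        FROM myTable t1
        WHERE NOT EXISTS (
            SELECT 1
            FROM myTable t2
            WHERE t2.Loc = t1.Loc
              -- SQLite spelling of CONCAT(t1.Id, '-%')
              AND t2.Id LIKE (t1.Id || '-%')
        )
        ORDER BY t1.Id, t1.Loc
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


result = rows_without_suffixed_twin(ROWS)
print(result)
```

On the sample data this drops "123, 1" and "789, 4" and keeps the other five rows. Note that LIKE '-%' accepts any suffix, including an empty or non-alphanumeric one, so it is slightly looser than the alphanumeric rule stated in the question.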

Or, the same query using an anti-join (which should be a little faster):

SELECT t1.*
FROM myTable t1
LEFT OUTER JOIN myTable t2 
    ON t2.Loc = t1.Loc
    AND t2.Id LIKE CONCAT(t1.Id, '-%')
WHERE t2.Id IS NULL
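
As a rough check that the anti-join and the NOT EXISTS form agree, the sketch below runs both on the question's sample rows with Python's sqlite3 module. The `MATCH` string and variable names are illustrative assumptions, and `||` replaces CONCAT because this runs on SQLite rather than MySQL. It also shows why the join form should select only t1's columns: a bare SELECT * over the join returns t2's columns as well, all NULL for the surviving rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (Id TEXT, Loc INTEGER)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?)",
    [("789-A", 4), ("123", 1), ("123-BZ", 1), ("123-CG", 2),
     ("456", 2), ("456", 3), ("789", 4)],
)

# Shared match condition: same Loc, and t2.Id is t1.Id plus a '-' suffix.
MATCH = "t2.Loc = t1.Loc AND t2.Id LIKE (t1.Id || '-%')"

# Anti-join: keep t1 rows for which the outer join found no partner.
anti_join = conn.execute(f"""
    SELECT t1.Id, t1.Loc
    FROM myTable t1
    LEFT OUTER JOIN myTable t2 ON {MATCH}
    WHERE t2.Id IS NULL
    ORDER BY t1.Id, t1.Loc
""").fetchall()

# Equivalent NOT EXISTS form, for comparison.
not_exists = conn.execute(f"""
    SELECT t1.Id, t1.Loc
    FROM myTable t1
    WHERE NOT EXISTS (SELECT 1 FROM myTable t2 WHERE {MATCH})
    ORDER BY t1.Id, t1.Loc
""").fetchall()

# Count the columns a bare SELECT * would return from the join.
star_cursor = conn.execute(f"""
    SELECT *
    FROM myTable t1
    LEFT OUTER JOIN myTable t2 ON {MATCH}
    WHERE t2.Id IS NULL
""")
star_width = len(star_cursor.description)

print(anti_join == not_exists, star_width)
conn.close()
```

Whether the join is actually faster than NOT EXISTS depends on the engine, version, and indexes, so it is worth comparing the two with EXPLAIN on real data.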

#2


0  

In the example data you give, no base Id has more than one suffixed row at the same Loc. For example, you don't have a row "123-AZ, 1", which would make the prefix row "123, 1" conflict with two rows.

If this is a real characteristic of the data, then you can eliminate duplicates without a self-join by using aggregation:

select max(id), loc
from t
group by (case when locate('-', id) = 0 then id
               else left(id, locate('-', id) - 1)
          end), loc

I offer this because an aggregation should be much faster than a non-equijoin.
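
To make the assumption behind the aggregation concrete, here is a small sketch using Python's sqlite3 module. The helper name `dedupe_by_prefix` is hypothetical, and SQLite's instr and substr stand in for MySQL's LOCATE and LEFT. It runs the grouping query on the question's sample rows and then again with an extra row "123-AZ, 1" added.

```python
import sqlite3

SAMPLE = [("789-A", 4), ("123", 1), ("123-BZ", 1), ("123-CG", 2),
          ("456", 2), ("456", 3), ("789", 4)]


def dedupe_by_prefix(rows):
    """Group by (Id up to the first '-', Loc) and keep the largest Id per group."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id TEXT, loc INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute("""
        SELECT max(id), loc
        FROM t
        GROUP BY (CASE WHEN instr(id, '-') = 0 THEN id
                       ELSE substr(id, 1, instr(id, '-') - 1)
                  END), loc
        ORDER BY 1, 2
    """).fetchall()
    conn.close()
    return result


on_sample = dedupe_by_prefix(SAMPLE)

# Add a second suffixed row for the same prefix and Loc.
with_extra = dedupe_by_prefix(SAMPLE + [("123-AZ", 1)])

print(on_sample)
print(with_extra)
```

On the sample data the output matches the expected result set, because a plain Id sorts before any suffixed Id that starts with it, so max(id) picks the suffixed row. With the extra row, "123-AZ, 1" and "123-BZ, 1" fall into the same group and only "123-BZ, 1" survives, whereas the rule in the question would keep both. That is the case the aggregation approach assumes never occurs.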
