I have a table with the format:
Id | Loc |
-------|-----|
789-A | 4 |
123 | 1 |
123-BZ | 1 |
123-CG | 2 |
456 | 2 |
456 | 3 |
789 | 4 |
I want to exclude certain rows from the result of a query based on whether a duplicate exists. In this case, though, the definition of a duplicate row is fairly complex:
If any row returned by the query (let's call this hypothetical row ThisRow) has a counterpart row, also contained in the query results, where Loc is identical to ThisRow.Loc AND Id is of the form <ThisRow.Id>-<an alphanumeric suffix>, then ThisRow should be considered a duplicate and excluded from the query results.
For example, using the table above, SELECT * FROM table should return the result set below:
Id | Loc |
-------|-----|
789-A | 4 |
123-BZ | 1 |
123-CG | 2 |
456 | 2 |
456 | 3 |
I understand how to write the string matching conditional:
ThisRow.Id REGEXP '^PossibleDuplicateRow.Id-[A-Za-z0-9]*'
and the straight comparison of Loc:
ThisRow.Loc = PossibleDuplicateRow.Loc
What I don't understand is how to form these conditionals into a (self-referential?) query.
2 solutions
#1
3
Here's one way:
SELECT *
FROM myTable t1
WHERE NOT EXISTS
(
SELECT 1
FROM myTable t2
WHERE t2.Loc = t1.Loc
AND t2.Id LIKE CONCAT(t1.Id, '-%')
)
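To check the NOT EXISTS query against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for MySQL purely as an assumption for convenience, so CONCAT(t1.Id, '-%') is written as t1.Id || '-%'. The table name myTable comes from the query above; the helper name dedupe_not_exists is hypothetical.

```python
import sqlite3

# Sample rows from the question: (Id, Loc)
ROWS = [
    ("789-A", 4), ("123", 1), ("123-BZ", 1), ("123-CG", 2),
    ("456", 2), ("456", 3), ("789", 4),
]


def dedupe_not_exists(rows):
    """Return rows that have no '<Id>-<suffix>' counterpart at the same Loc."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE myTable (Id TEXT, Loc INTEGER)")
    conn.executemany("INSERT INTO myTable VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT t1.Id, t1.Loc
        FROM myTable t1
        WHERE NOT EXISTS (
            SELECT 1
            FROM myTable t2
            WHERE t2.Loc = t1.Loc
              AND t2.Id LIKE t1.Id || '-%'   -- SQLite spelling of CONCAT(t1.Id, '-%')
        )
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in dedupe_not_exists(ROWS):
        print(row)
```

Note that LIKE only checks for the prefix followed by a hyphen; it does not require the suffix to be alphanumeric, and any % or _ characters inside an Id would be treated as wildcards.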
Or, the same query written as an anti-join (which may be a little faster, depending on the database's optimizer):
SELECT t1.*
FROM myTable t1
LEFT OUTER JOIN myTable t2
ON t2.Loc = t1.Loc
AND t2.Id LIKE CONCAT(t1.Id, '-%')
WHERE t2.Id IS NULL
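The anti-join form can be checked the same way. The sketch below again assumes SQLite (via Python's sqlite3) as a stand-in for MySQL, with || in place of CONCAT. It runs the LEFT OUTER JOIN query twice to show one detail worth knowing: selecting * instead of t1.* also returns t2's columns, which are all NULL for the surviving rows. The helper name run_anti_join is hypothetical.

```python
import sqlite3

# Sample rows from the question: (Id, Loc)
ROWS = [
    ("789-A", 4), ("123", 1), ("123-BZ", 1), ("123-CG", 2),
    ("456", 2), ("456", 3), ("789", 4),
]

ANTI_JOIN = """
    SELECT {cols}
    FROM myTable t1
    LEFT OUTER JOIN myTable t2
      ON t2.Loc = t1.Loc
     AND t2.Id LIKE t1.Id || '-%'
    WHERE t2.Id IS NULL   -- keep only t1 rows that found no match
"""


def run_anti_join(cols, rows=ROWS):
    """Run the anti-join with the given select list and return all rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE myTable (Id TEXT, Loc INTEGER)")
    conn.executemany("INSERT INTO myTable VALUES (?, ?)", rows)
    result = conn.execute(ANTI_JOIN.format(cols=cols)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run_anti_join("t1.*"))  # two columns per row: t1.Id, t1.Loc
    print(run_anti_join("*"))     # four columns per row; the t2 pair is NULL
```

The IS NULL test on t2.Id only works as an anti-join marker if Id itself is never NULL in the table, which is assumed here.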
#2
0
In the example data you give, no prefix row has more than one suffixed counterpart at the same Loc. For example, there is no row "123-AZ, 1", which would make the prefix row "123, 1" conflict with two rows ("123-AZ, 1" and "123-BZ, 1").
If this is a real characteristic of the data, then you can eliminate the duplicates without a self-join by using aggregation:
-- locate(substr, str): the substring to search for comes first
select max(id) as id, loc
from t
group by (case when locate('-', id) = 0 then id
               else left(id, locate('-', id) - 1)
          end), loc
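As a rough check of the aggregation idea, the sketch below runs an equivalent query in SQLite through Python's sqlite3. This is an assumption made for the sake of a runnable example: SQLite has no LOCATE or LEFT, so instr(Id, '-') and substr(Id, 1, n) are used in their place. The table name t follows the answer; dedupe_by_aggregation is a hypothetical helper name. The last lines show the stated limitation: with an extra "123-AZ, 1" row, only one suffixed row per prefix and Loc survives.

```python
import sqlite3

# Sample rows from the question: (Id, Loc)
ROWS = [
    ("789-A", 4), ("123", 1), ("123-BZ", 1), ("123-CG", 2),
    ("456", 2), ("456", 3), ("789", 4),
]


def dedupe_by_aggregation(rows):
    """Group by (Id prefix before the first '-', Loc) and keep MAX(Id)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (Id TEXT, Loc INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT MAX(Id) AS Id, Loc
        FROM t
        GROUP BY CASE WHEN instr(Id, '-') = 0 THEN Id
                      ELSE substr(Id, 1, instr(Id, '-') - 1)
                 END,
                 Loc
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(sorted(dedupe_by_aggregation(ROWS)))
    # With two suffixed rows for the same prefix and Loc, one of them is lost:
    print(sorted(dedupe_by_aggregation(ROWS + [("123-AZ", 1)])))
```

MAX(Id) picks the suffixed row because a string sorts before any longer string that starts with it, so "123" compares lower than "123-BZ". Exact duplicate rows would also collapse into one under this approach, unlike the NOT EXISTS query.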
I offer this because an aggregation should be much faster than a non-equijoin.