I would like to remove duplicate rows only when three columns (name, price and new price) hold the same values, and I want to do this from a separate Python script.
The data can be inserted into the database as it is; a second Python script, run as a cron job, should then delete the duplicates.
So in this case:
cur.execute("INSERT INTO cars VALUES(8,'Hummer',41400, 49747)")
cur.execute("INSERT INTO cars VALUES(9,'Volkswagen',21600, 36456)")
are duplicates. Example script with inserted data:
import sys

import psycopg2

con = None
try:
    con = psycopg2.connect(database='testdb', user='janbodnar')
    cur = con.cursor()
    # "new price" contains a space, so it has to be double-quoted in SQL.
    cur.execute('CREATE TABLE cars(id INT PRIMARY KEY, name VARCHAR(20), price INT, "new price" INT)')
    cur.execute("INSERT INTO cars VALUES(1,'Audi',52642, 98484)")
    cur.execute("INSERT INTO cars VALUES(2,'Mercedes',57127, 874897)")
    cur.execute("INSERT INTO cars VALUES(3,'Skoda',9000, 439788)")
    cur.execute("INSERT INTO cars VALUES(4,'Volvo',29000, 743878)")
    cur.execute("INSERT INTO cars VALUES(5,'Bentley',350000, 434684)")
    cur.execute("INSERT INTO cars VALUES(6,'Citroen',21000, 43874)")
    cur.execute("INSERT INTO cars VALUES(7,'Hummer',41400, 49747)")
    cur.execute("INSERT INTO cars VALUES(8,'Hummer',41400, 49747)")
    cur.execute("INSERT INTO cars VALUES(9,'Volkswagen',21600, 36456)")
    cur.execute("INSERT INTO cars VALUES(10,'Volkswagen',21600, 36456)")
    con.commit()
except psycopg2.DatabaseError as e:
    # Undo the partial transaction before exiting.
    if con:
        con.rollback()
    print('Error %s' % e)
    sys.exit(1)
finally:
    if con:
        con.close()
2 answers
#1
3
You can do this in one statement without additional round-trips to the server.
DELETE FROM cars
USING (
    SELECT id, row_number() OVER (PARTITION BY name, price, new_price
                                  ORDER BY id) AS rn
    FROM cars
    ) x
WHERE cars.id = x.id
AND x.rn > 1;
Requires PostgreSQL 8.4 or later for the window function row_number().
Out of each set of dupes, the row with the smallest id survives.
Note that I changed "new price" to new_price.
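To make the "smallest id survives" rule concrete, here is a small pure-Python sketch that imitates what row_number() does per partition. It does not talk to the database at all; the helper name ids_to_delete and the in-memory rows list are my own illustration, with the rows copied from the question.

```python
from itertools import groupby

# Sample rows from the question: (id, name, price, new_price).
rows = [
    (1, 'Audi', 52642, 98484),
    (2, 'Mercedes', 57127, 874897),
    (3, 'Skoda', 9000, 439788),
    (4, 'Volvo', 29000, 743878),
    (5, 'Bentley', 350000, 434684),
    (6, 'Citroen', 21000, 43874),
    (7, 'Hummer', 41400, 49747),
    (8, 'Hummer', 41400, 49747),
    (9, 'Volkswagen', 21600, 36456),
    (10, 'Volkswagen', 21600, 36456),
]


def ids_to_delete(rows):
    """Imitate row_number() OVER (PARTITION BY name, price, new_price
    ORDER BY id) and return the ids whose row number is greater than 1."""
    def partition_key(row):
        return row[1:]  # (name, price, new_price)

    # Sorting by (partition key, id) groups each partition together
    # and orders the rows inside it by id.
    ordered = sorted(rows, key=lambda r: (partition_key(r), r[0]))
    doomed = []
    for _, group in groupby(ordered, key=partition_key):
        for rn, row in enumerate(group, start=1):
            if rn > 1:  # same condition as "x.rn > 1" in the DELETE
                doomed.append(row[0])
    return sorted(doomed)


print(ids_to_delete(rows))  # [8, 10]
```

Rows 7 and 9 get row number 1 in their partitions and stay; 8 and 10 get row number 2 and are the ones the DELETE removes.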
Or use the EXISTS semi-join that @wildplasser posted as a comment, to the same effect.
Or, by special request of CTE devotee @wildplasser, with a CTE instead of the subquery ... :)
WITH x AS (
    SELECT id, row_number() OVER (PARTITION BY name, price, new_price
                                  ORDER BY id) AS rn
    FROM cars
    )
DELETE FROM cars
USING x
WHERE cars.id = x.id
AND x.rn > 1;
Data-modifying CTEs require PostgreSQL 9.1 or later.
This form will perform about the same as the one with the subquery.
#2
2
Use a GROUP BY SQL statement to identify the duplicated rows, together with the lowest primary key in each group:
duplicate_query = '''\
SELECT MIN(id), "name", price, "new price"
FROM cars
GROUP BY "name", price, "new price"
HAVING COUNT(ID) > 1
'''
The above query selects the lowest primary key id for each group of (name, price, "new price") rows that contains more than one row. For your sample data, this will return:
7, 'Hummer', 41400, 49747
9, 'Volkswagen', 21600, 36456
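If you want to check that result without a PostgreSQL server at hand, the same query can be tried against a throwaway in-memory SQLite table. This is only a stand-in for illustration: sqlite3 is not what the question uses, and the ORDER BY 1 is my addition so the output order is deterministic.

```python
import sqlite3

# Stand-in database: an in-memory SQLite table holding the question's rows.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE cars(id INT PRIMARY KEY, name VARCHAR(20), '
            'price INT, "new price" INT)')
con.executemany('INSERT INTO cars VALUES (?, ?, ?, ?)', [
    (1, 'Audi', 52642, 98484),
    (2, 'Mercedes', 57127, 874897),
    (3, 'Skoda', 9000, 439788),
    (4, 'Volvo', 29000, 743878),
    (5, 'Bentley', 350000, 434684),
    (6, 'Citroen', 21000, 43874),
    (7, 'Hummer', 41400, 49747),
    (8, 'Hummer', 41400, 49747),
    (9, 'Volkswagen', 21600, 36456),
    (10, 'Volkswagen', 21600, 36456),
])

# Same query as above, plus ORDER BY 1 to fix the order of the result.
duplicate_query = '''
SELECT MIN(id), "name", price, "new price"
FROM cars
GROUP BY "name", price, "new price"
HAVING COUNT(id) > 1
ORDER BY 1
'''
dupes = con.execute(duplicate_query).fetchall()
print(dupes)
# [(7, 'Hummer', 41400, 49747), (9, 'Volkswagen', 21600, 36456)]
```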
You can then use the returned data to delete the duplicates:
delete_dupes = '''
DELETE
FROM cars
WHERE
"name"=%(name)s AND price=%(price)s AND "new price"=%(newprice)s AND
id > %(id)s
'''
cur.execute(duplicate_query)
dupes = cur.fetchall()
cur.executemany(delete_dupes, [
    dict(name=r[1], price=r[2], newprice=r[3], id=r[0])
    for r in dupes])
Note that we delete any row where the primary key id is larger than the first id with the same 3 columns. For the first dupe, only the row with id 8 matches; for the second dupe, the row with id 10 matches.
This does a separate delete for each dupe found. You can combine this into one statement with a WHERE EXISTS sub-select query:
delete_dupes = '''\
DELETE FROM cars cdel
WHERE EXISTS (
SELECT *
FROM cars cex
WHERE
cex."name" = cdel."name" AND
cex.price = cdel.price AND
cex."new price" = cdel."new price" AND
cex.id > cdel.id
)
'''
cur.execute(delete_dupes)
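This instructs PostgreSQL to delete any row for which another row exists with the same name, price and new price but with a higher primary key. Note that this variant therefore keeps the row with the highest id in each group (8 and 10 in the sample data); change the comparison to cex.id < cdel.id if you want the lowest id to survive, as in the two-step version above.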
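Since the question asks for a separate script that cron can run, the single-statement delete could be wrapped roughly like this. Treat it as a sketch under assumptions: remove_duplicates and demo_connection are names I made up, the comparison is flipped to cex.id < cars.id so the lowest id survives, and the demo runs against an in-memory SQLite database purely so it works without a server. In the real cron job you would pass in the psycopg2 connection from the question instead.

```python
import sqlite3

# Deletes every row for which an identical row with a LOWER id exists,
# so the lowest id in each group of duplicates is kept.
DELETE_DUPES = '''
DELETE FROM cars
WHERE EXISTS (
    SELECT 1
    FROM cars AS cex
    WHERE cex."name" = cars."name"
      AND cex.price = cars.price
      AND cex."new price" = cars."new price"
      AND cex.id < cars.id
)
'''


def remove_duplicates(con):
    """Run the delete on a DB-API connection; return the number of rows removed."""
    cur = con.cursor()
    try:
        cur.execute(DELETE_DUPES)
        removed = cur.rowcount
        con.commit()
        return removed
    except Exception:
        con.rollback()  # leave the table untouched if anything fails
        raise
    finally:
        cur.close()


def demo_connection():
    """Hypothetical stand-in for psycopg2.connect(...): in-memory SQLite."""
    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE cars(id INT PRIMARY KEY, name VARCHAR(20), '
                'price INT, "new price" INT)')
    con.executemany('INSERT INTO cars VALUES (?, ?, ?, ?)', [
        (1, 'Audi', 52642, 98484), (2, 'Mercedes', 57127, 874897),
        (3, 'Skoda', 9000, 439788), (4, 'Volvo', 29000, 743878),
        (5, 'Bentley', 350000, 434684), (6, 'Citroen', 21000, 43874),
        (7, 'Hummer', 41400, 49747), (8, 'Hummer', 41400, 49747),
        (9, 'Volkswagen', 21600, 36456), (10, 'Volkswagen', 21600, 36456),
    ])
    con.commit()
    return con


if __name__ == '__main__':
    con = demo_connection()
    try:
        print('removed %d duplicate rows' % remove_duplicates(con))
    finally:
        con.close()
```

A crontab entry would then simply call the script on whatever schedule suits you; the script path and schedule are yours to choose.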