How to remove duplicate text records from a MySQL table

Time: 2022-01-11 12:54:10

I have a MySQL table named words with two columns: wordid, which is the primary key, and lemma.

I need to remove the rows with duplicate lemma values from the table. How can I do this with a MySQL command? Here is a sample of my table:

+--------+---------------------+
| wordid | lemma               |
+--------+---------------------+
| 148206 | wilful disobedience |
| 149162 | wilful disobedience |
| 149857 | wilful disobedience |
+--------+---------------------+

4 answers

#1


1  

The easiest way to do this is to add a UNIQUE index on the lemma column. Include IGNORE in the ALTER statement so that all the duplicates are removed. Note that later inserts of duplicate values will then throw an error. The IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so this statement only works on older versions.

ALTER IGNORE TABLE words
ADD UNIQUE INDEX idx_lemma (lemma);
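
On servers that reject ALTER IGNORE TABLE, a similar effect can be sketched with a copy of the table and INSERT IGNORE. This is only a sketch under assumptions: the table is named words as in the question, lemma is a VARCHAR short enough to index in full, no foreign keys reference words, and words_dedup and words_old are hypothetical names chosen for illustration.

```sql
-- Hypothetical copy table with the same structure as words.
CREATE TABLE words_dedup LIKE words;

-- The unique index is what makes duplicate lemma values get rejected.
ALTER TABLE words_dedup
ADD UNIQUE INDEX idx_lemma (lemma);

-- INSERT IGNORE skips rows that would violate the unique index,
-- so at most one row per lemma is copied. The ORDER BY is intended
-- to make the lowest wordid the one that survives.
INSERT IGNORE INTO words_dedup
SELECT * FROM words
ORDER BY wordid;

-- Swap the tables once the copy has been checked.
RENAME TABLE words TO words_old, words_dedup TO words;
```

Keeping words_old around until the result has been checked leaves a way back if the wrong rows were dropped.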

#2


1  

You can do this in one statement with the following query:

delete from table_name
where wordid not in (select keep_id from (select min(wordid) as keep_id from table_name group by lemma) as keepers)

The inner query selects the lowest wordid for each lemma. It is wrapped in a derived table because MySQL does not allow a DELETE to select from its own target table directly in a subquery. The outer query then deletes every row whose wordid is not in that result, which removes all the other rows with a repeated lemma.
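
Before running a delete like this, it can help to preview the rows it would remove and, afterwards, to confirm that no duplicates are left. The two queries below are a sketch that assumes the table is named words, as in the question; the aliases w, k, keep_id and copies are illustrative.

```sql
-- Preview: every row whose wordid is not the lowest one for its lemma.
SELECT w.wordid, w.lemma
FROM words AS w
JOIN (
    SELECT lemma, MIN(wordid) AS keep_id
    FROM words
    GROUP BY lemma
) AS k ON k.lemma = w.lemma
WHERE w.wordid <> k.keep_id;

-- Check after deleting: lemma values that still occur more than once.
SELECT lemma, COUNT(*) AS copies
FROM words
GROUP BY lemma
HAVING COUNT(*) > 1;
```

With the three sample rows from the question, the preview should list wordid 149162 and 149857, leaving 148206 as the row that is kept. The second query should return no rows once the duplicates are gone.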

#3


0  

You could use a DELETE with an INNER JOIN on a subselect that finds, for each duplicated lemma, the wordid to keep:

  delete a
  from my_table a
  inner join (
    select lemma, min(wordid) as wordid_to_keep
    from my_table
    group by lemma
    having count(*) > 1
  ) t on a.lemma = t.lemma
  where a.wordid <> t.wordid_to_keep
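
A join-based delete can also be sketched as a plain self-join, without any subselect. The example assumes the table is named words, as in the question; w1 and w2 are illustrative aliases.

```sql
-- Delete every row for which another row has the same lemma
-- and a lower wordid, so the lowest wordid per lemma is kept.
DELETE w1
FROM words AS w1
JOIN words AS w2
  ON w1.lemma = w2.lemma
 AND w1.wordid > w2.wordid;
```

On a large table this form will likely benefit from an index on lemma, since each row is matched against the other rows with the same value.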

#4


0  

The first step is to identify which rows have duplicate primary key values:

    SELECT col1, col2, count(*)
    FROM t1
    GROUP BY col1, col2
    HAVING count(*) > 1

This returns one row for each set of duplicate PK values in the table. The last column of the result is the number of duplicates for that particular PK value.

If there are only a few sets of duplicate PK values, the best procedure is to delete them manually, one set at a time. For example:

    delete from t1
    where col1=1 and col2=1
    limit 1

The LIMIT value should be n-1, where n is the number of duplicates for the given key value.

If there are many distinct sets of duplicate PK values in the table, it may be too time-consuming to remove them individually. In this case the following procedure can be used:

-- First, run the above GROUP BY query to determine how many sets of duplicate PK values exist, and the count of duplicates for each set.

-- Select the duplicate key values into a holding table. For example:

    CREATE TABLE holdkey AS
    SELECT col1, col2, count(*) AS col3
    FROM t1
    GROUP BY col1, col2
    HAVING count(*) > 1

-- Select the duplicate rows into a holding table, eliminating duplicates in the process. For example:

    CREATE TABLE holddups AS
    SELECT DISTINCT t1.*
    FROM t1, holdkey
    WHERE t1.col1 = holdkey.col1
    AND t1.col2 = holdkey.col2

At this point the holddups table should have unique PKs. However, this will not be the case if t1 had rows that share a PK but differ in other columns, because SELECT DISTINCT only collapses rows that are identical in every column.

-- Delete the duplicate rows from the original table. For example:

    DELETE t1
    FROM t1, holdkey
    WHERE t1.col1 = holdkey.col1
    AND t1.col2 = holdkey.col2

-- Put the unique rows back in the original table. For example:

    INSERT INTO t1 SELECT * FROM holddups

Hope this helps!

