Removing duplicates from a large MySQL table

Time: 2022-01-11 12:54:16

I have a question about MySQL. I have a table with 7,479,194 records, and some records are duplicated. I would like to do this:

insert into new_table 
  select * 
    from old_table 
group by old_table.a, old_table.b

so I would take out the duplicated entries... but the problem is that this is a large amount of data. The table is MyISAM.

This is example data; I would like to group it by city and short_ccode:

id          city      post_code        short_ccode
----------------------------------------------------
4732875     Celje     3502             si
4733306     Celje     3502             si
4734250     Celje     3502             si
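
In other words, with these columns the query above would read something like this (keeping SELECT * with a non-aggregated GROUP BY, as in the original query; MySQL 5.0 accepts that under its default sql_mode):

insert into new_table
  select *
    from old_table
group by old_table.city, old_table.short_ccode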

I suppose I have to modify the my.ini file to give the GROUP BY statement more memory... which settings are responsible for that?

I have a machine with 3 GB of RAM and a 2 GHz processor.

My ini file:

# MySQL Server Instance Configuration File
# ----------------------------------------------------------------------
# Generated by the MySQL Server Instance Configuration Wizard
#
#
# Installation Instructions
# ----------------------------------------------------------------------
#
# On Linux you can copy this file to /etc/my.cnf to set global options,
# mysql-data-dir/my.cnf to set server-specific options
# (@localstatedir@ for this installation) or to
# ~/.my.cnf to set user-specific options.
#
# On Windows you should keep this file in the installation directory 
# of your server (e.g. C:\Program Files\MySQL\MySQL Server 4.1). To
# make sure the server reads the config file use the startup option 
# "--defaults-file". 
#
# To run the server from the command line, execute this in a 
# command line shell, e.g.
# mysqld --defaults-file="C:\Program Files\MySQL\MySQL Server 4.1\my.ini"
#
# To install the server as a Windows service manually, execute this in a 
# command line shell, e.g.
# mysqld --install MySQL41 --defaults-file="C:\Program Files\MySQL\MySQL Server 4.1\my.ini"
#
# And then execute this in a command line shell to start the server, e.g.
# net start MySQL41
#
#
# Guidelines for editing this file
# ----------------------------------------------------------------------
#
# In this file, you can use all long options that the program supports.
# If you want to know the options a program supports, start the program
# with the "--help" option.
#
# More detailed information about the individual options can also be
# found in the manual.
#
#
# CLIENT SECTION
# ----------------------------------------------------------------------
#
# The following options will be read by MySQL client applications.
# Note that only client applications shipped by MySQL are guaranteed
# to read this section. If you want your own MySQL client program to
# honor these values, you need to specify it as an option during the
# MySQL client library initialization.
#
[client]

port=3306


# SERVER SECTION
# ----------------------------------------------------------------------
#
# The following options will be read by the MySQL Server. Make sure that
# you have installed the server correctly (see above) so it reads this 
# file.
#
[wampmysqld]

# The TCP/IP Port the MySQL Server will listen on
port=3306


#Path to installation directory. All paths are usually resolved relative to this.
basedir=d:/wamp/bin/mysql/mysql5.0.45

#log file
log-error=d:/wamp/logs/mysql.log

#Path to the database root
datadir=d:/wamp/bin/mysql/mysql5.0.45/data

# The default character set that will be used when a new schema or table is
# created and no character set is defined
default-character-set=utf8

# The default storage engine that will be used when creating new tables
default-storage-engine=MyISAM

# The maximum amount of concurrent sessions the MySQL server will
# allow. One of these connections will be reserved for a user with
# SUPER privileges to allow the administrator to login even if the
# connection limit has been reached.
max_connections=1000

# Query cache is used to cache SELECT results and later return them
# without actually executing the same query once again. Having the query
# cache enabled may result in significant speed improvements, if you
# have a lot of identical queries and rarely changing tables. See the
# "Qcache_lowmem_prunes" status variable to check if the current value
# is high enough for your load.
# Note: In case your tables change very often or if your queries are
# textually different every time, the query cache may result in a
# slowdown instead of a performance improvement.
query_cache_size=16M

# The number of open tables for all threads. Increasing this value
# increases the number of file descriptors that mysqld requires.
# Therefore you have to make sure to set the amount of open files
# allowed to at least 4096 in the variable "open-files-limit" in
# section [mysqld_safe]
table_cache=500

# Maximum size for internal (in-memory) temporary tables. If a table
# grows larger than this value, it is automatically converted to a disk-
# based table. This limitation is for a single table. There can be many
# of them.
tmp_table_size=32M


# How many threads we should keep in a cache for reuse. When a client
# disconnects, the client's threads are put in the cache if there aren't
# more than thread_cache_size threads from before.  This greatly reduces
# the amount of thread creations needed if you have a lot of new
# connections. (Normally this doesn't give a notable performance
# improvement if you have a good thread implementation.)
thread_cache_size=12

#*** MyISAM Specific options

# The maximum size of the temporary file MySQL is allowed to use while
# recreating the index (during REPAIR, ALTER TABLE or LOAD DATA INFILE).
# If the file-size would be bigger than this, the index will be created
# through the key cache (which is slower).
myisam_max_sort_file_size=100G

# If the temporary file used for fast index creation would be bigger
# than using the key cache by the amount specified here, then prefer the
# key cache method.  This is mainly used to force long character keys in
# large tables to use the slower key cache method to create the index.
myisam_max_extra_sort_file_size=100G

# The size of the buffer that is allocated when sorting MyISAM indexes
# during a REPAIR TABLE or when creating indexes with CREATE INDEX
# or ALTER TABLE.
myisam_sort_buffer_size=32M

# Size of the Key Buffer, used to cache index blocks for MyISAM tables.
# Do not set it larger than 30% of your available memory, as some memory
# is also required by the OS to cache rows. Even if you're not using
# MyISAM tables, you should still set it to 8-64M as it will also be
# used for internal temporary disk tables.
key_buffer_size=64M

# Size of the buffer used for doing full table scans of MyISAM tables.
# Allocated per thread, if a full scan is needed.
read_buffer_size=2M
read_rnd_buffer_size=8M

# This buffer is allocated when MySQL needs to rebuild the index in
# REPAIR, OPTIMIZE, ALTER TABLE statements as well as in LOAD DATA INFILE
# into an empty table. It is allocated per thread so be careful with
# large settings.
sort_buffer_size=256M


#*** INNODB Specific options ***


# Use this option if you have a MySQL server with InnoDB support enabled
# but you do not plan to use it. This will save memory and disk space
# and speed up some things.
#skip-innodb

# Additional memory pool that is used by InnoDB to store metadata
# information.  If InnoDB requires more memory for this purpose it will
# start to allocate it from the OS.  As this is fast enough on most
# recent operating systems, you normally do not need to change this
# value. SHOW INNODB STATUS will display the current amount used.
innodb_additional_mem_pool_size=20M

# If set to 1, InnoDB will flush (fsync) the transaction logs to the
# disk at each commit, which offers full ACID behavior. If you are
# willing to compromise this safety, and you are running small
# transactions, you may set this to 0 or 2 to reduce disk I/O to the
# logs. Value 0 means that the log is only written to the log file and
# the log file flushed to disk approximately once per second. Value 2
# means the log is written to the log file at each commit, but the log
# file is only flushed to disk approximately once per second.
innodb_flush_log_at_trx_commit=1

# The size of the buffer InnoDB uses for buffering log data. As soon as
# it is full, InnoDB will have to flush it to disk. As it is flushed
# once per second anyway, it does not make sense to have it very large
# (even with long transactions).
innodb_log_buffer_size=8M

# InnoDB, unlike MyISAM, uses a buffer pool to cache both indexes and
# row data. The bigger you set this the less disk I/O is needed to
# access data in tables. On a dedicated database server you may set this
# parameter up to 80% of the machine physical memory size. Do not set it
# too large, though, because competition of the physical memory may
# cause paging in the operating system.  Note that on 32bit systems you
# might be limited to 2-3.5G of user level memory per process, so do not
# set it too high.
innodb_buffer_pool_size=512M

# Size of each log file in a log group. You should set the combined size
# of log files to about 25%-100% of your buffer pool size to avoid
# unneeded buffer pool flush activity on log file overwrite. However,
# note that a larger logfile size will increase the time needed for the
# recovery process.
innodb_log_file_size=10M

# Number of threads allowed inside the InnoDB kernel. The optimal value
# depends highly on the application, hardware as well as the OS
# scheduler properties. A too high value may lead to thread thrashing.
innodb_thread_concurrency=8



[mysqld]
port=3306

6 Answers

#1


2  

This will populate NEW_TABLE with unique values, where the id value is the lowest id of each group:

INSERT INTO NEW_TABLE
  SELECT MIN(ot.id),
         ot.city,
         ot.post_code,
         ot.short_ccode
    FROM OLD_TABLE ot
GROUP BY ot.city, ot.post_code, ot.short_ccode

If you want the highest id value per group instead:

INSERT INTO NEW_TABLE
  SELECT MAX(ot.id),
         ot.city,
         ot.post_code,
         ot.short_ccode
    FROM OLD_TABLE ot
GROUP BY ot.city, ot.post_code, ot.short_ccode

#2


1  

A bit dirty maybe, but it has done the trick for me the few times that I've needed it: Remove duplicate entries in MySQL.

Basically, you simply create a unique index consisting of all the columns that you want to be unique in the table.

As always with this kind of procedure, taking a backup before proceeding is recommended.
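
For the schema in the question, a minimal sketch of that approach could look like the following; ALTER IGNORE TABLE exists in the MySQL 5.x line used here and silently drops the rows that would violate the new unique key (the index name is illustrative):

ALTER IGNORE TABLE old_table
  ADD UNIQUE INDEX uniq_city_post_ccode (city, post_code, short_ccode);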

#3


1  

MySQL has an INSERT IGNORE. From the docs:

[...] however, when INSERT IGNORE is used, the insert operation fails silently for the row containing the unmatched value, but any rows that are matched are inserted.

So you could use your query from above by just adding an IGNORE.
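
A minimal sketch of that, assuming new_table has the same structure as old_table; the unique key over the de-duplication columns (its name here is illustrative) is what makes IGNORE skip the duplicate rows:

ALTER TABLE new_table
  ADD UNIQUE KEY uniq_city_post_ccode (city, post_code, short_ccode);

INSERT IGNORE INTO new_table
  SELECT *
    FROM old_table;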

#4


0  

To avoid the memory issue, avoid the big SELECT by using a small external program with logic like the shell sketch below (the database and column names are placeholders). First, back up your database. Then:

#!/bin/sh
# Repeatedly move one row into table2, then delete every row in table1
# that shares the value of the column ("a") that should NOT be duplicated.
# "mydb", "table1", "table2" and "a" are placeholder names.
while true; do
    # find a record (its id plus the de-duplication column "a")
    row=$(mysql -N -e "SELECT id, a FROM table1 LIMIT 1" mydb)
    if [ -z "$row" ]; then
        break    # no more data in table1
    fi
    id=$(echo "$row" | cut -f1)
    a=$(echo "$row" | cut -f2)

    # copy the row into table2
    mysql -e "INSERT INTO table2 SELECT * FROM table1 WHERE id = $id" mydb

    # delete all entries from table1 with the same value of "a"
    mysql -e "DELETE FROM table1 WHERE a = '$a'" mydb
done

#5


0  

You don't need to group data. Try this:

DELETE FROM old_table
 USING old_table, old_table AS vtable
 WHERE old_table.id > vtable.id
   AND old_table.city = vtable.city
   AND old_table.post_code = vtable.post_code
   AND old_table.short_ccode = vtable.short_ccode

I can't comment on posts because of my points... First run REPAIR TABLE old_table; then show the output of:

EXPLAIN SELECT old_table.id
  FROM old_table, old_table AS vtable
 WHERE old_table.id > vtable.id
   AND old_table.city = vtable.city
   AND old_table.post_code = vtable.post_code
   AND old_table.short_ccode = vtable.short_ccode

Show: os~> ulimit -a; mysql> SHOW VARIABLES LIKE 'open_files_limit';

Next: remove all OS restrictions from the mysql process, e.g.

ulimit -n 1024 etc.

#6


0  

From my experience, once your table grows to millions of records or more, the most effective way to handle duplicates is to: 1) export the data to a text file, 2) sort the file, 3) remove duplicates in the file, and 4) load it back into the database.

As the data grows, this approach eventually becomes faster than any SQL query you may invent.
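
A rough sketch of those four steps for the schema in the question (the file paths are illustrative, the de-duplication happens outside MySQL with a Unix-style sort -u, and only the columns that define a duplicate are kept, so the original id values are not preserved):

-- 1) export the columns that define a duplicate to a tab-separated text file
SELECT city, post_code, short_ccode
  INTO OUTFILE 'd:/wamp/tmp/old_table.txt'
  FROM old_table;

-- 2) + 3) sort the file and drop duplicate lines outside MySQL, e.g.:
--         sort -u old_table.txt > old_table_dedup.txt

-- 4) load the cleaned file into the new table
LOAD DATA INFILE 'd:/wamp/tmp/old_table_dedup.txt'
  INTO TABLE new_table (city, post_code, short_ccode);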
