If I stop a long-running query, does it roll back?

Date: 2022-07-31 03:54:15

A query that loops through 17 million records to remove duplicates has now been running for about 16 hours. If I stop it right now, will it finalize the delete statements, or has it been deleting as it runs? In other words, if I do stop it, does it finalize the deletes or roll them back?

I have found that when I do a

 select count(*) from myTable

that the row count it returns (while this query is running) is about 5 less than the starting row count. Obviously the server resources are extremely poor, so does that mean this process has taken 16 hours to find 5 duplicates (when there are actually thousands), and that it could run for days?

This query took 6 seconds on 2,000 rows of test data, and it works great on that set, so I figured the full set would take around 15 hours (17,000,000 rows / 2,000 rows × 6 seconds ≈ 14 hours).

Any ideas?

Below is the query:

--Declare the looping variable
DECLARE @LoopVar char(10)

DECLARE
  --Set private variables that will be used throughout
  @long DECIMAL,
  @lat DECIMAL,
  @phoneNumber char(10),
  @businessname varchar(64),
  @winner char(10)

SET @LoopVar = (SELECT MIN(RecordID) FROM MyTable)

WHILE @LoopVar IS NOT NULL
BEGIN

  --initialize the private variables (essentially this is a .ctor)
  SELECT
    @long = null,
    @lat = null,
    @businessname = null,
    @phoneNumber = null,
    @winner = null

  -- load data from the row declared when setting @LoopVar
  SELECT
    @long = longitude,
    @lat = latitude,
    @businessname = BusinessName,
    @phoneNumber = Phone
  FROM MyTable
  WHERE RecordID = @LoopVar

  --find the winning row with that data. The winning row is the duplicate we
  --keep: prefer a row with a webAddress, then one with caption1, then one
  --with caption2, breaking ties on the lowest RecordID.
  SELECT TOP 1 @winner = RecordID
  FROM MyTable
  WHERE @long = longitude
    AND @lat = latitude
    AND @businessname = BusinessName
    AND @phoneNumber = Phone
  ORDER BY
    CASE WHEN webAddress IS NOT NULL THEN 1 ELSE 2 END,
    CASE WHEN caption1 IS NOT NULL THEN 1 ELSE 2 END,
    CASE WHEN caption2 IS NOT NULL THEN 1 ELSE 2 END,
    RecordID

  --delete any losers.
  DELETE FROM MyTable
  WHERE @long = longitude
    AND @lat = latitude
    AND @businessname = BusinessName
    AND @phoneNumber = Phone
    AND @winner != RecordID

  -- prep the next loop value to go ahead and perform the next duplicate query.
  SET @LoopVar = (SELECT MIN(RecordID)
                  FROM MyTable
                  WHERE @LoopVar < RecordID)
END

12 Answers

#1


27  

No, SQL Server will not roll back the deletes it has already performed if you stop query execution. Oracle requires an explicit commit of action queries or the data gets rolled back, but not MSSQL.

With SQL Server, it will not roll back unless you are specifically running in the context of a transaction and you roll back that transaction, or the connection closes without the transaction having been committed. But I don't see a transaction context in your query above.
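
For illustration, a minimal sketch of what an explicit transaction context would look like (the table name and value here are hypothetical):

BEGIN TRANSACTION  -- everything below becomes one atomic unit

DELETE FROM MyTable WHERE RecordID = 42

-- If the connection dies here, or ROLLBACK runs, the delete is undone.
-- Without BEGIN TRANSACTION, each DELETE auto-commits as it completes.
ROLLBACK TRANSACTION  -- or COMMIT TRANSACTION to make the delete permanent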

You could also try restructuring your query to make the deletes a little more efficient, but essentially, if the specs of your box are not up to snuff, you might be stuck waiting it out.

Going forward, you should create a unique index on the table to keep yourself from having to go through this again.
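
A sketch of such an index, assuming uniqueness is defined by the four columns the question's query compares (the index name is made up, and the duplicates must already be gone for this to succeed):

CREATE UNIQUE INDEX UX_MyTable_NoDupes
ON MyTable (longitude, latitude, BusinessName, Phone)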

#2


7  

Your query is not wrapped in a transaction, so it won't roll back the changes already made by the individual delete statements.

I specifically tested this myself on my own SQL Server using the following query, and the ApplicationLog table was empty even though I cancelled the query:

declare @count int
select @count = 5
WHILE @count > 0
BEGIN
  print @count
  delete from applicationlog;
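  -- pause so the query can be cancelled partway through the loop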
  waitfor time '20:00';
  select @count = @count -1
END

However, your query is likely to take many days or weeks, much longer than 15 hours. Your estimate of 2000 records every 6 seconds is wrong, because each iteration of your while loop will take significantly longer against 17 million rows than it does against 2000 rows. So unless your query takes significantly less than a second for 2000 rows, it will take days for all 17 million.

You should ask a new question about how to delete duplicate rows efficiently.

#3


2  

If you don't do anything explicit about transactions, then the connection will be in autocommit transaction mode. In this mode, every SQL statement is considered a transaction.

The question is whether this means the individual SQL statements are transactions and are therefore being committed as you go, or whether the outer WHILE loop counts as a transaction.

There doesn't seem to be any discussion of this in the description of the WHILE construct on MSDN. However, since a WHILE statement can't directly modify the database, it seems logical that it doesn't start an auto-commit transaction.

#4


2  

Implicit transactions

If 'Implicit transactions' has not been set, then each iteration of your loop committed its changes as it ran.

Any SQL Server can be set to use 'Implicit transactions'. This is a database setting (OFF by default). You can also enable implicit transactions in the properties of a particular query inside Management Studio (right-click in the query pane > Options), through the client's default settings, or with a SET statement:

SET IMPLICIT_TRANSACTIONS ON;

Either way, if this were the case, you would still need to execute an explicit COMMIT/ROLLBACK, regardless of any interruption of the query execution.
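
A minimal sketch of that behavior (the table name is hypothetical):

SET IMPLICIT_TRANSACTIONS ON

DELETE FROM MyTable WHERE RecordID = 1  -- implicitly opens a transaction

-- Nothing is durable yet; cancelling the batch or closing the connection
-- here rolls the delete back.
COMMIT  -- only now is the delete permanent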


Implicit transactions reference:

http://msdn.microsoft.com/en-us/library/ms188317.aspx

http://msdn.microsoft.com/en-us/library/ms190230.aspx

#5


1  

I inherited a system that had logic something like yours implemented in SQL. In our case, we were trying to link together rows that had similar names/addresses, etc., using fuzzy matching, and that logic was done purely in SQL. At the time I inherited it, we had about 300,000 rows in the table, and based on the timings, we calculated it would take A YEAR to match them all.

As an experiment to see how much faster I could do it outside of SQL, I wrote a program to dump the db table into flat files, read the flat files into a C++ program, build my own indexes, and do the fuzzy logic there, then reimport the flat files into the database. What took A YEAR in SQL took about 30 seconds in the C++ app.

So, my advice is: don't even try to do what you are doing in SQL. Export, process, re-import.

#6


1  

DELETEs that have been performed up to this point will not be rolled back.


As the original author of the code in question, and having issued the caveat that performance will be dependent on indexes, I would propose the following items to speed this up.

RecordId had better be the PRIMARY KEY. I don't mean IDENTITY, I mean PRIMARY KEY. Confirm this using sp_help.

Some index should be used in evaluating this query. Figure out which of these four columns has the fewest repeats and index it...

SELECT *
FROM MyTable
WHERE @long = longitude
  AND @lat = latitude
  AND @businessname = BusinessName
  AND @phoneNumber = Phone

Before and after adding this index, check the query plan to see whether index scanning has been added.
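
For example, if Phone turns out to be the most selective of the four, the index might look like this (the name is made up):

CREATE INDEX IX_MyTable_Phone ON MyTable (Phone)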

#7


0  

As a loop, your query will struggle to scale well, even with appropriate indexes. The query should be rewritten as a single statement, per the suggestions in your previous question on this.
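
One common single-statement shape is a sketch like the following, using ROW_NUMBER() (available in SQL Server 2005 and later); the ORDER BY mirrors the original query's "winner" priority:

WITH ranked AS (
    SELECT RecordID,
           ROW_NUMBER() OVER (
               PARTITION BY longitude, latitude, BusinessName, Phone
               ORDER BY
                   CASE WHEN webAddress IS NOT NULL THEN 1 ELSE 2 END,
                   CASE WHEN caption1 IS NOT NULL THEN 1 ELSE 2 END,
                   CASE WHEN caption2 IS NOT NULL THEN 1 ELSE 2 END,
                   RecordID) AS rn
    FROM MyTable
)
DELETE FROM ranked
WHERE rn > 1  -- everything except the per-group winner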

If you're not running it explicitly within a transaction, it will only roll back the currently executing statement.

#8


0  

I think this query would be much more efficient if it were rewritten as a single-pass algorithm using a cursor. You would order the cursor by longitude, latitude, BusinessName, and Phone, and step through the rows one at a time. If a row has the same longitude, latitude, BusinessName, and Phone as the previous row, delete it.
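
A sketch of that single pass (column types copied from the question's declarations; note this keeps the first row of each group rather than the original "winner" priority):

DECLARE @id char(10), @long DECIMAL, @lat DECIMAL,
        @name varchar(64), @phone char(10),
        @prevLong DECIMAL, @prevLat DECIMAL,
        @prevName varchar(64), @prevPhone char(10)

-- STATIC: the cursor works off a snapshot, so the deletes below
-- don't disturb the scan
DECLARE dupes CURSOR LOCAL STATIC FOR
    SELECT RecordID, longitude, latitude, BusinessName, Phone
    FROM MyTable
    ORDER BY longitude, latitude, BusinessName, Phone, RecordID

OPEN dupes
FETCH NEXT FROM dupes INTO @id, @long, @lat, @name, @phone

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @long = @prevLong AND @lat = @prevLat
       AND @name = @prevName AND @phone = @prevPhone
        DELETE FROM MyTable WHERE RecordID = @id  -- same key as previous row: a duplicate

    SELECT @prevLong = @long, @prevLat = @lat,
           @prevName = @name, @prevPhone = @phone

    FETCH NEXT FROM dupes INTO @id, @long, @lat, @name, @phone
END

CLOSE dupes
DEALLOCATE dupes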

#9


0  

I think you need to seriously reconsider your methodology. You need to start thinking in sets (although for performance you may need batch processing, rather than going row by row against a 17-million-record table).

First: do all of your records have duplicates? I suspect not, so the first thing you want to do is limit your processing to only those records that have duplicates. Since this is a large table, and you may need to do the deletes in batches over time depending on what other processing is going on, first pull the records you want to deal with into a table of their own, which you then index. You can use a temp table if you will be able to do all of this in one session without stopping; otherwise, create a regular table in your database and drop it at the end.

Something like the following (note that I didn't write the create index statements; I figure you can look that up yourself):

SELECT min(m.RecordID) as RecordID, m.longitude, m.latitude, m.businessname, m.phone
     into #RecordsToKeep
FROM MyTable m
join
(select longitude, latitude, businessname, phone
from MyTable
group by longitude, latitude, businessname, phone
having count(*) > 1) a
on a.longitude = m.longitude and a.latitude = m.latitude and
a.businessname = m.businessname and a.phone = m.phone
group by m.longitude, m.latitude, m.businessname, m.phone
-- note: this keeps the lowest RecordID per group; the original query's
-- webAddress/caption priority cannot be expressed with this GROUP BY



while (select count(*) from #RecordsToKeep) > 0
begin
select top 1000 *
into #Batch
from #RecordsToKeep

Delete m
from mytable m
join #Batch b
        on b.longitude = m.longitude and b.latitude = m.latitude and
        b.businessname = m.businessname and b.phone = m.phone
where m.RecordID <> b.RecordID

Delete r
from #RecordsToKeep r
join #Batch b on r.RecordID = b.RecordID

drop table #Batch  -- so the next iteration's SELECT ... INTO can recreate it

end

-- single-statement alternative to the batched loop above
Delete m
from mytable m
join #RecordsToKeep r
        on r.longitude = m.longitude and r.latitude = m.latitude and
        r.businessname = m.businessname and r.phone = m.phone
where r.RecordID <> m.RecordID

#10


0  

Also consider another method of removing duplicate rows:

delete t1 from table1 as t1 where exists (
    select * from table1 as t2 where
        t1.column1=t2.column1 and
        t1.column2=t2.column2 and
        t1.column3=t2.column3 and
        --add other columns if any
        t1.id>t2.id
)

I suppose that you have an integer id column in your table.

#11


0  

If your machine doesn't have very advanced hardware, it may take SQL Server a very long time to complete that command. I don't know for sure how this operation is performed under the hood, but based on my experience it can be done more efficiently by bringing the records out of the database and into memory for a program that uses a tree structure with a remove-duplicates rule on insertion. Try reading the entire table in chunks (say, 10,000 rows at a time) into a C++ program using ODBC. In the C++ program, use a std::map where the key is the unique key and the value is a struct that holds the rest of the row's data. Loop over all the records and insert them into the map; the map's insert function will handle removing the duplicates. Since a map lookup takes lg(n) time, finding duplicates takes far less time than your while loop. You can then delete the entire table and add the tuples back into the database from the map, either by forming insert queries and executing them via ODBC or by building a text-file script and running it in Management Studio.

#12


-1  

I'm pretty sure that is a negatory. Otherwise, what would the point of transactions be?

