Comparing two lists of first and last names to find matches

Time: 2022-05-13 06:45:07

I currently have two tables in SQL Server: TableA with 40,000 records and TableB with 2.1 million records.
Each table has 3 columns: RowID, First_Name, and Last_Name.

I am currently taking the First and Last name from the first row of TableA and comparing it to the First and Last name in EVERY row in TableB until it finds a match. However, as you might imagine, my computer does not have enough/strong enough resources to complete this task. It will run for a few hours then SQL Server will crash, and it doesn't save any of the work it has already completed. I have thought about only allowing the loop to run for a set number of records and then restarting the loop from there so I can preserve some data before SQL crashes, but that's going to take forever.

I am looking for suggestions for other programs or languages to solve this problem. I'm going to keep trying to perfect my SQL query in the meantime to speed up this process, such as by only comparing records if they have the same initials. I don't really know much about other programs or languages, so I'm open to trying something other than SQL Server. I don't know if there's a language out there that is better with resources or better with "timing out" than SQL Server is. I know a lot about Linux, so if there's something out there that I can utilize Linux to save on some of the resources compared to Windows 8, I would definitely be open to that. I don't know if something like Python would work better, a Linux version of SQL, etc?

I appreciate your help and thank you for your time!

EDIT ----- Here's a simple version of the Query I'm running.

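-- Variable types and the Matches table are assumed here for illustration.
DECLARE @TableANameF NVARCHAR(100), @TableANameL NVARCHAR(100)
    ,@TableBNameF NVARCHAR(100), @TableBNameL NVARCHAR(100)
    ,@TableARowIndex INT, @TableBRowIndex INT
    ,@TableARowCount INT, @TableBRowCount INT
    ,@NameFDifference INT, @NameLDifference INT

SET @TableARowIndex = 1

SELECT @TableARowCount = COUNT(RowID)
FROM TableA

-- Outer loop: one pass per row of TableA
WHILE (@TableARowIndex <= @TableARowCount)
BEGIN
    SELECT @TableANameF = First_Name
        ,@TableANameL = Last_Name
    FROM TableA
    WHERE RowID = @TableARowIndex

    SET @TableBRowIndex = 1

    SELECT @TableBRowCount = COUNT(RowID)
    FROM TableB

    -- Inner loop: compare the current TableA row against every row of TableB
    WHILE (@TableBRowIndex <= @TableBRowCount)
    BEGIN
        SELECT @TableBNameF = First_Name
            ,@TableBNameL = Last_Name
        FROM TableB
        WHERE RowID = @TableBRowIndex

        SET @NameFDifference = DIFFERENCE(@TableANameF, @TableBNameF)
        SET @NameLDifference = DIFFERENCE(@TableANameL, @TableBNameL)

        -- Insert into another table to track my matches (table name is hypothetical)
        IF (@NameFDifference > 3 AND @NameLDifference > 3)
            INSERT INTO Matches (TableARowID, TableBRowID)
            VALUES (@TableARowIndex, @TableBRowIndex)

        SET @TableBRowIndex = @TableBRowIndex + 1
    END

    SET @TableARowIndex = @TableARowIndex + 1
END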

2 Answers

#1


I think you merely need indexes:

create index idx_tablea_firstname_lastname on tablea(First_Name, Last_Name);
create index idx_tableb_firstname_lastname on tableb(First_Name, Last_Name);

I'm not sure what exactly you want to get, but you should be doing a query in the database as opposed to looping:

select a.*, b.rowid
from tablea a join
     tableb b
     on a.First_Name = b.First_Name and a.Last_Name = b.Last_Name;

SQL is the correct language/tool for this problem. You just have to allow the database to do the work.
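
As a rough illustration of the index-plus-join approach, here is a minimal sketch that runs the same kind of query from Python. It uses an in-memory SQLite database as a stand-in for SQL Server, which is an assumption made only so the example is self-contained; the index names and sample rows are made up.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (RowID INTEGER PRIMARY KEY, First_Name TEXT, Last_Name TEXT);
    CREATE TABLE TableB (RowID INTEGER PRIMARY KEY, First_Name TEXT, Last_Name TEXT);
    CREATE INDEX idx_tablea_name ON TableA (First_Name, Last_Name);
    CREATE INDEX idx_tableb_name ON TableB (First_Name, Last_Name);
""")

# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO TableA VALUES (?, ?, ?)",
                 [(1, "Ann", "Lee"), (2, "Bob", "Stone")])
conn.executemany("INSERT INTO TableB VALUES (?, ?, ?)",
                 [(1, "Bob", "Stone"), (2, "Cy", "Young"), (3, "Ann", "Lee")])
conn.commit()

# A single set-based join replaces both WHILE loops; the database decides
# how to use the indexes instead of visiting every pair of rows by hand.
matches = conn.execute("""
    SELECT a.RowID, b.RowID
    FROM TableA a
    JOIN TableB b
      ON a.First_Name = b.First_Name AND a.Last_Name = b.Last_Name
    ORDER BY a.RowID, b.RowID
""").fetchall()

print(matches)  # [(1, 3), (2, 1)]
```

Each tuple pairs a TableA RowID with the RowID of a TableB row carrying the same first and last name.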

#2


If you create an index on first_name and last_name in both tables, this should be really fast.

SELECT A.*
FROM  TableA A
INNER JOIN TableB B
        ON DIFFERENCE(A.First_Name, B.First_Name) > 3
       AND DIFFERENCE(A.Last_Name, B.Last_Name) > 3

The problem here is that DIFFERENCE won't use any index, so every pair of rows still has to be compared.
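
One possible way around this is to compute a phonetic code once per row and then match on equal codes, which a lookup structure (or an index on stored code columns) can serve. The sketch below shows the idea in Python under two assumptions: that a DIFFERENCE score above 3 is meant as "both names have the same Soundex code", and that the simplified `soundex` helper here is close enough to SQL Server's SOUNDEX, which may differ in edge cases. The function names and sample rows are hypothetical.

```python
from collections import defaultdict

# Letter-to-digit table for a simplified American Soundex (assumption:
# SQL Server's SOUNDEX may treat some edge cases differently).
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character Soundex-style code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return "0000"
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = CODES.get(ch, "")  # vowels map to "" and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def phonetic_matches(table_a, table_b):
    """Pair rows whose first and last names share the same codes."""
    # Bucket TableB once by its pair of codes.
    buckets = defaultdict(list)
    for row_id, first, last in table_b:
        buckets[(soundex(first), soundex(last))].append(row_id)
    # Each TableA row then needs a single lookup, not a scan of TableB.
    return [(a_id, b_id)
            for a_id, first, last in table_a
            for b_id in buckets.get((soundex(first), soundex(last)), [])]


# Made-up sample rows: (RowID, First_Name, Last_Name).
table_a = [(1, "Robert", "Smith"), (2, "Ann", "Lee")]
table_b = [(1, "Rupert", "Smyth"), (2, "Bob", "Stone"), (3, "Anne", "Lea")]

result = phonetic_matches(table_a, table_b)
print(result)  # [(1, 1), (2, 3)]
```

The work grows with the number of rows plus the number of matches, rather than with the product of the two table sizes.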

The other solution is to create a stored procedure that runs your query and saves the current row index in another table, so you can resume from there if the query fails.
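
A minimal sketch of that checkpoint-and-resume idea, written in Python against an in-memory SQLite database rather than as a SQL Server stored procedure (an assumption made so the example can run anywhere). The `Matches` and `Progress` tables, the `run_batch` helper, and the sample rows are hypothetical, and the match condition is plain equality because SQLite has no DIFFERENCE function by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (RowID INTEGER PRIMARY KEY, First_Name TEXT, Last_Name TEXT);
    CREATE TABLE TableB (RowID INTEGER PRIMARY KEY, First_Name TEXT, Last_Name TEXT);
    CREATE TABLE Matches (A_RowID INTEGER, B_RowID INTEGER);  -- hypothetical
    CREATE TABLE Progress (LastRowID INTEGER);                -- hypothetical
    INSERT INTO Progress VALUES (0);
""")
# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO TableA VALUES (?, ?, ?)",
                 [(1, "Ann", "Lee"), (2, "Bob", "Stone"), (3, "Cy", "Young")])
conn.executemany("INSERT INTO TableB VALUES (?, ?, ?)",
                 [(1, "Bob", "Stone"), (2, "Cy", "Young"), (3, "Dee", "Hill")])
conn.commit()


def run_batch(conn, batch_size):
    """Match the next batch of TableA rows; return False when none are left."""
    last_id = conn.execute("SELECT LastRowID FROM Progress").fetchone()[0]
    ids = [row[0] for row in conn.execute(
        "SELECT RowID FROM TableA WHERE RowID > ? ORDER BY RowID LIMIT ?",
        (last_id, batch_size))]
    if not ids:
        return False
    # Matches and checkpoint are committed together, so a crash never
    # leaves a batch half-recorded.
    with conn:
        conn.execute("""
            INSERT INTO Matches (A_RowID, B_RowID)
            SELECT a.RowID, b.RowID
            FROM TableA a
            JOIN TableB b
              ON a.First_Name = b.First_Name AND a.Last_Name = b.Last_Name
            WHERE a.RowID > ? AND a.RowID <= ?
        """, (last_id, ids[-1]))
        conn.execute("UPDATE Progress SET LastRowID = ?", (ids[-1],))
    return True


# First batch only, as if the job stopped right after it.
run_batch(conn, batch_size=2)
checkpoint = conn.execute("SELECT LastRowID FROM Progress").fetchone()[0]

# "Restart": the loop picks up after the saved RowID instead of from row 1.
while run_batch(conn, batch_size=2):
    pass

matches = conn.execute(
    "SELECT A_RowID, B_RowID FROM Matches ORDER BY A_RowID").fetchall()
print(checkpoint, matches)  # 2 [(2, 1), (3, 2)]
```

Because each batch is still a set-based join, the checkpoint only limits how much work is lost on a failure; it does not bring back the row-by-row loop.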
