How do I compare two name strings with different formats in SQL Server?

Time: 2021-12-19 12:05:38

What would be the best approach for comparing the following set of strings in SQL Server?


Purpose: The main purpose of this stored procedure is to compare a set of input customer names to the names that exist in the customer database for specific accounts. If the input name differs from the name in the database, this should trigger an update of the customer database with the new name information.

Conditions:

Format of Input: FirstName [MiddleName] LastName


Format of Value in Database: LastName, FirstName MiddleName


The complication arises when names like this are presented:

Example:

Input: Dr. John A. Mc Donald


Database: Mc Donald, Dr. John A.


For last names that consist of two or more parts, what logic would have to be put in place to ensure that the last name in the input is compared to the last name in the database, and likewise for the first name and middle name?

I've thought about breaking the database values up into a temp HASH table since I know that everything before the ',' in the database is the last name. I could then check to see if the input contains the lastname and split out the FirstName [MiddleName] from it to perform another comparison to the database for the values that come after the ','.
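The idea described above can be sketched roughly as follows. This is only an illustration for a single pair of values: the variable names are made up for the example, and it assumes the stored value contains exactly one ',' separating the last name from the rest.

```sql
-- Hypothetical sample values; in practice these would come from the input and the customer table.
declare @dbName varchar(200), @inputName varchar(200)
set @dbName = 'Mc Donald, Dr. John A.'
set @inputName = 'Dr. John A. Mc Donald'

declare @comma int, @dbLast varchar(200), @dbRest varchar(200), @inputRest varchar(200)
set @comma = charindex(',', @dbName)

-- Everything before the ',' is the last name; everything after it is the first/middle part.
if @comma > 0
begin
    set @dbLast = rtrim(left(@dbName, @comma - 1))
    set @dbRest = ltrim(substring(@dbName, @comma + 1, len(@dbName)))
end

-- If the input ends with the stored last name, whatever precedes it is FirstName [MiddleName].
if right(@inputName, len(@dbLast) + 1) = ' ' + @dbLast
    set @inputRest = rtrim(left(@inputName, len(@inputName) - len(@dbLast)))

select [dbLast] = @dbLast,
       [dbRest] = @dbRest,
       [inputRest] = @inputRest,
       -- 1 only when the last name was found and the remaining parts also agree.
       [namesMatch] = case when @inputRest = @dbRest then 1 else 0 end
```

When the stored last name is not found at the end of the input, @inputRest stays NULL and namesMatch is 0, which is exactly the case where the position of the new last name is still unknown.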


There is a second part to this, however: the input name may have a completely new last name (e.g. the name in the database is Mary Smith but the updated input name is now Mary Mc Donald). In this case, comparing the database value of the last name before the ',' to the input name will result in no match, which is correct. But at this point, how does the code know where the last name even begins in the input value? How does it know that her middle name isn't Mc and her last name Donald?

Has anyone had to deal with a problem like this before? What solutions did you end up going with?

I greatly appreciate your input and ideas.


Thank you.

3 Solutions

#1


2  

Realistically, it's extremely computationally difficult (if not impossible) to know if a name like "Mary Jane Evelyn Scott" is first-middle-last1-last2, first1-first2-middle-last, first1-first2-last1-last2, or some other combination... and that's not even getting into cultural considerations...


So personally, I would suggest a change in the data structure (and, correspondingly, the application's input fields). Instead of a single string for name, break it into several fields, e.g.:


FullName{
  title,      //e.g. Dr., Professor, etc.
  firstName,  //or given name
  middleName, //doesn't exist in all countries!
  lastName,   //or surname
  qualifiers  //e.g. Sr., Jr., fils, D.D.S., PE, Ph.D., etc.
}

Then the user could choose that their first name is "Mary", their middle name is "Jane Evelyn", and their last name is "Scott".


UPDATE
Based on your comments, if you must do this entirely in SQL, I'd do something like the following:


  1. Build a table of all possible combinations of "lastname, firstname [middlename]" for a given input string "firstname [middlename] lastname".
  2. Run a query that joins your original data against all of those possible orderings.

So step 1 would take the string "Dr. John A. Mc Donald" and create the table of values:

'Donald, Dr. John A. Mc'
'Mc Donald, Dr. John A.'
'A. Mc Donald, Dr. John'
'John A. Mc Donald, Dr.'

Step 2 would then search the database for occurrences of any of those strings.
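As a rough sketch of that second step, the candidate strings can be joined against the stored names. The table variables @Customer and @Candidates, their columns, and the sample rows are hypothetical stand-ins for the real customer table and for the output of step 1.

```sql
-- Hypothetical stand-in for the customer table; real table and column names will differ.
declare @Customer table (CustomerId int, FullName varchar(200))
insert into @Customer (CustomerId, FullName)
select 1, 'Mc Donald, Dr. John A.' union all
select 2, 'Smith, Mary'

-- Candidate orderings that step 1 produces for 'Dr. John A. Mc Donald'.
declare @Candidates table (FullName varchar(200))
insert into @Candidates (FullName)
select 'Donald, Dr. John A. Mc' union all
select 'Mc Donald, Dr. John A.' union all
select 'A. Mc Donald, Dr. John' union all
select 'John A. Mc Donald, Dr.'

-- A returned row means the stored name equals one of the possible orderings,
-- so the input does not represent a name change for that customer.
select c.CustomerId, c.FullName
from @Customer c
    join @Candidates k on k.FullName = c.FullName
```

In the real procedure the join would presumably also be restricted to the account being checked, so that a match against some other customer's name does not count.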

Assuming MSSQL 2005 or later, step 1 can be achieved with recursive CTEs and a modification of a method I've used to split CSV strings (found here). SQL isn't the ideal language for this form of string manipulation:

declare @str varchar(200)
set @str = 'Dr. John A. Mc Donald'

--Create a numbers table
select [Number] = identity(int)
into #Numbers
from sysobjects s1
    cross join sysobjects s2

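--The parenthesised WITH (...) option syntax replaces the deprecated bare WITH IGNORE_DUP_KEY form.
create unique clustered index Number_ind on #Numbers(Number) with (IGNORE_DUP_KEY = ON)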

;with nameParts as (
    --Split the name string at the spaces.
    select [ord] = row_number() over(order by Number),
        [part] = substring(fn1, Number, charindex(' ', fn1+' ', Number) - Number)
    from (select @str fn1) s
        join #Numbers n on substring(' '+fn1, Number, 1) = ' '
    where Number<=Len(fn1)+1

),
lastNames as (
    --Build all possible lastName strings.
    select [firstOrd]=ord, [lastOrd]=ord, [lastName]=cast(part as varchar(max))
    from nameParts
    where ord!=1 --remove the case where the whole string is the last name
    UNION ALL
    select firstOrd, p.ord, l.lastName+' '+p.part
    from lastNames l
        join nameParts p on l.lastOrd+1=p.ord
),
firstNames as (
    --Build all possible firstName strings.
    select [firstOrd]=ord, [lastOrd]=ord, [firstName]=cast(part as varchar(max))
    from nameParts
    where ord!=(select max(ord) from nameParts) --remove the case where the whole string is the first name
    UNION ALL
    select p.ord, f.lastOrd, p.part+' '+f.firstName
    from firstNames f
        join nameParts p on f.firstOrd-1 = p.ord
)
--Combine for all possible name strings.
select ln.lastName+', '+fn.firstName
from firstNames fn
    join lastNames ln on fn.lastOrd+1=ln.firstOrd
where fn.firstOrd=1
    and ln.lastOrd = (select max(ord) from nameParts)

drop table #Numbers

#2


1  

Having had my share of terrible experiences with data from third parties, I can almost guarantee that the input data will contain lots of garbage that doesn't follow the specified format.
When trying to match multipart string data like yours, I preprocessed both the input and our own data into something I called a "normalized string" using the following method:

  1. strip all non-alphanumeric chars such as punctuation (leaving language-specific chars like "č" intact)
  2. compact spaces (replace multiple spaces with a single one)
  3. lower case
  4. split into words
  5. remove duplicates
  6. sort alphabetically
  7. join back into a string separated by dashes
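The steps above could be sketched in T-SQL along these lines. This is not the original function, only one possible reading of it. It assumes SQL Server 2017 or later (for STRING_AGG, plus STRING_SPLIT at database compatibility level 130 or higher), handles only '.' and ',' as punctuation, and replaces them with a space instead of deleting them so that a value like "Donald,John" still splits into two words.

```sql
-- Hypothetical sample value.
declare @name nvarchar(200) = N'Mc Donald, Dr. John A.'

-- Strip punctuation (only '.' and ',' in this sketch) and lower-case the string.
declare @clean nvarchar(200) = lower(replace(replace(@name, N'.', N' '), N',', N' '))

-- Split on spaces, drop the empty pieces left by repeated spaces,
-- remove duplicates, sort alphabetically and join with dashes.
select [normalized] = string_agg(w.word, N'-') within group (order by w.word)
from (
    select distinct [word] = value
    from string_split(@clean, N' ')
    where value <> N''
) w
```

Filtering out the empty pieces does the work of the "compact spaces" step here, so no separate pass over the string is needed.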

Using your sample data, this function would produce:

Dr. John A. Mc Donald  -> a-donald-dr-john-mc
Mc Donald, Dr. John A. -> a-donald-dr-john-mc

Unfortunately it's not 100% bulletproof: there are cases where degenerate inputs produce false matches.

#3


0  

The name field in your database is badly designed. Redesign it and get rid of it. If you have a first name, middle name, last name, prefix and suffix structure, you can have a computed field that reproduces the format you are using now. But a single combined name is a very poor way to store data, and your first priority should be to stop using it.
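A minimal sketch of such a structure with a computed field might look like this. The table name, column names and sizes are assumptions made for the example, and the suffix is left out of the computed value because the current format does not say where it would go.

```sql
-- Hypothetical table; names and sizes are assumptions.
create table #Customer (
    CustomerId int         not null primary key,
    Prefix     varchar(20) null,
    FirstName  varchar(50) not null,
    MiddleName varchar(50) null,
    LastName   varchar(50) not null,
    Suffix     varchar(20) null,
    -- Rebuilds the existing "LastName, [Prefix] FirstName [MiddleName]" format.
    -- NULL + ' ' is NULL under the default CONCAT_NULL_YIELDS_NULL setting,
    -- so missing parts add neither text nor a stray space.
    FullName as (
        LastName + ', '
        + isnull(Prefix + ' ', '')
        + FirstName
        + isnull(' ' + MiddleName, '')
    )
)

insert into #Customer (CustomerId, Prefix, FirstName, MiddleName, LastName)
select 1, 'Dr.', 'John', 'A.', 'Mc Donald'

-- Returns 'Mc Donald, Dr. John A.' for the sample row.
select FullName from #Customer

drop table #Customer
```

With the parts stored separately, existing code that reads the combined value can keep using FullName while new code compares the individual columns.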

Since you have a common customer Id, why aren't you matching on that instead of on the name?
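As an illustration of matching on the Id, the sketch below pairs each input row with its stored row by CustomerId and then only checks whether the name has changed, by rewriting the stored "LastName, FirstName MiddleName" value into the input format. The table variables, column names and sample rows are hypothetical, and every stored name is assumed to contain a ','.

```sql
-- Hypothetical stand-ins for the customer table and the incoming data.
declare @Customer table (CustomerId int primary key, FullName varchar(200))
insert into @Customer (CustomerId, FullName)
select 1, 'Mc Donald, Dr. John A.' union all
select 2, 'Smith, Mary'

declare @Input table (CustomerId int primary key, InputName varchar(200))
insert into @Input (CustomerId, InputName)
select 1, 'Dr. John A. Mc Donald' union all
select 2, 'Mary Mc Donald'

-- Returns only the customers whose input name differs from the stored name.
select i.CustomerId, i.InputName, c.FullName
from @Input i
    join @Customer c on c.CustomerId = i.CustomerId
where i.InputName <>
      ltrim(substring(c.FullName, charindex(',', c.FullName) + 1, len(c.FullName)))
      + ' '
      + rtrim(left(c.FullName, charindex(',', c.FullName) - 1))
```

With these sample rows only customer 2 is returned. Detecting the change this way needs no guess about where the last name starts; writing the new value back in the stored format still does.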
