How do I normalize a large, user-generated dataset of company names?

Date: 2021-06-14 04:16:46

Use case: User 1 uploads 100 company names (e.g. Microsoft, Bank of Sierra)

User 2 uploads 100 company names (e.g. The Gap, Uservoice, Microsoft, Inc.)

I want User 1's notion of Microsoft and User 2's notion of Microsoft to map to a centrally maintained entity with a unique index for Microsoft.

If someone uploads a name which isn't in the central repository, I guess I'd like it to be entered as is. But then what happens if that first entry is incorrectly spelled (e.g. Vergin Mobile instead of Virgin Mobile)? How can we best correct it and correlate new uploads to that same index?

Technically, should the central repository be a separate database altogether? Should the user-generated information also live in a separate database from the business transactions that will occur against it?

Starting out with a large definition of the problem and hoping to chunk it up with your input, thanks.

5 solutions

#1


FWIW, this has nothing to do with database normalization. This is a data cleanup task.

Data cleanup cannot be fully automated in the general case. Many people try, but it's impossible to detect all the ways that the input data might be malformed. You can automate some percentage of the cases with techniques such as:

  • Force users to select company names from a list instead of typing them. Of course this is best for single entries, not for bulk uploads.

  • Compare the SOUNDEX of the input company names to the SOUNDEX of company names already in the database. This is useful for identifying possible matches, but it can also give false positives. So you need a human to review them.
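
A rough Python sketch of the SOUNDEX comparison from the second bullet. The `soundex` function is a hand-written version of the classic American Soundex rules; the SOUNDEX built into a given database may differ in edge cases. `name_key` and `candidate_matches` are hypothetical helper names, and comparing codes word by word is an assumption about how to apply a single-word algorithm to multi-word company names.

```python
# Classic American Soundex letter groups; vowels, H, W and Y have no digit.
_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_CODES = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}


def soundex(word):
    """Return the 4-character Soundex code of one word ('' if it has no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a digit
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated digit is kept after it
    return (result + "000")[:4]


def name_key(name):
    """Soundex code of every word in the name, in order."""
    codes = (soundex(word) for word in name.split())
    return tuple(code for code in codes if code)


def candidate_matches(uploaded, known_names):
    """Known names that sound like the uploaded one; a human should review them."""
    key = name_key(uploaded)
    return [known for known in known_names if name_key(known) == key]


known = ["Virgin Mobile", "Microsoft", "Bank of Sierra"]
print(candidate_matches("Vergin Mobile", known))  # prints ['Virgin Mobile']
```

Exact key equality is deliberately strict: "Microsoft, Inc." would not match "Microsoft" here because of the extra word, so this only catches spelling variants, and any hit is still just a suggestion for the reviewer.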

Ultimately, you need to design your software to make it easy for an administrator to "merge" entries (and update any references from other database tables) as they are discovered to be duplicates of one another. There's no elegant way to do this with cascading foreign keys; you just have to write a bunch of UPDATE statements.
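
A minimal sketch of such a merge, using Python's built-in `sqlite3` module. The `company` and `orders` tables, their columns, and the `merge_companies` helper are all hypothetical; `orders` simply stands in for whatever tables reference a company in your system.

```python
import sqlite3


def merge_companies(conn, keep_id, duplicate_id, referencing):
    """Repoint every reference from duplicate_id to keep_id, then delete the duplicate.

    `referencing` lists the (table, column) pairs that hold a company id. The
    names are interpolated into the SQL text, so they must come from trusted
    code, never from user input.
    """
    with conn:  # one transaction: commit on success, roll back on error
        for table, column in referencing:
            conn.execute(
                f"UPDATE {table} SET {column} = ? WHERE {column} = ?",
                (keep_id, duplicate_id),
            )
        conn.execute("DELETE FROM company WHERE id = ?", (duplicate_id,))


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        company_id INTEGER NOT NULL REFERENCES company(id)
    );
    INSERT INTO company (id, name) VALUES (1, 'Virgin Mobile'), (2, 'Vergin Mobile');
    INSERT INTO orders (id, company_id) VALUES (10, 1), (11, 2);
""")

# An administrator decided that company 2 is a misspelled duplicate of company 1.
merge_companies(conn, keep_id=1, duplicate_id=2, referencing=[("orders", "company_id")])
```

Running the updates and the delete in one transaction means a failure halfway through leaves both entries intact instead of a half-merged state.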

#2


There is a whole class of systems called Master Data Management that try to do this for different domains, such as partners, addresses, products. These are typically large, full-featured systems, and nothing that can be done properly in an ad-hoc fashion. These things sound easy at first, but get very difficult very soon.

Sorry I'm not being too cheery here, but this can quickly turn into a nightmare, similar to trying to solve an NP-complete problem...

#3


Do you see what happens when you try to enter a new question on this site? All those previous questions that might be the same?

Probably even that will be insufficient. It's insufficient here.

#4


LinkedIn does this somehow. However, they don't do batch uploads... Basically you want to set up some sort of difference calculator that will trigger an action on potential matches.

Dropping words like "Inc", "The" and others is one rule, and then there is pattern matching or close matching of words that are misspelled.
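
Those two rules could be sketched roughly like this in Python, using `difflib` from the standard library as the "difference calculator". The `NOISE_WORDS` list, the `0.85` cutoff, and the `normalize` and `suggest` helpers are illustrative assumptions, not tuned values.

```python
import difflib
import re

# Hypothetical list of words that carry no identity; extend it as needed.
NOISE_WORDS = {"the", "inc", "incorporated", "corp", "corporation",
               "co", "company", "llc", "ltd"}


def normalize(name):
    """Lowercase the name, strip punctuation, and drop noise words."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    kept = [w for w in words if w not in NOISE_WORDS]
    return " ".join(kept or words)  # keep everything if only noise words remain


def suggest(name, approved, cutoff=0.85):
    """Return approved names whose normalized form is close to the upload."""
    by_key = {normalize(a): a for a in approved}
    matches = difflib.get_close_matches(normalize(name), list(by_key), n=3, cutoff=cutoff)
    return [by_key[m] for m in matches]


approved = ["Microsoft", "Gap", "Virgin Mobile"]
for upload in ["Microsoft, Inc.", "The Gap", "Vergin Mobile", "Uservoice"]:
    print(upload, "->", suggest(upload, approved))
```

An empty result means the name is probably new; a non-empty one is a potential match to put in front of a person rather than something to merge automatically.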

Not an easy thing to do with batch uploads from a workflow standpoint. You will need a known data dictionary that is approved and then each upload/addition has to be vetted. Eventually the number of additions will dwindle.

I agree that this is not a database issue - it is a workflow issue.

EDIT

I would have an approved list, and then some rules that propagate a potential "good" name to the approved list. How you implement that is left as an exercise for the reader...
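
For what it's worth, one possible shape for this, sketched in Python. The promotion rule used here (a name becomes approved once a minimum number of distinct users have uploaded it) is purely an assumed example of "some rules", and the `ApprovedList` class is hypothetical.

```python
from collections import defaultdict


class ApprovedList:
    """Approved names plus a pending queue with one simple promotion rule."""

    def __init__(self, approved, min_uploaders=2):
        self.approved = set(approved)
        self.min_uploaders = min_uploaders
        self.pending = defaultdict(set)  # candidate name -> users who uploaded it

    def submit(self, user, name):
        """Record one uploaded name and return 'approved' or 'pending'."""
        if name in self.approved:
            return "approved"
        self.pending[name].add(user)
        if len(self.pending[name]) >= self.min_uploaders:
            # Assumed rule: enough independent uploaders promotes the name.
            self.approved.add(name)
            del self.pending[name]
            return "approved"
        return "pending"


names = ApprovedList(["Microsoft"])
print(names.submit("user1", "Uservoice"))  # first sighting stays pending
print(names.submit("user2", "Uservoice"))  # a second user promotes it
```

A rule like this can also promote a misspelling that two users happen to share, so the pending queue still needs someone to vet it.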

#5


company table    
  id
  name

company_synonym table
  company_id
  name

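This schema structure addresses the problems you have listed: each company has a single canonical row in company, and every spelling that users upload is stored in company_synonym, pointing at that row.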
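
A minimal sketch of that schema in use, with Python's built-in `sqlite3` module. The column types, the UNIQUE constraints, the choice to store the canonical name as a synonym of itself, and the `add_company`, `add_synonym` and `resolve` helpers are all assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE company_synonym (
        company_id INTEGER NOT NULL REFERENCES company(id),
        name       TEXT NOT NULL UNIQUE  -- each spelling maps to one company
    );
""")


def add_company(conn, name):
    """Create a canonical company and register its name as its first synonym."""
    with conn:
        company_id = conn.execute(
            "INSERT INTO company (name) VALUES (?)", (name,)
        ).lastrowid
        conn.execute(
            "INSERT INTO company_synonym (company_id, name) VALUES (?, ?)",
            (company_id, name),
        )
    return company_id


def add_synonym(conn, company_id, name):
    """Map another spelling to an existing company."""
    with conn:
        conn.execute(
            "INSERT INTO company_synonym (company_id, name) VALUES (?, ?)",
            (company_id, name),
        )


def resolve(conn, uploaded_name):
    """Return the central company id for an uploaded name, or None if unknown."""
    row = conn.execute(
        "SELECT company_id FROM company_synonym WHERE name = ?", (uploaded_name,)
    ).fetchone()
    return row[0] if row else None


microsoft = add_company(conn, "Microsoft")
add_synonym(conn, microsoft, "Microsoft, Inc.")
print(resolve(conn, "Microsoft, Inc.") == resolve(conn, "Microsoft"))
```

With this layout, correcting a misspelled first entry would be an UPDATE of company.name while the old spelling stays behind as a synonym, so earlier uploads keep resolving to the same id.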
