I've been working on a way to join two datasets based on an imperfect string, such as a company name. In the past I had to match two very dirty lists: one had names and financial information, the other had names and addresses. Neither had unique IDs to match on! Assume that cleaning has already been applied and that there may be typos and insertions.
So far agrep is the closest tool I've found that might work. It matches on Levenshtein distance, which counts the deletions, insertions, and substitutions needed to turn one string into another. agrep returns the strings that match within the maximum distance you allow (with value = TRUE), or their indices (with value = FALSE).
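To make the distance itself concrete, here is a minimal sketch using base R's adist(), which computes the generalized Levenshtein distance between strings. The "kitten"/"sitting" pair is my own illustration; the company names come from the example data.

```r
# Levenshtein distance: the minimum number of single-character
# insertions, deletions and substitutions that turn one string into another.
adist("kitten", "sitting")   # 3: k->s, e->i, insert "g"
adist("Ace Co", "Ace Co.")   # 1: insert "."

# adist() also accepts two vectors and returns the full distance matrix.
a_names <- c("Ace Co", "Bayes", "asd")
b_names <- c("Ace Co.", "Bayes Inc.", "asdf")
dist_matrix <- adist(a_names, b_names)
dimnames(dist_matrix) <- list(a_names, b_names)
dist_matrix
```

Each row of the matrix holds the distances from one name in the first list to every name in the second, which is the raw material for any of the matching strategies discussed in the answers.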
However, I've had trouble going from matching a single value to applying the function across an entire data frame. I've crudely used a for loop to repeat the agrep call, but there has to be an easier way.
See the following code:
a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'), price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'), qty = c(9, 99, 10))
for (i in 1:6) {
  a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
  a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
6 Answers
#1
6
The solution depends on the desired cardinality of your matching of a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.
One-to-one case (requires assignment algorithm):
When I've had to do this before, I've treated it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution you'd be better off with optim.
I'm not familiar with agrep, but here's an example using stringdist for your distance matrix.
library(stringdist)
d <- expand.grid(a$name, b$name)  # Distance matrix in long form
names(d) <- c("a_name", "b_name")
d$dist <- stringdist(d$a_name, d$b_name, method = "jw")  # String distance (use your favorite function here)
# Greedy assignment heuristic (your favorite heuristic here)
greedyAssign <- function(a, b, d) {
  x <- numeric(length(a))  # assignment variable: 0 for unassigned but assignable,
                           # 1 for already assigned, -1 for unassigned and unassignable
  while (any(x == 0)) {
    min_d <- min(d[x == 0])  # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d == min_d & x == 0][1]
    b_sel <- b[d == min_d & a == a_sel & x == 0][1]
    x[a == a_sel & b == b_sel] <- 1
    x[x == 0 & (a == a_sel | b == b_sel)] <- -1
  }
  cbind(a = a[x == 1], b = b[x == 1], d = d[x == 1])
}
data.frame(greedyAssign(as.character(d$a_name), as.character(d$b_name), d$dist))
Produces the assignment:
       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333
I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
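One possible tidier variant, offered only as a sketch: keep the distances in a wide matrix and repeatedly take the smallest remaining entry, retiring its row and column. To stay in base R this uses adist() (Levenshtein) rather than the "jw" method above, so the distances differ from the output shown; greedy_match is a hypothetical helper name, not part of any package.

```r
a_names <- c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays")
b_names <- c("Ace Co.", "Bayes Inc.", "asdf")

# Wide distance matrix: rows are names from a, columns are names from b.
dm <- adist(a_names, b_names)
dimnames(dm) <- list(a_names, b_names)

greedy_match <- function(dm) {
  n <- min(dim(dm))  # at most one match per row and per column
  out <- data.frame(a = character(n), b = character(n), d = numeric(n),
                    stringsAsFactors = FALSE)
  for (k in seq_len(n)) {
    # Closest remaining pair; the first one found wins on ties.
    idx <- which(dm == min(dm), arr.ind = TRUE)[1, ]
    out$a[k] <- rownames(dm)[idx["row"]]
    out$b[k] <- colnames(dm)[idx["col"]]
    out$d[k] <- dm[idx["row"], idx["col"]]
    # Retire the matched row and column so neither can be reused.
    dm[idx["row"], ] <- Inf
    dm[, idx["col"]] <- Inf
  }
  out
}

result <- greedy_match(dm)
result
```

On this data it pairs the same three names as the heuristic above. Like any greedy scheme it is not guaranteed to minimise the total distance.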
Many-to-one case (not an assignment problem):
do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))
Produces the result:
   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333
Edit: use method="jw" to produce the desired results. See help("stringdist-package").
#2
2
I am not sure if this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.
install.packages("ipred")
install.packages("evd")
install.packages("RSQLite")
install.packages("ff")
install.packages("ffbase")
install.packages("ada")
install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")
require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above
compareJW <- function(string, vec, cutoff) {
require(RecordLinkage)
jarowinkler(string, vec) > cutoff
}
a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
a$name <- as.character(a$name)
b$name <- as.character(b$name)
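# Note: as the output below suggests, jarowinkler() compares its two arguments
# element-wise and recycles the shorter one, so a$name[4:6] are compared with
# b$name[1:3] again rather than with every name in b.
test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8) # pick your level of cutoff, of course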
data.frame(name = a$name, price = a$price, test = test)
> data.frame(name = a$name, price = a$price, test = test)
    name price  test
1 Ace Co    10  TRUE
2  Bayes    13  TRUE
3    asd     2  TRUE
4    Bcy     1 FALSE
5   Baes    15  TRUE
6   Bays     1 FALSE
#3
1
I agree with the answer above ("Not familiar with AGREP but here's example using stringdist for your distance matrix."), but adding the signature function below, taken from "Merging Data Sets Based on Partially Matched Data Elements", should make the matching more accurate, because the Levenshtein (LV) distance depends on character position as well as on additions and deletions.
##Here's where the algorithm starts...
##I'm going to generate a signature from country names to reduce some of the minor differences between strings
##In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces.
##So for example, United Kingdom would become kingdomunited
##We might also remove stopwords such as 'the' and 'of'.
signature=function(x){
sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
return(sig)
}
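A short usage sketch may help here. The function handles one string at a time, so it is applied element-wise, and the resulting signature can serve as an exact join key. The two small data frames and their numeric values are made-up placeholders, not real data.

```r
# Same function as above, repeated so this example runs on its own.
signature <- function(x) {
  paste(sort(unlist(strsplit(tolower(x), " "))), collapse = "")
}

signature("United Kingdom")   # "kingdomunited"
signature("Kingdom United")   # "kingdomunited"

# Placeholder data: the same countries written with different word order.
left  <- data.frame(country = c("United Kingdom", "Korea South"),
                    gdp = c(1, 2), stringsAsFactors = FALSE)
right <- data.frame(country = c("Kingdom United", "South Korea"),
                    pop = c(3, 4), stringsAsFactors = FALSE)

# Build the signature column on both sides, then join on it exactly.
left$sig  <- vapply(left$country, signature, character(1), USE.NAMES = FALSE)
right$sig <- vapply(right$country, signature, character(1), USE.NAMES = FALSE)
merged <- merge(left, right, by = "sig")
merged
```

The signature only removes differences in case and word order; typos still need a string distance on top of it.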
#4
1
I use lapply for those circumstances:
yournewvector <- lapply(yourvector$yourvariable, agrep, yourothervector$yourothervariable, max.distance=0.01)
Writing the result to a CSV is then not so straightforward:
write.csv(matrix(yournewvector, ncol=1), file="yournewvector.csv", row.names=FALSE)
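As a sketch of why the CSV step is awkward and one way around it: lapply() returns a list whose elements can hold zero, one, or several matches, so each element needs to be collapsed to a single string before it fits in a column. The data frames reuse the names from the question; max.distance = 0.1 and the "; " separator are my own choices.

```r
a <- data.frame(name = c("Ace Co", "Bayes", "asd"), price = c(10, 13, 2),
                stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "asdf"), qty = c(9, 99, 10),
                stringsAsFactors = FALSE)

# One agrep() call per name in a; each list element holds all matching names from b.
matches <- lapply(a$name, agrep, x = b$name, max.distance = 0.1, value = TRUE)

# Collapse each element to one string ("" when nothing matched) so it fits in a column.
a$match <- vapply(matches, paste, character(1), collapse = "; ")

# A plain data frame writes to CSV without further tricks.
out_file <- tempfile(fileext = ".csv")
write.csv(a, file = out_file, row.names = FALSE)
```

Keeping the matches next to the original rows also makes it easy to spot names that matched nothing or matched more than one candidate.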
#5
1
Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.
As suggested by C8H10N4O2, the stringdist method="jw" creates the best matches for your example.
As suggested by dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::top_n to keep only the best match, the one with the minimum distance.
a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
                price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
                qty = c(9, 99, 10))
library(fuzzyjoin)
library(dplyr)
stringdist_join(a, b,
                by = "name",
                mode = "left",
                ignore_case = FALSE,
                method = "jw",
                max_dist = 99,
                distance_col = "dist"
) %>%
  group_by(name.x) %>%
  top_n(1, -dist)
#> # A tibble: 6 x 5
#> # Groups:   name.x [6]
#>   name.x price     name.y   qty       dist
#>   <fctr> <dbl>     <fctr> <dbl>      <dbl>
#> 1 Ace Co    10    Ace Co.     9 0.04761905
#> 2  Bayes    13 Bayes Inc.    99 0.16666667
#> 3    asd     2       asdf    10 0.08333333
#> 4    Bcy     1 Bayes Inc.    99 0.37777778
#> 5   Baes    15 Bayes Inc.    99 0.20000000
#> 6   Bays     1 Bayes Inc.    99 0.20000000
#6
-1
Here is what I used to count how many times a company appears in a list even though the company names are inexact matches:
step.1 Install the phonics package.
step.2 Create a new column called "soundexcodes" in "mylistofcompanynames".
step.3 Use the soundex function to fill "soundexcodes" with the Soundex codes of the company names.
step.4 Copy the company names and the corresponding Soundex codes into a new file called "companysoundexcodestrainingfile" (2 columns called "companynames" and "soundexcodes").
step.5 Remove duplicate soundexcodes in "companysoundexcodestrainingfile".
step.6 Go through the list of remaining company names and change each name to the way you want it to appear in your original list.
Example: "Amazon Inc" (A525) can become "Amazon" (A525), and "Accenture Limited" (A253) can become "Accenture" (A253).
step.7 Perform a left_join (or a simple vlookup) between companysoundexcodestrainingfile$soundexcodes and mylistofcompanynames$soundexcodes by "soundexcodes".
step.8 The result should be the original list with a new column called "co.y", which holds the company name the way you left it in the training file.
step.9 Sort "co.y" and check whether most of the company names are matched correctly; if so, replace the old company names with the new ones given by the lookup on the Soundex code.
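The steps above can be sketched in base R roughly as follows. To keep the example self-contained it uses a simplified American Soundex written here, soundex_sketch (a hypothetical stand-in, not the phonics package's soundex function, whose handling of spaces and punctuation may differ). The company list is a made-up placeholder, and the manual editing step is simulated by assigning the cleaned names directly.

```r
# Simplified American Soundex: first letter plus three digits.
soundex_sketch <- function(x) {
  one <- function(s) {
    chars <- strsplit(gsub("[^A-Z]", "", toupper(s)), "")[[1]]
    if (length(chars) == 0) return(NA_character_)
    # Consonant groups map to 1-6, vowels and Y to 0, H and W to 9.
    codes <- chartr("BFPVCGJKQSXZDTLMNRAEIOUYHW",
                    "11112222222233455600000099", chars)
    # H and W do not separate equal codes, so drop them (except in first position).
    keep <- codes != "9"
    keep[1] <- TRUE
    codes <- codes[keep]
    # Collapse runs of equal codes, drop the first letter's own code, drop vowels.
    runs <- rle(codes)$values[-1]
    runs <- runs[runs != "0"]
    # Pad with zeros and keep exactly three digits.
    paste0(chars[1], paste(c(runs, "0", "0", "0")[1:3], collapse = ""))
  }
  unname(vapply(x, one, character(1)))
}

# Steps 2-3: add the codes to the original list (placeholder data).
mylist <- data.frame(companynames = c("Amazon Inc", "Amazon",
                                      "Accenture Limited", "Accenture"),
                     stringsAsFactors = FALSE)
mylist$soundexcodes <- soundex_sketch(mylist$companynames)

# Steps 4-5: training table with one row per distinct code.
training <- mylist[!duplicated(mylist$soundexcodes), ]

# Step 6: the manual clean-up, simulated here by direct assignment.
training$companynames <- c("Amazon", "Accenture")

# Steps 7-8: join the cleaned names back onto the original list by code.
result <- merge(mylist, training, by = "soundexcodes")

# Count how often each cleaned company name appears.
counts <- table(result$companynames.y)
counts
```

Because Soundex keeps only the first letter and three digits, unrelated names can share a code, which is why the manual check of the joined column in the last step matters.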