I have two datasets
我有两个数据集
datf1 <- data.frame (name = c("regular", "kklmin", "notSo", "Jijoh",
"Kish", "Lissp", "Kcn", "CCCa"),
number1 = c(1, 8, 9, 2, 18, 25, 33, 8))
#-----------
name number1
1 regular 1
2 kklmin 8
3 notSo 9
4 Jijoh 2
5 Kish 18
6 Lissp 25
7 Kcn 33
8 CCCa 8
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp",
"KcN", "CaPN"),
number2 = c(2, 8, 12, 13, 20, 18, 13))
#-------------
name number2
1 reGulr 2
2 ntSo 8
3 Jijoh 12
4 sean 13
5 LiSsp 20
6 KcN 18
7 CaPN 13
I want to merge them by name column, however with partial match is allowed (to avoid hampering merging spelling errors in large data set and even to detect such spelling errors) and for example
我想通过名称列合并它们,但允许部分匹配(以避免在大型数据集中合并拼写错误,甚至检测此类拼写错误),例如
(1) If consecutive four letters (all if the number of letters are less than 4) at any position - match that is fine
(1)如果在任何位置连续四个字母(所有字母数小于4) - 匹配即可
ABBCD = BBCDK = aBBCD = ramABBBCD = ABB
(2) Case sensitivity is off in the match e.g ABBCD = aBbCd
(2)在匹配中关闭区分大小写,例如ABBCD = aBbCd
(3) The new dataset will have both names (names from datf1 and datf2) preserved. So that letter we can detect if the match is perfect (may a separate column with how many letter do match)
(3)新数据集将保留两个名称(来自datf1和datf2的名称)。所以这封信我们可以检测到匹配是否完美(可能是一个单独的列,其中有多少个字母匹配)
Is such merge possible ?
这种合并是否可行?
Edits:
datf1 <- data.frame (name = c("xxregular", "kklmin", "notSo", "Jijoh",
"Kish", "Lissp", "Kcn", "CCCa"),
number1 = c(1, 8, 9, 2, 18, 25, 33, 8))
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean",
"LiSsp", "KcN", "CaPN"),
number2 = c(2, 8, 12, 13, 20, 18, 13))
uglyMerge(datf1, datf2)
name1 name2 number1 number2 matches
1 xxregular <NA> 1 NA 0
2 kklmin <NA> 8 NA 0
3 notSo <NA> 9 NA 0
4 Jijoh Jijoh 2 12 5
5 Kish <NA> 18 NA 0
6 Lissp LiSsp 25 20 5
7 Kcn KcN 33 18 3
8 CCCa <NA> 8 NA 0
9 <NA> reGulr NA 2 0
10 <NA> ntSo NA 8 0
11 <NA> sean NA 13 0
12 <NA> CaPN NA 13 0
2 个解决方案
#1
7
Maybe there is a simple solution but I can't find any.
IMHO you have to implement this kind of merging for your own.
Please find an ugly example below (there is a lot of space for improvements):
也许有一个简单的解决方案,但我找不到任何。恕我直言,你必须为自己实现这种合并。请在下面找到一个丑陋的例子(有很多改进空间):
uglyMerge <- function(df1, df2) {
## lower all strings to allow case-insensitive comparison
lowerNames1 <- tolower(df1[, 1]);
lowerNames2 <- tolower(df2[, 1]);
## split strings into single characters
names1 <- strsplit(lowerNames1, "");
names2 <- strsplit(lowerNames2, "");
## create the final dataframe
mergedDf <- data.frame(name1=as.character(df1[,1]), name2=NA,
number1=df1[,2], number2=NA, matches=0,
stringsAsFactors=FALSE);
## store names of dataframe2 (to remember which strings have no match)
toMerge <- df2[, 1];
for (i in seq(along=names1)) {
for (j in seq(along=names2)) {
## set minimal match to 4 or to string length
minMatch <- min(4, length(names2[[j]]));
## find single matches
matches <- names1[[i]] %in% names2[[j]];
## look for consecutive matches
r <- rle(matches);
## any matches found?
if (any(r$values)) {
## find max consecutive match
possibleMatch <- r$value == TRUE;
maxPos <- which(which.max(r$length[possibleMatch]) & possibleMatch)[1];
## store max conscutive match length
maxMatch <- r$length[maxPos];
## to remove FALSE-POSITIVES (e.g. CCC and kcn) find
## largest substring
start <- sum(r$length[0:(maxPos-1)]) + 1;
stop <- start + r$length[maxPos] - 1;
maxSubStr <- substr(lowerNames1[i], start, stop);
## all matching criteria fulfilled
isConsecutiveMatch <- maxMatch >= minMatch &&
grepl(pattern=maxSubStr, x=lowerNames2[j], fixed=TRUE) &&
nchar(maxSubStr) > 0;
if (isConsecutiveMatch) {
## merging
mergedDf[i, "matches"] <- maxMatch
mergedDf[i, "name2"] <- as.character(df2[j, 1]);
mergedDf[i, "number2"] <- df2[j, 2];
## don't append this row to mergedDf because already merged
toMerge[j] <- NA;
## stop inner for loop here to avoid possible second match
break;
}
}
}
}
## append not matched rows to mergedDf
toMerge <- which(df2[, 1] == toMerge);
df2 <- data.frame(name1=NA, name2=as.character(df2[toMerge, 1]),
number1=NA, number2=df2[toMerge, 2], matches=0,
stringsAsFactors=FALSE);
mergedDf <- rbind(mergedDf, df2);
return (mergedDf);
}
Output:
> uglyMerge(datf1, datf2)
name1 name2 number1 number2 matches
1 xxregular reGulr 1 2 5
2 kklmin <NA> 8 NA 0
3 notSo <NA> 9 NA 0
4 Jijoh Jijoh 2 12 5
5 Kish <NA> 18 NA 0
6 Lissp LiSsp 25 20 5
7 Kcn KcN 33 18 3
8 CCCa <NA> 8 NA 0
9 <NA> ntSo NA 8 0
10 <NA> sean NA 13 0
11 <NA> CaPN NA 13 0
#2
5
agrep
will get you started.
agrep会让你入门。
something like:
lapply(tolower(datf1$name), function(x) agrep(x, tolower(datf2$name)))
then you can adjust the max.distance
parameter until you get the appropriate amount of matching. then merge however you like.
然后你可以调整max.distance参数,直到你得到适当的匹配量。然后合并,但你喜欢。
#1
7
Maybe there is a simple solution but I can't find any.
IMHO you have to implement this kind of merging for your own.
Please find an ugly example below (there is a lot of space for improvements):
也许有一个简单的解决方案,但我找不到任何。恕我直言,你必须为自己实现这种合并。请在下面找到一个丑陋的例子(有很多改进空间):
uglyMerge <- function(df1, df2) {
## lower all strings to allow case-insensitive comparison
lowerNames1 <- tolower(df1[, 1]);
lowerNames2 <- tolower(df2[, 1]);
## split strings into single characters
names1 <- strsplit(lowerNames1, "");
names2 <- strsplit(lowerNames2, "");
## create the final dataframe
mergedDf <- data.frame(name1=as.character(df1[,1]), name2=NA,
number1=df1[,2], number2=NA, matches=0,
stringsAsFactors=FALSE);
## store names of dataframe2 (to remember which strings have no match)
toMerge <- df2[, 1];
for (i in seq(along=names1)) {
for (j in seq(along=names2)) {
## set minimal match to 4 or to string length
minMatch <- min(4, length(names2[[j]]));
## find single matches
matches <- names1[[i]] %in% names2[[j]];
## look for consecutive matches
r <- rle(matches);
## any matches found?
if (any(r$values)) {
## find max consecutive match
possibleMatch <- r$value == TRUE;
maxPos <- which(which.max(r$length[possibleMatch]) & possibleMatch)[1];
## store max conscutive match length
maxMatch <- r$length[maxPos];
## to remove FALSE-POSITIVES (e.g. CCC and kcn) find
## largest substring
start <- sum(r$length[0:(maxPos-1)]) + 1;
stop <- start + r$length[maxPos] - 1;
maxSubStr <- substr(lowerNames1[i], start, stop);
## all matching criteria fulfilled
isConsecutiveMatch <- maxMatch >= minMatch &&
grepl(pattern=maxSubStr, x=lowerNames2[j], fixed=TRUE) &&
nchar(maxSubStr) > 0;
if (isConsecutiveMatch) {
## merging
mergedDf[i, "matches"] <- maxMatch
mergedDf[i, "name2"] <- as.character(df2[j, 1]);
mergedDf[i, "number2"] <- df2[j, 2];
## don't append this row to mergedDf because already merged
toMerge[j] <- NA;
## stop inner for loop here to avoid possible second match
break;
}
}
}
}
## append not matched rows to mergedDf
toMerge <- which(df2[, 1] == toMerge);
df2 <- data.frame(name1=NA, name2=as.character(df2[toMerge, 1]),
number1=NA, number2=df2[toMerge, 2], matches=0,
stringsAsFactors=FALSE);
mergedDf <- rbind(mergedDf, df2);
return (mergedDf);
}
Output:
> uglyMerge(datf1, datf2)
name1 name2 number1 number2 matches
1 xxregular reGulr 1 2 5
2 kklmin <NA> 8 NA 0
3 notSo <NA> 9 NA 0
4 Jijoh Jijoh 2 12 5
5 Kish <NA> 18 NA 0
6 Lissp LiSsp 25 20 5
7 Kcn KcN 33 18 3
8 CCCa <NA> 8 NA 0
9 <NA> ntSo NA 8 0
10 <NA> sean NA 13 0
11 <NA> CaPN NA 13 0
#2
5
agrep
will get you started.
agrep会让你入门。
something like:
lapply(tolower(datf1$name), function(x) agrep(x, tolower(datf2$name)))
then you can adjust the max.distance
parameter until you get the appropriate amount of matching. then merge however you like.
然后你可以调整max.distance参数,直到你得到适当的匹配量。然后合并,但你喜欢。