Merging data with partial matches in R

Time: 2021-05-09 18:57:20

I have two datasets:

datf1 <- data.frame (name = c("regular", "kklmin", "notSo", "Jijoh",
 "Kish", "Lissp", "Kcn", "CCCa"),
 number1 = c(1, 8, 9,  2,  18, 25, 33,   8))
#-----------
    name number1
1 regular       1
2  kklmin       8
3   notSo       9
4   Jijoh       2
5    Kish      18
6   Lissp      25
7     Kcn      33
8    CCCa       8

 datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp",
 "KcN", "CaPN"),
   number2 = c(2, 8, 12,    13, 20, 18,   13))
#-------------
   name number2
1 reGulr       2
2   ntSo       8
3  Jijoh      12
4   sean      13
5  LiSsp      20
6    KcN      18
7   CaPN      13

I want to merge them by the name column, but with partial matches allowed, so that spelling errors in a large data set do not block the merge and, ideally, so that such errors can be detected. For example:

(1) If four consecutive letters (or all letters, if a name has fewer than four) match at any position, that counts as a match:

 ABBCD = BBCDK = aBBCD = ramABBBCD = ABB 

(2) Matching is case-insensitive, e.g. ABBCD = aBbCd

(3) The new dataset should preserve both names (the names from datf1 and from datf2), so that later we can detect whether a match is perfect (maybe with a separate column holding how many letters match).

Is such a merge possible?

Edits:

datf1 <- data.frame (name = c("xxregular", "kklmin", "notSo", "Jijoh",
             "Kish", "Lissp", "Kcn", "CCCa"),
                     number1 = c(1, 8, 9,  2,  18, 25, 33,   8))
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", 
             "LiSsp", "KcN", "CaPN"),
                     number2 = c(2, 8, 12,  13, 20, 18,   13))


uglyMerge(datf1, datf2)
       name1  name2 number1 number2 matches
1  xxregular   <NA>       1      NA       0
2     kklmin   <NA>       8      NA       0
3      notSo   <NA>       9      NA       0
4      Jijoh  Jijoh       2      12       5
5       Kish   <NA>      18      NA       0
6      Lissp  LiSsp      25      20       5
7        Kcn    KcN      33      18       3
8       CCCa   <NA>       8      NA       0
9       <NA> reGulr      NA       2       0
10      <NA>   ntSo      NA       8       0
11      <NA>   sean      NA      13       0
12      <NA>   CaPN      NA      13       0

2 solutions

#1


7  

Maybe there is a simple solution, but I can't find one.
IMHO you have to implement this kind of merging yourself.
Please find an ugly example below (there is a lot of room for improvement):

uglyMerge <- function(df1, df2) {

    ## lower all strings to allow case-insensitive comparison
    lowerNames1 <- tolower(df1[, 1]);
    lowerNames2 <- tolower(df2[, 1]);

    ## split strings into single characters
    names1 <- strsplit(lowerNames1, "");
    names2 <- strsplit(lowerNames2, "");

    ## create the final dataframe
    mergedDf <- data.frame(name1=as.character(df1[,1]), name2=NA, 
                        number1=df1[,2], number2=NA, matches=0,
                        stringsAsFactors=FALSE);

    ## store names of dataframe2 (to remember which strings have no match)
    toMerge <- df2[, 1];

    for (i in seq(along=names1)) {
        for (j in seq(along=names2)) {
            ## set minimal match to 4 or to string length
            minMatch <- min(4, length(names2[[j]]));

            ## find single matches
            matches <- names1[[i]] %in% names2[[j]];

            ## look for consecutive matches
            r <- rle(matches);

            ## any matches found?
            if (any(r$values)) {
                ## find the longest run of consecutive matches
                ## (index into r of the longest TRUE run)
                possibleMatch <- r$values;
                maxPos <- which(possibleMatch)[which.max(r$lengths[possibleMatch])];

                ## store max consecutive match length
                maxMatch <- r$lengths[maxPos];

                ## to remove FALSE-POSITIVES (e.g. CCC and kcn) find 
                ## largest substring
                start <- sum(r$length[0:(maxPos-1)]) + 1;
                stop <- start + r$length[maxPos] - 1;
                maxSubStr <- substr(lowerNames1[i], start, stop);

                ## all matching criteria fulfilled
                isConsecutiveMatch <- maxMatch >= minMatch &&
                                    grepl(pattern=maxSubStr, x=lowerNames2[j], fixed=TRUE) &&
                                    nchar(maxSubStr) > 0;

                if (isConsecutiveMatch) {
                    ## merging
                    mergedDf[i, "matches"] <- maxMatch
                    mergedDf[i, "name2"] <- as.character(df2[j, 1]);
                    mergedDf[i, "number2"] <- df2[j, 2];

                    ## don't append this row to mergedDf because already merged
                    toMerge[j] <- NA;

                    ## stop inner for loop here to avoid possible second match
                    break;
                }
            }
        } 
    }

    ## append not matched rows to mergedDf
    toMerge <- which(df2[, 1] == toMerge);
    df2 <- data.frame(name1=NA, name2=as.character(df2[toMerge, 1]), 
                    number1=NA, number2=df2[toMerge, 2], matches=0, 
                    stringsAsFactors=FALSE);
    mergedDf <- rbind(mergedDf, df2);

    return (mergedDf);
}
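
To make the run-length step in the function easier to follow, here is a small illustration using the first pair of names. It only uses base R (`strsplit`, `%in%`, `rle`, `grepl`); the variable names are chosen for this sketch and do not appear in the function.

```r
## Characters of the two lower-cased names
chars1 <- strsplit("xxregular", "")[[1]]
chars2 <- strsplit("regulr", "")[[1]]

## TRUE where a character of name 1 occurs anywhere in name 2
hit <- chars1 %in% chars2

## Run-length encoding of the hits
r <- rle(hit)
r$lengths  # 2 5 1 1
r$values   # FALSE TRUE FALSE TRUE

## %in% only tests membership, not order: every letter of "notso"
## occurs in "ntso", yet "notso" is not a substring of "ntso".
allLettersFound <- all(strsplit("notso", "")[[1]] %in% strsplit("ntso", "")[[1]])
isSubstring <- grepl("notso", "ntso", fixed = TRUE)
```

The longest TRUE run here has length 5 and covers "regul". The second part shows why the function also needs the `grepl` check: the run of hits alone would accept "notSo" against "ntSo".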

Output:

> uglyMerge(datf1, datf2)
       name1  name2 number1 number2 matches
1  xxregular reGulr       1       2       5
2     kklmin   <NA>       8      NA       0
3      notSo   <NA>       9      NA       0
4      Jijoh  Jijoh       2      12       5
5       Kish   <NA>      18      NA       0
6      Lissp  LiSsp      25      20       5
7        Kcn    KcN      33      18       3
8       CCCa   <NA>       8      NA       0
9       <NA>   ntSo      NA       8       0
10      <NA>   sean      NA      13       0
11      <NA>   CaPN      NA      13       0
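
One possible improvement, sketched here under assumptions: instead of counting runs of shared characters, compute the length of the longest common substring directly. The helper names `lcsLength` and `isPartialMatch` are hypothetical and not part of the answer above, and the threshold is assumed to be four letters or the length of the shorter name, which is one reading of rule (1) in the question. Inputs are assumed to be single character strings.

```r
## Hypothetical helper: length of the longest common substring of two
## strings, compared case-insensitively (simple dynamic programming).
lcsLength <- function(a, b) {
    x <- strsplit(tolower(a), "")[[1]]
    y <- strsplit(tolower(b), "")[[1]]
    if (length(x) == 0 || length(y) == 0) return(0L)

    best <- 0L
    ## prev[j + 1]: length of the common run ending at x[i - 1] and y[j]
    prev <- integer(length(y) + 1)
    for (i in seq_along(x)) {
        cur <- integer(length(y) + 1)
        for (j in seq_along(y)) {
            if (x[i] == y[j]) {
                cur[j + 1] <- prev[j] + 1L
                best <- max(best, cur[j + 1])
            }
        }
        prev <- cur
    }
    best
}

## Assumed reading of rule (1): at least four consecutive letters agree,
## or all letters of the shorter name when it has fewer than four.
isPartialMatch <- function(a, b) {
    shortest <- min(nchar(a), nchar(b))
    shortest > 0 && lcsLength(a, b) >= min(4, shortest)
}

lcsLength("xxregular", "reGulr")     # 5 ("regul")
isPartialMatch("notSo", "ntSo")      # FALSE (only "tso" is shared)
isPartialMatch("ABBCD", "ramABBBCD") # TRUE ("bbcd")
```

The value returned by `lcsLength` could also fill the matches column. The double loop is quadratic in the name lengths, which should be fine for short names but may be slow when comparing every pair in two large data sets.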

#2


5  

agrep will get you started. Something like:

lapply(tolower(datf1$name), function(x) agrep(x, tolower(datf2$name)))

Then you can adjust the max.distance parameter until you get the appropriate amount of matching, and merge however you like.
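
As a rough sketch of how the agrep hits might be turned into a merged table: the value `max.distance = 0.25` is an arbitrary starting point chosen for illustration, not a recommendation, and the names `hits`, `pairs` and `candidates` are made up for this example. Note that agrep matches by edit distance, which is not the same as the "four consecutive letters" rule from the question, so the candidate pairs still need to be reviewed.

```r
datf1 <- data.frame(name = c("xxregular", "kklmin", "notSo", "Jijoh",
                             "Kish", "Lissp", "Kcn", "CCCa"),
                    number1 = c(1, 8, 9, 2, 18, 25, 33, 8),
                    stringsAsFactors = FALSE)
datf2 <- data.frame(name = c("reGulr", "ntSo", "Jijoh", "sean",
                             "LiSsp", "KcN", "CaPN"),
                    number2 = c(2, 8, 12, 13, 20, 18, 13),
                    stringsAsFactors = FALSE)

## For each name in datf1, indices of approximately matching names in datf2.
## max.distance = 0.25 is an assumed value; tune it for your data.
hits <- lapply(tolower(datf1$name), function(x)
    agrep(x, tolower(datf2$name), max.distance = 0.25))

## One row per candidate pair (i = row in datf1, j = row in datf2)
pairs <- data.frame(i = rep(seq_along(hits), lengths(hits)),
                    j = as.integer(unlist(hits)))

## Keep both names so that imperfect matches can be inspected afterwards
candidates <- data.frame(name1   = datf1$name[pairs$i],
                         name2   = datf2$name[pairs$j],
                         number1 = datf1$number1[pairs$i],
                         number2 = datf2$number2[pairs$j],
                         stringsAsFactors = FALSE)
```

A name in datf1 can appear in several rows of `candidates` if agrep returns more than one hit for it, so a final step that picks the best candidate per name may be needed.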
