Merging on multiple columns results in a strange ordering

Time: 2022-07-31 04:05:49

When two data frames are merged by a numerical column, the result is (by default) ordered by that column as a number. However, if two numerical columns are passed as the by argument, the ordering is different (in fact, it seems as if the numerical columns are converted to strings and sorted as such). Is this expected, or a bug?

For example, consider the following two data frames:

A <- data.frame(a = 1:12, b = 1, x = runif(12))
B <- data.frame(a = 1:12, b = 1, y = runif(12))

Then merge(A, B, by = 'a') results in a data frame whose column a has the values 1, 2, ..., 9, 10, 11, 12 (i.e., the expected numerical ordering). However, merge(A, B, by = c('a', 'b')) results in a data frame whose column a has the values 1, 10, 11, 12, 2, 3, ..., 8, 9 (i.e., the same ordering as sort(as.character(1:12))).
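
To check this on your own machine, here is a small sketch that runs both merges and prints the order of column a next to the string ordering mentioned above. The seed and the variable names (one_key, two_keys, string_order) are my own additions, and method = "radix" is used only so that the reference string sort does not depend on the locale; what the two merges actually print may depend on your R version and locale.

```r
set.seed(1)  # arbitrary seed, only so the example is repeatable
A <- data.frame(a = 1:12, b = 1, x = runif(12))
B <- data.frame(a = 1:12, b = 1, y = runif(12))

one_key  <- merge(A, B, by = "a")           # merge on a single numeric column
two_keys <- merge(A, B, by = c("a", "b"))   # merge on two numeric columns

# Row order of column a for each merge
print(one_key$a)
print(two_keys$a)

# Both merges contain the same keys; only the row order can differ
same_rows <- setequal(one_key$a, two_keys$a)

# Reference: 1:12 sorted as strings rather than as numbers
string_order <- sort(as.character(1:12), method = "radix")
print(string_order)
```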

2 solutions

#1


I guess this is a feature of merge rather than a bug.

Inspection of the source code of merge showed that in the case when multiple columns are used for merging, the 'key' columns are internally combined into a vector by using paste().

For example, columns a and b from your data frame A will be represented by the strings "1\r1" "2\r1" "3\r1" "4\r1" "5\r1" "6\r1" "7\r1" "8\r1" "9\r1" "10\r1" "11\r1" "12\r1".

merge uses this string to sort the resulting data frame, and that is how it ends up with the alphabetical ordering.
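
The effect can be reproduced outside of merge. The following is only a sketch of the idea, not the exact call made inside merge: it builds the same kind of pasted keys and orders them as strings. The names a, b, keys and ord are mine, and method = "radix" is my choice so that the comparison is done byte by byte regardless of locale.

```r
a <- 1:12
b <- rep(1, 12)

# Combine the two key columns into one character vector, as paste() does
keys <- paste(a, b, sep = "\r")

# Order the combined keys as strings (byte-wise, independent of locale)
ord <- order(keys, method = "radix")

# Column a in the order implied by the string keys
a[ord]
# 1 10 11 12 2 3 4 5 6 7 8 9
```

Since "\r" compares lower than any digit, "1\r1" comes before "10\r1", which in turn comes before "2\r1".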

In the case when you merge only by one column, there is no need for using paste, and therefore sorting is performed by using the original type of the column.

Here is the relevant piece of the source code of merge (the full text can be obtained by running merge.data.frame without parentheses in the R console):

    if (l.b == 1L) {
        bx <- x[, by.x]
        if (is.factor(bx)) 
            bx <- as.character(bx)
        by <- y[, by.y]
        if (is.factor(by)) 
            by <- as.character(by)
    }
    else {
        if (!is.null(incomparables)) 
            stop("'incomparables' is supported only for merging on a single column")
        bx <- x[, by.x, drop = FALSE]
        by <- y[, by.y, drop = FALSE]
        names(bx) <- names(by) <- paste0("V", seq_len(ncol(bx)))
        bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
        bx <- bz[seq_len(nx)]
        by <- bz[nx + seq_len(ny)]
    }
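
If you need the numerical ordering after merging on several columns, one possible workaround (my suggestion, not something merge does for you) is to re-sort the result on the original key columns, which still have their numeric type. The name res is just for illustration.

```r
A <- data.frame(a = 1:12, b = 1, x = runif(12))
B <- data.frame(a = 1:12, b = 1, y = runif(12))

res <- merge(A, B, by = c("a", "b"))

# Re-sort on the key columns themselves, which are still numeric
res <- res[order(res$a, res$b), ]
rownames(res) <- NULL  # optional: renumber the rows after reordering

res$a
```

After this step, column a is in increasing numerical order whatever order merge returned.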

#2


Using the dplyr package, we can get the following result:

library(dplyr)

full_join(A, B, by=c("a", "b"))

        a b          x           y
    1   1 1 0.39907404 0.700782559
    2   2 1 0.84429488 0.600727090
    3   3 1 0.32232471 0.141495156
    4   4 1 0.74214210 0.262601640
    5   5 1 0.92944116 0.779255689
    6   6 1 0.10902661 0.001185645
    7   7 1 0.46336478 0.961711785
    8   8 1 0.58396008 0.211824751
    9   9 1 0.63126074 0.422233784
    10 10 1 0.09995935 0.179069642
    11 11 1 0.40832159 0.581116173
    12 12 1 0.48440814 0.004372634
