How do I do a cross join in R?

How can I do a cross join in R? I know that merge can do inner and outer joins, but I do not know how to get a cross join.

Thanks

6 solutions

#1

Is it just all=TRUE?

x<-data.frame(id1=c("a","b","c"),vals1=1:3)
y<-data.frame(id2=c("d","e","f"),vals2=4:6)
merge(x,y,all=TRUE)

From the documentation of merge:

If by or both by.x and by.y are of length 0 (a length zero vector or NULL), the result, r, is the Cartesian product of x and y, i.e., dim(r) = c(nrow(x)*nrow(y), ncol(x) + ncol(y)).
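The all=TRUE call above gives a cross join only because x and y share no column names, so the default by (the intersection of the two sets of names) has length zero. A minimal sketch that makes this explicit with by = NULL and checks the dimension formula from the documentation, reusing the same example frames:

```r
x <- data.frame(id1 = c("a", "b", "c"), vals1 = 1:3)
y <- data.frame(id2 = c("d", "e", "f"), vals2 = 4:6)

# by = NULL requests the Cartesian product regardless of column names
r <- merge(x, y, by = NULL)

# Check the documented shape: nrow(x) * nrow(y) rows, ncol(x) + ncol(y) columns
dim_ok <- identical(dim(r), c(nrow(x) * nrow(y), ncol(x) + ncol(y)))
print(dim(r))
print(dim_ok)
```

Passing by = NULL is the safer spelling when the two frames might share a column name, because all=TRUE would then fall back to a full outer join on that column.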

#2

If speed is an issue, I suggest checking out the excellent data.table package. In the example at the end it's ~90x faster than merge.

You didn't provide example data. If you just want to get all combinations of two (or more individual) columns, you can use CJ (cross join):

library(data.table)
CJ(x=1:2,y=letters[1:3])
#   x y
#1: 1 a
#2: 1 b
#3: 1 c
#4: 2 a
#5: 2 b
#6: 2 c

If you want to do a cross join on two tables, I haven't found a way to use CJ(). But you can still use data.table:

x2<-data.table(id1=letters[1:3],vals1=1:3)
y2<-data.table(id2=letters[4:7],vals2=4:7)

res<-setkey(x2[,c(k=1,.SD)],k)[y2[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
res
#    id1 vals1 id2 vals2
# 1:   a     1   d     4
# 2:   b     2   d     4
# 3:   c     3   d     4
# 4:   a     1   e     5
# 5:   b     2   e     5
# 6:   c     3   e     5
# 7:   a     1   f     6
# 8:   b     2   f     6
# 9:   c     3   f     6
#10:   a     1   g     7
#11:   b     2   g     7
#12:   c     3   g     7

Explanation of the res line:

Basically, you add a dummy column (k in this example) to one table and set it as the key (setkey(tablename,keycolumns)), add the same dummy column to the other table, and then join them.
The data.table structure uses column positions and not names in the join, so the dummy column has to come first. The c(k=1,.SD) part is one way I have found to add a column at the beginning (by default, new columns are added at the end).
A standard data.table join has the form X[Y]. Here X is setkey(x2[,c(k=1,.SD)],k) and Y is y2[,c(k=1,.SD)].
allow.cartesian=TRUE tells data.table to allow the duplicate key values and perform a Cartesian join (earlier versions didn't require this).
The [,k:=NULL] at the end just removes the dummy key from the result.
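The dummy-key idea is not specific to data.table. As a rough illustration only (this is a base R sketch of the same trick, not the data.table code above), a constant column k is added to both frames, they are merged on it, and k is dropped afterwards:

```r
x <- data.frame(id1 = letters[1:3], vals1 = 1:3)
y <- data.frame(id2 = letters[4:7], vals2 = 4:7)

# Add the same constant key to both sides so every row matches every row
x$k <- 1
y$k <- 1

# An ordinary join on the dummy key yields the Cartesian product
res_base <- merge(x, y, by = "k")

# Drop the dummy key from the result
res_base$k <- NULL
print(dim(res_base))
```

In base R this is only a demonstration of the mechanism, since merge(x, y, by = NULL) does the same thing directly.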

You can also turn this into a function, so it's cleaner to use:

# Version 1; easier to write:
CJ.table.1 <- function(X,Y)
  setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]

CJ.table.1(x2,y2)
#    id1 vals1 id2 vals2
# 1:   a     1   d     4
# 2:   b     2   d     4
# 3:   c     3   d     4
# 4:   a     1   e     5
# 5:   b     2   e     5
# 6:   c     3   e     5
# 7:   a     1   f     6
# 8:   b     2   f     6
# 9:   c     3   f     6
#10:   a     1   g     7
#11:   b     2   g     7
#12:   c     3   g     7

# Version 2; faster but messier:
CJ.table.2 <- function(X,Y) {
  eval(parse(text=paste0("setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],list(",paste0(unique(c(names(X),names(Y))),collapse=","),")][,k:=NULL]")))
}

Here are some speed benchmarks:

# Create a bigger (but still very small) example:
n<-1e3
x3<-data.table(id1=1L:n,vals1=sample(letters,n,replace=TRUE))
y3<-data.table(id2=1L:n,vals2=sample(LETTERS,n,replace=TRUE))

library(microbenchmark)
microbenchmark(merge=merge.data.frame(x3,y3,all=TRUE),
               CJ.table.1=CJ.table.1(x3,y3),
               CJ.table.2=CJ.table.2(x3,y3),
               times=3, unit="s")
#Unit: seconds
#       expr        min         lq     median         uq        max neval
#      merge 4.03710225 4.23233688 4.42757152 5.57854711 6.72952271     3
# CJ.table.1 0.06227603 0.06264222 0.06300842 0.06701880 0.07102917     3
# CJ.table.2 0.04740142 0.04812997 0.04885853 0.05433146 0.05980440     3

Note that these data.table methods are much faster than the merge method suggested by @danas.zuokas. The two tables with 1,000 rows in this example result in a cross-joined table with 1 million rows. So even if your original tables are small, the result can get big quickly and speed becomes important.
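Because the result always has nrow(x) * nrow(y) rows, it can be worth estimating its size before running the join. A minimal sketch; cross_join_size is a hypothetical helper name used only for this illustration:

```r
# Hypothetical helper: expected shape of a cross join, without building it
cross_join_size <- function(x, y) {
  # as.numeric avoids integer overflow when the product exceeds ~2.1e9
  c(rows = as.numeric(nrow(x)) * as.numeric(nrow(y)),
    cols = ncol(x) + ncol(y))
}

x <- data.frame(id1 = 1:1000)
y <- data.frame(id2 = 1:1000)
size <- cross_join_size(x, y)
print(size)
```

Two 1,000-row inputs already give one million rows, which matches the benchmark setup above.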

Lastly, recent versions of data.table require you to add the allow.cartesian=TRUE (as in CJ.table.1) or specify the names of the columns that should be returned (CJ.table.2). The second method (CJ.table.2) seems to be faster, but requires some more complicated code if you want to automatically specify all the column names. And it may not work with duplicate column names. (Feel free to suggest a simpler version of CJ.table.2)

#3

If you want to do it via data.table, this is one way:

cjdt <- function(a,b){
  # seq_len() is safe for zero-row tables, unlike 1:nrow()
  cj = CJ(seq_len(nrow(a)),seq_len(nrow(b)))
  cbind(a[cj[[1]],],b[cj[[2]],])
}

A = data.table(ida = 1:10)
B = data.table(idb = 1:10)
cjdt(A,B)

That said, if you are doing many small joins and you don't need a data.table object (or the overhead of creating one), you can get a significant speed-up by writing the join in C++ with Rcpp:

// [[Rcpp::export]]
NumericMatrix crossJoin(NumericVector a, NumericVector b){
  int szA = a.size(), 
      szB = b.size();
  int i,j,r;
  NumericMatrix ret(szA*szB,2);
  for(i = 0, r = 0; i < szA; i++){
    for(j = 0; j < szB; j++, r++){
      ret(r,0) = a(i);
      ret(r,1) = b(j);
    }
  }
  return ret;
}
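For reference, the nested loop above fills the matrix with a varying slowest and b varying fastest. The same layout can be sketched in plain R with rep(); cross_join_vec is a hypothetical name, and this version is shown for clarity, not as a claim about speed:

```r
# Hypothetical base R equivalent of the crossJoin loop above
cross_join_vec <- function(a, b) {
  cbind(rep(a, each = length(b)),   # each a(i) repeated once per element of b
        rep(b, times = length(a)))  # b cycled once per element of a
}

m <- cross_join_vec(c(1, 2), c(10, 20, 30))
print(m)
```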

To compare, firstly for a large join:

C++

n = 1
a = runif(10000)
b = runif(10000)
system.time({for(i in 1:n){
  crossJoin(a,b)
}})

   user  system elapsed
  1.033   0.424   1.462

data.table

system.time({for(i in 1:n){
  CJ(a,b)
}})

   user  system elapsed
  0.602   0.569   2.452

Now for lots of little joins:

C++

n = 1e5
a = runif(10)
b = runif(10)
system.time({for(i in 1:n){
  crossJoin(a,b)
}})

   user  system elapsed
  0.660   0.077   0.739

data.table

system.time({for(i in 1:n){
  CJ(a,b)
}})

   user  system elapsed
 26.164   0.056  26.271

#4

Using sqldf:

x <- data.frame(id1 = c("a", "b", "c"), vals1 = 1:3)
y <- data.frame(id2 = c("d", "e", "f"), vals2 = 4:6) 

library(sqldf)
sqldf("SELECT * FROM x
      CROSS JOIN y")

Output:

  id1 vals1 id2 vals2
1   a     1   d     4
2   a     1   e     5
3   a     1   f     6
4   b     2   d     4
5   b     2   e     5
6   b     2   f     6
7   c     3   d     4
8   c     3   e     5
9   c     3   f     6

Just for the record, in base R we can pass by = NULL instead of all = TRUE:

merge(x, y, by= NULL)

#5

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)
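To see how the five calls differ, here is a small sketch that counts the rows each one returns. The df1 and df2 frames and their contents are made up for this illustration; only the CustomerId column name comes from the calls above:

```r
# Illustrative data: ids 2 and 3 appear in both frames
df1 <- data.frame(CustomerId = 1:3,
                  Product = c("Toaster", "Radio", "Lamp"))
df2 <- data.frame(CustomerId = c(2L, 3L, 4L),
                  State = c("Alabama", "Ohio", "Texas"))

counts <- c(
  inner = nrow(merge(df1, df2, by = "CustomerId")),
  outer = nrow(merge(df1, df2, by = "CustomerId", all = TRUE)),
  left  = nrow(merge(df1, df2, by = "CustomerId", all.x = TRUE)),
  right = nrow(merge(df1, df2, by = "CustomerId", all.y = TRUE)),
  cross = nrow(merge(df1, df2, by = NULL))
)
print(counts)
```

In the cross join both frames keep their own CustomerId column, which merge disambiguates with the .x and .y suffixes.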

#6

~~I don't know of a built-in way to do it with data.frame's but it isn't hard to make.~~

@danas showed there is an easy built-in way, but I'll leave my answer here in case it is useful for other purposes.

cross.join <- function(a, b) {
    # all pairs of row indices; the first index varies fastest
    idx <- expand.grid(seq_len(nrow(a)), seq_len(nrow(b)))
    cbind(a[idx[,1],], b[idx[,2],])
}

Here it is working on some built-in data sets:

> tmp <- cross.join(mtcars, iris)
> dim(mtcars)
[1] 32 11
> dim(iris)
[1] 150   5
> dim(tmp)
[1] 4800   16
> str(tmp)
'data.frame':   4800 obs. of  16 variables:
 $ mpg         : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl         : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp        : num  160 160 108 258 360 ...
 $ hp          : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat        : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt          : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec        : num  16.5 17 18.6 19.4 17 ...
 $ vs          : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am          : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear        : num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb        : num  4 4 1 1 2 1 4 2 2 4 ...
 $ Sepal.Length: num  5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 ...
 $ Sepal.Width : num  3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 ...
 $ Petal.Length: num  1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
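One caveat with indexing a data frame as a[i, ]: if it has a single column, the result drops to a plain vector and the column name is lost. A hedged variant that guards against this with drop = FALSE (cross_join_df is a hypothetical name for this sketch):

```r
# Hypothetical variant of cross.join that keeps single-column frames intact
cross_join_df <- function(a, b) {
  idx <- expand.grid(i = seq_len(nrow(a)), j = seq_len(nrow(b)))
  out <- cbind(a[idx$i, , drop = FALSE], b[idx$j, , drop = FALSE])
  rownames(out) <- NULL  # discard the repeated source row names
  out
}

a <- data.frame(ida = 1:2)
b <- data.frame(idb = c("x", "y", "z"))
res <- cross_join_df(a, b)
print(res)
```

Because expand.grid varies its first argument fastest, the rows of a cycle within each row of b.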
