How can I join tables from different SQL databases using R and dplyr?

Time: 2021-11-16 09:45:53

I'm using dplyr (0.7.0), dbplyr (1.0.0), DBI (0.6-1), and odbc (1.0.1.9000). I would like to do something like the following:

db1 <- DBI::dbConnect(
  odbc::odbc(),
  Driver = "SQL Server",
  Server = "MyServer",
  Database = "DB1"
)
db2 <- DBI::dbConnect(
  odbc::odbc(),
  Driver = "SQL Server",
  Server = "MyServer",
  Database = "DB2"
)
x <- tbl(db1, "Table1") %>%
  dplyr::left_join(tbl(db2, "Table2"), by = "JoinColumn") 

but I keep getting an error that doesn't really seem to have any substance to it. When I use show_query it seems like the code is trying to create a SQL query that joins the two tables without taking the separate databases into account. Per the documentation for dplyr::left_join I've also tried:

x <- tbl(db1, "Table1") %>%
      dplyr::left_join(tbl(db2, "Table2"), by = "JoinColumn", copy = TRUE) 

But there is no change in the output or error message. Is there a different way to join tables from separate databases on the same server?

3 solutions

#1


1  

I'm assuming from the code you provided that (a) you're interested in joining the two tbl objects via dplyr's syntax before you run collect() and pull the results into local memory and that (b) you want to refer directly to the database objects in the call to tbl().

These choices are important if you want to leverage dplyr to programmatically build your query logic while simultaneously leveraging the database server to INNER JOIN large volumes of data down to the set that you're interested in. (Or at least that's why I ended up here.)

The solution I found uses one connection without specifying the database, and spells out the database and schema information using in_schema() (I couldn't find this documented or vignetted anywhere):

conn <- DBI::dbConnect(
  odbc::odbc(),
  Driver = "SQL Server",
  Server = "MyServer"
)

x <- tbl(src_dbi(conn),
         in_schema("DB1.dbo", "Table1")) %>%
  dplyr::left_join(tbl(src_dbi(conn),
                       in_schema("DB2.dbo", "Table2")),
                   by = "JoinColumn")
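
The single-connection idea can be tried without a SQL Server instance. The sketch below is only a stand-in under stated assumptions: it uses an in-memory SQLite database (through the RSQLite package) with a second database attached under the name db2, so the names main and db2 and the sample rows are hypothetical. It illustrates that both tbl() calls share one connection, so dbplyr can generate a single join that runs on the database side.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# One connection; a second in-memory database is attached as "db2".
# (SQLite stand-in for two databases on the same server.)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "ATTACH DATABASE ':memory:' AS db2")

# Hypothetical sample tables, one per database.
dbExecute(con, "CREATE TABLE main.Table1 (JoinColumn INTEGER, a TEXT)")
dbExecute(con, "INSERT INTO main.Table1 VALUES (1, 'x'), (2, 'y'), (3, 'z')")
dbExecute(con, "CREATE TABLE db2.Table2 (JoinColumn INTEGER, b REAL)")
dbExecute(con, "INSERT INTO db2.Table2 VALUES (1, 10), (3, 30)")

# Both tables are referenced through the same connection with qualified names.
joined <- tbl(con, in_schema("main", "Table1")) %>%
  left_join(tbl(con, in_schema("db2", "Table2")), by = "JoinColumn")

show_query(joined)          # inspect the SQL before running it
result <- collect(joined)   # pull the joined rows into local memory
dbDisconnect(con)
```

Calling show_query() before collect() is a cheap way to confirm that the generated SQL names both databases.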

#2


0  

I would use the merge() function to perform the left join on the tables once they are loaded as data frames. It would be something like x <- merge(df1, df2, by = "JoinColumn", all.x = TRUE).
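
merge() works on data frames in local memory, so this approach assumes both tables have already been pulled into R (for example with collect()). A minimal sketch with made-up data; df1, df2, and their contents are hypothetical:

```r
# Hypothetical local data frames standing in for the collected tables.
df1 <- data.frame(JoinColumn = c(1, 2, 3), a = c("x", "y", "z"))
df2 <- data.frame(JoinColumn = c(1, 3), b = c(10, 30))

# all.x = TRUE keeps every row of df1, which makes this a left join.
x <- merge(df1, df2, by = "JoinColumn", all.x = TRUE)
print(x)
```

Because the join happens in R, every row of both tables has to be transferred first, which may be impractical for large tables.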

#3


0  

I faced the same problem and I wasn't able to solve it with dplyr::left_join.

At least I was able to do the job using the following workaround. I connected to SQL Server without declaring a default database, then I ran the query with sql().

con <- DBI::dbConnect(odbc::odbc(), dsn = "DWH", uid = "", pwd = "")

data_db <- tbl(con, sql("SELECT *
                    FROM DB1..Table1 AS a
                    LEFT JOIN DB2..Table2 AS b ON a.JoinColumn = b.JoinColumn"))

data_db %>% ...
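
To show how dplyr verbs can continue from a hand-written query, here is a small sketch that runs without SQL Server. It is an assumption-laden stand-in: an in-memory SQLite database (RSQLite) with a second database attached as db2 plays the role of DB1 and DB2, and the sample rows are hypothetical. The columns are listed explicitly instead of using SELECT * so that JoinColumn appears only once in the result.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# SQLite stand-in: "main" and the attached "db2" play the two databases.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "ATTACH DATABASE ':memory:' AS db2")
dbExecute(con, "CREATE TABLE main.Table1 (JoinColumn INTEGER, a TEXT)")
dbExecute(con, "INSERT INTO main.Table1 VALUES (1, 'x'), (2, 'y'), (3, 'z')")
dbExecute(con, "CREATE TABLE db2.Table2 (JoinColumn INTEGER, b REAL)")
dbExecute(con, "INSERT INTO db2.Table2 VALUES (1, 10), (3, 30)")

# The cross-database join is written by hand and wrapped as a lazy table.
data_db <- tbl(con, sql("
  SELECT a.JoinColumn, a.a, b.b
  FROM main.Table1 AS a
  LEFT JOIN db2.Table2 AS b ON a.JoinColumn = b.JoinColumn
"))

# Further dplyr verbs are translated to SQL on top of the custom query.
matched <- data_db %>%
  filter(!is.na(b)) %>%
  collect()
dbDisconnect(con)
```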

Hope it helps.
