Joining tables from two Oracle databases in SAS

I am joining two tables that are located in two separate Oracle databases.

I am currently doing this in SAS by creating a libname connection to each database and then simply using something like the code below.

libname dbase_a oracle user= etc... ;
libname dbase_b oracle user= etc... ;

proc sql;
create table t1 as 

select a.*, b.*
from dbase_a.table1 a inner join dbase_b.table2 b
on a.id = b.id;
quit;

However, the query is painfully slow. Can you suggest any better options to speed up such a query (short of going down the path of creating a database link)?

Many thanks for looking at this.

2 Solutions

#1

If those two databases are on the same server and you are able to execute cross-database queries in Oracle, you could try using SQL pass-through:

proc sql;
connect to oracle (user= password= <...>);
create table t1 as
select * from connection to oracle (
  select a.*, b.*
  from dbase_a.schema_a.table1 a
  inner join dbase_b.schema_b.table2 b
    on a.id = b.id
);
disconnect from oracle;
quit;

I think that, in most cases, SAS tries as far as possible to have the query executed on the database server, even if pass-through was not explicitly specified. However, when the query references tables on different servers, or in different databases on a system that does not allow cross-database queries, or when it contains SAS-specific functions that SAS cannot translate into something valid for the DBMS, SAS will resort to 'downloading' the complete tables and processing the query locally, which can be painfully inefficient.
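
One way to check which of these cases applies is to have SAS log the SQL it actually hands to Oracle. The sketch below assumes the dbase_a and dbase_b librefs from the question are already assigned; it only turns on tracing around the original query and does not change how the join is executed.

```sas
/* Write the SQL that SAS/ACCESS sends to the DBMS into the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc sql;
    create table t1 as
    select a.*, b.*
    from dbase_a.table1 a inner join dbase_b.table2 b
    on a.id = b.id;
quit;

/* Switch tracing off again once the log has been inspected */
options sastrace=off;
```

If the log shows a separate plain SELECT against each table rather than a single statement containing the join, the join is most likely being performed locally by SAS.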

#2

The select requests all columns from each table, while the inner join is on the id values only. Because the join criteria are evaluated on data coming from disparate sources, the baggage of all columns could be a big factor in the timing: even non-matching rows must be downloaded (by the libname engine, within the SQL execution context) for the ON evaluation.

One approach would be to:

Select only the id from each table
Find the intersection
Upload the intersection to each server (as a scratch table)
Use the intersection on each server as pass-through selection criteria within the final join in SAS

There are a couple of variations, depending on the expected number of id matches, the number of different ids in each table, or knowing which of table-1 and table-2 is SMALL and which is BIG. For a large number of id matches that need to be transferred back to a server, you will probably want to use some form of bulk copy. For a relatively small number of ids in the intersection, you might get away with enumerating them directly in a SQL statement using the construct IN (). The size of a SQL statement could be limited by the database, the SAS/ACCESS to ORACLE engine, or the SAS macro system.
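
A minimal sketch of the IN () variation, assuming a SAS table INTERSECTION of matching ids and a libref SOURCE1 already exist (both names are borrowed from the code further down), that id is numeric, and that the list is short. The macro variable id_list and the output table t1_matched are hypothetical names used only for illustration.

```sas
/* Collect the matching ids into one comma-separated macro variable. */
/* Assumption: id is numeric, so the values need no quoting.         */
proc sql noprint;
    select distinct id into :id_list separated by ','
    from INTERSECTION;
quit;

/* Let Oracle filter table1 with the enumerated list via pass-through */
proc sql;
    connect using SOURCE1 as PASS1;
    create table t1_matched as
    select * from connection to PASS1 (
        select * from table1
        where id in (&id_list)
    );
    disconnect from PASS1;
quit;
```

Two of the limits mentioned above apply directly here: Oracle rejects an IN list with more than 1000 expressions, and a SAS macro variable can hold at most 65,534 characters, so this only suits small intersections.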

Consider a data scenario in which it has been determined that the potential number of matching ids would be too large for an in (id-1,...id-n) construct. In such a case the list of matching ids is handled in tabular form:

libname SOURCE1 ORACLE ....;
libname SOURCE2 ORACLE ....;

libname SCRATCH1 ORACLE ... must specify a scratch schema ...;
libname SCRATCH2 ORACLE ... must specify a scratch schema ...;

proc sql;
    connect using SOURCE1 as PASS1;
    connect using SOURCE2 as PASS2;

    * compute intersection from only id data sent to SAS;
    create table INTERSECTION as
    (select id from connection to PASS1 (select id from table1))
    intersect
    (select id from connection to PASS2 (select id from table2))
    ;

    * upload intersection to each server;
    create table SCRATCH1.ids as select id from INTERSECTION;
    create table SCRATCH2.ids as select id from INTERSECTION;

    * compute inner join from only data that matches intersection;
    create table INNERJOIN as select ONE.*, TWO.* from
    (select * from connection to PASS1 (
        select * from oracle-path-to-schema.table1 
        where id in (select id from oracle-path-to-scratch.ids)
    )) as ONE
    inner join
    (select * from connection to PASS2 (
        select * from oracle-path-to-schema.table2
        where id in (select id from oracle-path-to-scratch.ids)
    )) as TWO
    on ONE.id = TWO.id;
    ...
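
When the intersection is large, the upload step is a natural place for the bulk copy mentioned earlier. The sketch below is one possible variant, assuming the SCRATCH1 and SCRATCH2 librefs and the INTERSECTION table from the code above, and assuming the Oracle bulk-load utility (SQL*Loader) is available on the machine running SAS. It would replace the two plain create table statements rather than run in addition to them.

```sas
proc sql;
    /* BULKLOAD=YES asks SAS/ACCESS to load through the Oracle */
    /* bulk loader instead of issuing row-by-row inserts       */
    create table SCRATCH1.ids (bulkload=yes) as
        select id from INTERSECTION;
    create table SCRATCH2.ids (bulkload=yes) as
        select id from INTERSECTION;
quit;
```

Whether this is faster in practice depends on the row count and the network path to each server, so it is worth timing against the plain upload.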

If both table-1 and table-2 have so many ids that they exceed the resource capacity of your SAS platform, you will also have to iterate the approach over ranges of ids. Techniques for determining the range criteria for each iteration are a tale for another day.
