使用data.table:输出中缺少列的非equi连接

时间:2021-03-06 22:51:58

I am doing a left non-equi join using data.table:

我正在使用data.table进行左非equi连接:

OUTPUT <- DT2[DT1, on=.(DOB, FORENAME, SURNAME, POSTCODE, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]

The OUTPUT contains a correct left join, with the exception that the MONTH column (which is present in DT1) is missing.

OUTPUT包含正确的左连接,但MONTH列(存在于DT1中)缺失。

Is this a bug in data.table?

这是data.table中的错误吗?

NB: Of course, START_DATE, EXPIRY_DATE and MONTH are in the same YYYY-MM-DD, IDate format. The results of the join are correct based on these non-equi criteria. It is just that the column is missing and I need to use it in further work.

注意:当然,START_DATE,EXPIRY_DATE和MONTH采用相同的YYYY-MM-DD,IDate格式。基于这些非等标准,连接的结果是正确的。只是列缺失了,我需要在进一步的工作中使用它。

Edit 1: Simplified reproducible example

编辑1:简化的可重复示例

DT1 <- structure(list(ID = c(1, 2, 3), FORENAME = c("JOHN", "JACK", 
"ROB"), SURNAME = c("JOHNSON", "JACKSON", "ROBINSON"), MONTH = structure(c(16953L, 
16953L, 16953L), class = c("IDate", "Date"))), .Names = c("ID", 
"FORENAME", "SURNAME", "MONTH"), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

DT2 <- structure(list(CERT_NUMBER = 999, FORENAME = "JOHN", SURNAME = "JOHNSON", 
    START_DATE = structure(16801L, class = c("IDate", "Date")), 
    EXPIRY_DATE = structure(17166L, class = c("IDate", "Date"
    ))), .Names = c("CERT_NUMBER", "FORENAME", "SURNAME", "START_DATE", 
"EXPIRY_DATE"), row.names = c(NA, -1L), class = c("data.table", 
"data.frame"))

OUTPUT <- DT2[DT1, on=.(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]

> OUTPUT
   CERT_NUMBER FORENAME  SURNAME START_DATE EXPIRY_DATE ID
1:         999     JOHN  JOHNSON 2016-06-01  2016-06-01  1
2:          NA     JACK  JACKSON 2016-06-01  2016-06-01  2
3:          NA      ROB ROBINSON 2016-06-01  2016-06-01  3
  • FORENAME and SURNAME are joined on and are present in the output.
  • FORENAME和SURNAME连接在一起并出现在输出中。

  • MONTH is also (non-equi) joined on, and is absent from the output.
  • MONTH也(非equi)加入,并且不在输出中。

Why is this expected behaviour?

为什么这是预期的行为?

Even if it is expected behaviour, it is not helpful in my case, because I need to retain MONTH for further data manipulation.

即使它是预期的行为,在我的情况下也没有用,因为我需要保留MONTH以进行进一步的数据操作。

My expected output would be the same table, but with the MONTH column retained as it is in DT1. After all, what I expect from a left join is for every row and column to be retained from the left table (DT1) and for all columns and only matched rows to be added from the right table (DT2).

我的预期输出将是同一个表,但MONTH列保留在DT1中。毕竟,我对左连接的期望是从左表(DT1)和所有列保留每个行和列,并且只从右表(DT2)添加匹配的行。

   CERT_NUMBER FORENAME  SURNAME START_DATE EXPIRY_DATE ID      MONTH
1:         999     JOHN  JOHNSON 2016-01-01  2016-12-31  1 2016-06-01
2:          NA     JACK  JACKSON       <NA>        <NA>  2 2016-06-01
3:          NA      ROB ROBINSON       <NA>        <NA>  3 2016-06-01

Edit 2: apparently in the output produced by my code, the START and END dates are also wrong! Only person 1 had a certificate with a start date on 1-Jan and end date on 31-Dec! The expected output is what it should be. But the actual output made everything 1-Jan.

编辑2:显然在我的代码产生的输出中,START和END日期也是错误的!只有1人拥有证书,开始日期为1月1日,结束日期为31日至12月!预期的产出应该是应有的。但实际产量在1月1日完成。

1 个解决方案

#1


13  

In data.table, joins of the form x[i] traditionally uses values from i but uses column names from x. Even though this is different from SQL which returns both, this default makes a lot of sense for equi joins since we are interested in all rows from i and if they match then both data.tables have equal values anyway, and if they don't we need to keep those unmatched values from i in result.

在data.table中,x [i]形式的连接传统上使用来自i的值但使用x中的列名。虽然这与返回两者的SQL不同,但是这个默认对于equi连接很有意义,因为我们对来自i的所有行感兴趣,如果它们匹配,那么两个data.tables都有相同的值,如果它们不相同我们需要在结果中保留i中那些不匹配的值。

But for non-equi joins, since the values might not match exactly, i.e., can fall within a range, there might be cases where we will have to return outputs similar to SQL (or identify such cases and return the result user expects, similar to the case of equi joins). This hasn't been done yet, but I've placed a provision for it at the moment, which is to refer to the columns explicitly with a x. prefix. It is not convenient, I agree. Hopefully this'll be taken care of automatically soon.

但是对于非equi连接,由于值可能不完全匹配,即可能落在一个范围内,可能会出现我们必须返回类似于SQL的输出(或识别此类情况并返回用户期望的结果,类似对于equi连接的情况)。这还没有完成,但我现在已经为它做了一个规定,即用x明确引用列。字首。我同意,这不方便。希望这将很快自动完成。

Here's how to get your result using x. prefix.

以下是使用x获取结果的方法。字首。

ans <- DT2[DT1, .(CERT_NUMBER, FORENAME, SURNAME, x.START_DATE, x.EXPIRY_DATE, ID, MONTH), 
         on=.(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]

IIRC there's an issue filed on the GitHub project page about this as well.

IIRC还有一个关于GitHub项目页面的问题。

#1


13  

In data.table, joins of the form x[i] traditionally uses values from i but uses column names from x. Even though this is different from SQL which returns both, this default makes a lot of sense for equi joins since we are interested in all rows from i and if they match then both data.tables have equal values anyway, and if they don't we need to keep those unmatched values from i in result.

在data.table中,x [i]形式的连接传统上使用来自i的值但使用x中的列名。虽然这与返回两者的SQL不同,但是这个默认对于equi连接很有意义,因为我们对来自i的所有行感兴趣,如果它们匹配,那么两个data.tables都有相同的值,如果它们不相同我们需要在结果中保留i中那些不匹配的值。

But for non-equi joins, since the values might not match exactly, i.e., can fall within a range, there might be cases where we will have to return outputs similar to SQL (or identify such cases and return the result user expects, similar to the case of equi joins). This hasn't been done yet, but I've placed a provision for it at the moment, which is to refer to the columns explicitly with a x. prefix. It is not convenient, I agree. Hopefully this'll be taken care of automatically soon.

但是对于非equi连接,由于值可能不完全匹配,即可能落在一个范围内,可能会出现我们必须返回类似于SQL的输出(或识别此类情况并返回用户期望的结果,类似对于equi连接的情况)。这还没有完成,但我现在已经为它做了一个规定,即用x明确引用列。字首。我同意,这不方便。希望这将很快自动完成。

Here's how to get your result using x. prefix.

以下是使用x获取结果的方法。字首。

ans <- DT2[DT1, .(CERT_NUMBER, FORENAME, SURNAME, x.START_DATE, x.EXPIRY_DATE, ID, MONTH), 
         on=.(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]

IIRC there's an issue filed on the GitHub project page about this as well.

IIRC还有一个关于GitHub项目页面的问题。