R: How to calculate the difference between rows in a data frame

Time: 2022-04-04 19:35:30

Here is a simple example of my problem:

> df <- data.frame(ID=1:10,Score=4*10:1)
> df
   ID Score
1   1    40
2   2    36
3   3    32
4   4    28
5   5    24
6   6    20
7   7    16
8   8    12
9   9     8
10 10     4
> diff(df)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : 
  non-numeric argument to binary operator

Can anyone tell me why this error occurs?

6 solutions

#1


27  

diff wants a matrix or a vector rather than a data frame. Try

data.frame(diff(as.matrix(df)))
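
To make the cause of the error concrete, here is a minimal base R sketch built on the question's df. A data frame is a list of columns, so the default diff method ends up subtracting one list from another, and arithmetic is not defined for lists. The variable names score_diff, all_diff and err are only illustrative.

```r
df <- data.frame(ID = 1:10, Score = 4 * 10:1)

# diff() on a plain numeric vector works as expected.
score_diff <- diff(df$Score)

# Coercing the whole data frame to a matrix lets diff() work column by column.
all_diff <- data.frame(diff(as.matrix(df)))

# Calling diff() on the data frame itself fails; capture the message
# instead of stopping the script.
err <- tryCatch(diff(df), error = function(e) conditionMessage(e))

print(score_diff)
print(all_diff)
print(err)
```

Note that as.matrix only gives a numeric matrix when every column is numeric; with a character or factor column the matrix becomes character and diff fails again.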

#2


20  

Perhaps you are looking for something like this:

> tail(df, -1) - head(df, -1)
   ID Score
2   1    -4
3   1    -4
4   1    -4
5   1    -4
6   1    -4
7   1    -4
8   1    -4
9   1    -4
10  1    -4

Two data.frames with the same dimensions can be added or subtracted element-wise. So what we are doing here is taking the data.frame that is missing its first row (tail(df, -1)) and subtracting from it the data.frame that is missing its last row (head(df, -1)).
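
The same subtraction extends to rows that are more than one position apart. The sketch below wraps it in a helper; row_diff is a hypothetical name, not a base R function, and it assumes every column is numeric.

```r
df <- data.frame(ID = 1:10, Score = 4 * 10:1)

# row_diff() is a hypothetical helper, not part of base R.
# It subtracts each row from the row `lag` positions below it.
row_diff <- function(x, lag = 1L) {
  stopifnot(is.data.frame(x), lag >= 1L, lag < nrow(x))
  tail(x, -lag) - head(x, -lag)
}

print(row_diff(df))           # same as tail(df, -1) - head(df, -1)
print(row_diff(df, lag = 2))  # differences between rows two apart
```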

#3


8  

Because diff works on a vector or a matrix, not on a data frame. You can use apply to apply the function across columns like so:

 apply( df , 2 , diff )
   ID Score
2   1    -4
3   1    -4
4   1    -4
5   1    -4
6   1    -4
7   1    -4
8   1    -4
9   1    -4
10  1    -4

It seems unlikely that you want to calculate the difference in sequential IDs, so you could choose to apply it on all columns except the first like so:

apply( df[-1] , 2 , diff )
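
If the aim is to keep ID next to the differences rather than drop it, one possible base R sketch is to store the result as a new column. The column name Diff is just an illustrative choice.

```r
df <- data.frame(ID = 1:10, Score = 4 * 10:1)

# diff() returns one fewer element than its input, so pad with a leading NA
# to line the differences up with the original rows.
df$Diff <- c(NA, diff(df$Score))

print(df)
```

The first row has no predecessor, which is why its difference is NA.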

Or you could use data.table (not that it adds anything here, I just really want to start using it!). I am again assuming that you do not want to apply diff to the ID column:

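library(data.table)
DT <- data.table(df)
# diff() returns 9 values for 10 rows, so pad with a leading NA to match the row count
DT[ , list(ID,Score,Diff=c(NA, diff(Score)))  ]
    ID Score Diff
 1:  1    40   NA
 2:  2    36   -4
 3:  3    32   -4
 4:  4    28   -4
 5:  5    24   -4
 6:  6    20   -4
 7:  7    16   -4
 8:  8    12   -4
 9:  9     8   -4
10: 10     4   -4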

And thanks to @AnandaMahto, here is an alternative syntax that gives more flexibility in choosing which columns to run it on:

DT[, lapply(.SD, diff), .SDcols = 1:2]

Here .SDcols = 1:2 means you want to apply the diff function to columns 1 and 2. If you had 20 columns and didn't want to apply it to ID, you could use .SDcols = 2:20, for example.

#4


5  

Another option, using dplyr, is to use mutate_each to loop through all the columns, take the difference between each column (.) and its lag (lag(.)), and remove the NA row at the top with na.omit(). Note that mutate_each and funs are deprecated in current dplyr releases, where mutate(across(everything(), ~ .x - lag(.x))) is the equivalent.

library(dplyr)
df %>%
    mutate_each(funs(. - lag(.))) %>%
    na.omit() 

Or with shift from data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), loop through the columns (lapply(.SD, ..)) and get the difference between the column (x) and its lag (shift by default gives the lag, as type = "lag"). Remove the first observation, i.e. the NA element.

library(data.table)
setDT(df)[, lapply(.SD, function(x) (x- shift(x))[-1])]

#5


4  

Adding this a few years later for completeness: you can also achieve this with simple [.data.frame subsetting.

df[-1, ] - df[-nrow(df), ]
#    ID Score
# 2   1    -4
# 3   1    -4
# 4   1    -4
# 5   1    -4
# 6   1    -4
# 7   1    -4
# 8   1    -4
# 9   1    -4
# 10  1    -4

#6


3  

I would like to show an alternative way of doing this kind of thing, even though I often have the feeling that doing it this way is not appreciated: using SQL.

library(sqldf)
# LIMIT 1 makes it explicit that only the closest preceding row is used
sqldf(paste("SELECT a.ID,a.Score"
            ,"      , a.Score - (SELECT b.Score"
            ,"                   FROM df b"
            ,"                   WHERE b.ID < a.ID"
            ,"                   ORDER BY b.ID DESC"
            ,"                   LIMIT 1"
            ,"                   ) diff"
            ," FROM df a"
            )
      )

The code looks complicated, but it is not, and it has some advantages, as you can see from the results:

    ID Score diff
 1   1    40 <NA>
 2   2    36 -4.0
 3   3    32 -4.0
 4   4    28 -4.0
 5   5    24 -4.0
 6   6    20 -4.0
 7   7    16 -4.0
 8   8    12 -4.0
 9   9     8 -4.0
 10 10     4 -4.0

One advantage is that you work on the original data frame (without converting it to another class) and you get a data frame back (assign it with res <- ....). Another advantage is that you still have all rows. The third advantage is that you can easily take grouping factors into account. For example:

df2 <- data.frame(ID=1:10,grp=rep(c("v","w"), each=5),Score=4*10:1)

# the extra condition a.grp = b.grp restricts the lookup to rows of the same group
sqldf(paste("SELECT a.ID,a.grp,a.Score"
            ,"      , a.Score - (SELECT b.Score"
            ,"                   FROM df2 b"
            ,"                   WHERE b.ID < a.ID"
            ,"                         AND a.grp = b.grp"
            ,"                   ORDER BY b.ID DESC"
            ,"                   LIMIT 1"
            ,"                   ) diff"
            ," FROM df2 a"
            )
      )


   ID grp Score diff
1   1   v    40 <NA>
2   2   v    36 -4.0
3   3   v    32 -4.0
4   4   v    28 -4.0
5   5   v    24 -4.0
6   6   w    20 <NA>
7   7   w    16 -4.0
8   8   w    12 -4.0
9   9   w     8 -4.0
10 10   w     4 -4.0
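
For comparison, a grouped difference that keeps all rows can also be sketched in base R with ave. This is not the answer's method, just an alternative under the assumption that the rows are already ordered by ID within each group.

```r
df2 <- data.frame(ID = 1:10, grp = rep(c("v", "w"), each = 5), Score = 4 * 10:1)

# ave() applies FUN within each level of grp and returns a vector in the
# original row order; the leading NA marks the first row of every group.
df2$diff <- ave(df2$Score, df2$grp, FUN = function(x) c(NA, diff(x)))

print(df2)
```

As in the SQL version, rows 1 and 6 get NA because they are the first rows of groups v and w.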
