Here is a simple example of my problem:
> df <- data.frame(ID=1:10,Score=4*10:1)
> df
   ID Score
1   1    40
2   2    36
3   3    32
4   4    28
5   5    24
6   6    20
7   7    16
8   8    12
9   9     8
10 10     4
> diff(df)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] :
non-numeric argument to binary operator
Can anyone tell me why this error occurs?
6 solutions
#1
27
diff wants a matrix or a vector rather than a data frame. Try
data.frame(diff(as.matrix(df)))
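To see why the conversion helps, the following base R sketch may be useful. As far as I understand it, diff() treats anything that is not a matrix as a plain vector, and a data frame is really a list of columns, so the subtraction ends up being attempted on lists and fails. The names score_diff and res are only illustrative.

```r
df <- data.frame(ID = 1:10, Score = 4 * 10:1)

# A data frame is a list of columns, not a matrix
is.matrix(df)
is.list(df)

# diff() works on a single numeric column (a vector)
score_diff <- diff(df$Score)

# Converting to a matrix first lets diff() difference each column row-wise
res <- data.frame(diff(as.matrix(df)))
res
```

The result has one row fewer than df, because each difference needs a preceding row.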
#2
20
Perhaps you are looking for something like this:
> tail(df, -1) - head(df, -1)
   ID Score
2   1    -4
3   1    -4
4   1    -4
5   1    -4
6   1    -4
7   1    -4
8   1    -4
9   1    -4
10  1    -4
You can add or subtract two data.frames if they have the same dimensions. So what we are doing here is taking a data.frame that is missing the first row (tail(df, -1)) and subtracting from it one that is missing the last row (head(df, -1)).
#3
8
Because diff works on a vector or matrix. You can use apply to apply the function across columns like so:
apply( df , 2 , diff )
   ID Score
2   1    -4
3   1    -4
4   1    -4
5   1    -4
6   1    -4
7   1    -4
8   1    -4
9   1    -4
10  1    -4
It seems unlikely that you want to calculate the difference in sequential IDs, so you could choose to apply it on all columns except the first like so:
apply( df[-1] , 2 , diff )
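One thing to keep in mind is that apply() converts the data frame to a matrix and returns a matrix. If a data frame is wanted back, one possible base R variant is to loop over the columns with lapply() instead. This is only a sketch; m and d are illustrative names.

```r
df <- data.frame(ID = 1:10, Score = 4 * 10:1)

# apply() coerces the data frame to a matrix and returns a matrix
m <- apply(df, 2, diff)
class(m)

# lapply() runs diff() on each column separately;
# as.data.frame() turns the resulting list back into a data frame
d <- as.data.frame(lapply(df[-1], diff))
d
```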
Or you could use data.table (not that it adds anything here, I just really want to start using it!). I am again assuming that you do not want to apply diff to the ID column:
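library(data.table)
DT <- data.table(df)
# diff() returns one value fewer than the number of rows,
# so pad with a leading NA to keep the column lengths equal
DT[ , list(ID, Score, Diff = c(NA, diff(Score))) ]
    ID Score Diff
 1:  1    40   NA
 2:  2    36   -4
 3:  3    32   -4
 4:  4    28   -4
 5:  5    24   -4
 6:  6    20   -4
 7:  7    16   -4
 8:  8    12   -4
 9:  9     8   -4
10: 10     4   -4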
And thanks to @AnandaMahto, an alternative syntax that gives more flexibility in choosing which columns to run it on could be:
DT[, lapply(.SD, diff), .SDcols = 1:2]
Here .SDcols = 1:2 means you want to apply the diff function to columns 1 and 2. If you have 20 columns and don't want to apply it to ID, you could use .SDcols = 2:20, for example.
#4
5
Another option using dplyr would be to use mutate_each to loop through all the columns, get the difference between the column (.) and the lag of the column (.), and remove the NA element at the top with na.omit(). (mutate_each() and funs() have since been deprecated in dplyr; in current versions the equivalent is mutate(across(everything(), ~ .x - lag(.x))).)
library(dplyr)
df %>%
  mutate_each(funs(. - lag(.))) %>%
  na.omit()
Or with shift from data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), loop through the columns (lapply(.SD, ..)), and get the difference between the column (x) and its lag (shift by default gives the lag, as type = "lag"). Remove the first observation, i.e. the NA element.
library(data.table)
setDT(df)[, lapply(.SD, function(x) (x- shift(x))[-1])]
#5
4
Adding this a few years later for completeness: you can also achieve this with simple [.data.frame subsetting.
df[-1, ] - df[-nrow(df), ]
#    ID Score
# 2   1    -4
# 3   1    -4
# 4   1    -4
# 5   1    -4
# 6   1    -4
# 7   1    -4
# 8   1    -4
# 9   1    -4
# 10  1    -4
#6
3
I would like to show an alternative way of doing this kind of thing, even though I often get the feeling that it is not appreciated: using SQL.
library(sqldf)
sqldf(paste("SELECT a.ID,a.Score"
," , a.Score - (SELECT b.Score"
," FROM df b"
," WHERE b.ID < a.ID"
," ORDER BY b.ID DESC"
," ) diff"
," FROM df a"
)
)
The code seems complicated, but it is not, and it has some advantages, as you can see in the results:
   ID Score diff
1   1    40 <NA>
2   2    36 -4.0
3   3    32 -4.0
4   4    28 -4.0
5   5    24 -4.0
6   6    20 -4.0
7   7    16 -4.0
8   8    12 -4.0
9   9     8 -4.0
10 10     4 -4.0
One advantage is that you use the original data frame (without converting it to another class) and you get a data frame back (assign it with res <- ....). Another advantage is that you still have all the rows. The third advantage is that you can easily take grouping factors into account. For example:
df2 <- data.frame(ID=1:10,grp=rep(c("v","w"), each=5),Score=4*10:1)
sqldf(paste("SELECT a.ID,a.grp,a.Score"
," , a.Score - (SELECT b.Score"
," FROM df2 b"
," WHERE b.ID < a.ID"
," AND a.grp = b.grp"
," ORDER BY b.ID DESC"
," ) diff"
," FROM df2 a"
)
)
   ID grp Score diff
1   1   v    40 <NA>
2   2   v    36 -4.0
3   3   v    32 -4.0
4   4   v    28 -4.0
5   5   v    24 -4.0
6   6   w    20 <NA>
7   7   w    16 -4.0
8   8   w    12 -4.0
9   9   w     8 -4.0
10 10   w     4 -4.0
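For comparison, a grouped difference that keeps all rows can also be sketched in base R with ave(). This is not part of the sqldf approach above. It assumes the rows are already sorted by ID within each group, and the column name diff_base is made up for illustration.

```r
df2 <- data.frame(ID = 1:10, grp = rep(c("v", "w"), each = 5), Score = 4 * 10:1)

# ave() applies FUN within each level of grp and returns a vector
# as long as the input; the leading NA marks the first row of each group
df2$diff_base <- ave(df2$Score, df2$grp, FUN = function(x) c(NA, diff(x)))
df2
```

As in the SQL result, the first row of each group has no predecessor and therefore gets NA.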