I am new to R and was wondering what the best way is to do the following.
My actual problem is a multivariate regression model, but it's a fairly large dataset (>5000 rows and 12 columns), so I've designed an analogous smaller problem. The solution to the problem below can be adapted to solve my actual problem. Any help (including on speed issues) would be greatly appreciated. I have the following two data frames, d1 and d2:
d1 -
sno letter age
1 a 29
2 b 30
3 a 33
4 b 22
5 c 25
d2-
letter marks
a 40
b 90
c 60
Now, I want to work out from d2 whether a, b and c have passed or failed, using marks_code, and then include the corresponding grades in d1. So my final output should look like this:
d1 -
sno letter age grade
1 a 29 0
2 b 30 1
3 a 33 0
4 b 22 1
5 c 25 1
Below is the code I wrote (I'm not getting the result I want!):
d1 <- data.frame(cbind(1:5,c("a","b","a","b","c"),c(29,30,33,22,25)),stringsAsFactors=FALSE )
colnames(d1) <- c("sno","letter","age")
d2 <- data.frame(cbind(c("a","b","c"),c(40,90,60)),stringsAsFactors=FALSE)
colnames(d2) <- c("letter","marks")
d2$grade <- rep(NA,3) #initialising the vector
d2$grade <- sapply(d2$marks,marks_code)
d1$grade <- rep(NA,5)
d1_coding(d1$letter)
d1_coding <- function(y1)
{
  letter_names <- unique(y1)
  m <- length(letter_names)
  for(i in 1:m)
  {
    sub <- subset(d1,d1$letter==letter_name[i])
    num_obs <- length(sub$sno)
    sub$grade <- rep(d2$grade[i],num_obs)
    merge(d1,sub,by="sno")
  }
  return(d1)
}
marks_code <- function(y)
{
  a <-NA
  if(y<=40)
    a <- 0#fail
  else
    a<- 1#pass
  return(a)
}
Thanks a lot in advance! :)
3 solutions
#1
1
Using data.table:
require(data.table)
d1 <- as.data.table(d1)
d2 <- as.data.table(d2)
setkey(d1, "letter")
setkey(d2, "letter")
out <- d2[d1][, grade := (marks > 40) * 1]
setcolorder(out, c("letter", "sno", "age", "marks", "grade"))
# letter sno age marks grade
# 1: a 1 29 40 0
# 2: a 3 33 40 0
# 3: b 2 30 90 1
# 4: b 4 22 90 1
# 5: c 5 25 60 1
If you want the original order, you can set the key back to "sno":
setkey(out, "sno")
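As a variation on the keyed join above, the grade can also be added to d1 in place with an update join, which leaves the rows of d1 in their original order and does not carry the marks column over. This is only a sketch: it assumes the data.table package is installed, and it builds d1 and d2 directly with numeric columns rather than through cbind().

```r
library(data.table)

# Example data from the question, built with proper column types
d1 <- data.table(sno = 1:5,
                 letter = c("a", "b", "a", "b", "c"),
                 age = c(29, 30, 33, 22, 25))
d2 <- data.table(letter = c("a", "b", "c"),
                 marks = c(40, 90, 60))

# Update join: match rows of d1 to d2 by letter and add grade by reference.
# i.marks refers to the marks column of d2 (the table in the i position).
d1[d2, on = "letter", grade := as.integer(i.marks > 40)]

print(d1)
```

Because `:=` modifies d1 by reference, no reassignment or reordering step is needed afterwards.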
#2
0
You should use ifelse for this because, unlike if, it is vectorized.
d1 <- read.table(text=" sno letter age
1 a 29
2 b 30
3 a 33
4 b 22
5 c 25",header=TRUE)
d2 <- read.table(text=" letter marks
a 40
b 90
c 60",header=TRUE)
res <- merge(d1,d2)
res$grade <- ifelse(res$marks <= 40, 0, 1)
res <- res[order(res$sno),]
# letter sno age marks grade
# 1 a 1 29 40 0
# 3 b 2 30 90 1
# 2 a 3 33 40 0
# 4 b 4 22 90 1
# 5 c 5 25 60 1
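To make the point about vectorization concrete, here is a small sketch on the marks from d2. It compares ifelse(), a plain logical comparison, and the question's scalar marks_code function applied element by element. The variable names grade_ifelse, grade_logical and grade_scalar are only illustrative.

```r
marks <- c(40, 90, 60)

# ifelse() tests every element and returns a vector of the same length
grade_ifelse <- ifelse(marks <= 40, 0, 1)

# A logical comparison converted to integer gives the same pass/fail coding
grade_logical <- as.integer(marks > 40)

# if() expects a single TRUE or FALSE, so a function built on it has to be
# applied one element at a time, for example with vapply()
marks_code <- function(y) if (y <= 40) 0 else 1
grade_scalar <- vapply(marks, marks_code, numeric(1))

print(grade_ifelse)
print(grade_logical)
print(grade_scalar)
```

All three give the same coding here; the first two avoid the per-element function call, which tends to matter more as the data grows.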
#3
0
Here's a different approach:
d1$grade <-
as.numeric(sapply(d1$letter, FUN=function(z) d2[d2$letter==z,"marks"]>40))
And another, without sapply:
d1$grade <-
as.numeric(d2$marks[pmatch(d1$letter, d2$letter, duplicates.ok=TRUE)] > 40)
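Since pmatch() does partial matching, a lookup on exact keys can also be written with match(), or with a named vector. The sketch below assumes d1 and d2 are built with data.frame() directly so that marks is numeric. The names idx, marks_by_letter and grade_named are only illustrative.

```r
d1 <- data.frame(sno = 1:5,
                 letter = c("a", "b", "a", "b", "c"),
                 age = c(29, 30, 33, 22, 25),
                 stringsAsFactors = FALSE)
d2 <- data.frame(letter = c("a", "b", "c"),
                 marks = c(40, 90, 60),
                 stringsAsFactors = FALSE)

# match() gives, for each letter in d1, the position of that letter in d2
idx <- match(d1$letter, d2$letter)
d1$grade <- as.integer(d2$marks[idx] > 40)

# The same lookup through a vector of marks named by letter
marks_by_letter <- setNames(d2$marks, d2$letter)
grade_named <- as.integer(marks_by_letter[d1$letter] > 40)

print(d1)
```

Both versions keep d1 in its original row order, and letters in d1 that are missing from d2 would get NA rather than an error.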