I am trying to simulate a three-variable dataset so that I can run linear regression models on it. 'X1' and 'X2' would be continuous independent variables (mean=0, sd=1), and 'Y' would be the continuous dependent variable.
The regression model should produce coefficients like this: Y = 5 + 3(X1) - 2(X2)
I would like to simulate this dataset such that the resulting regression model has an R-squared value of 0.2. How can I determine the value of 'sd.value' so that the regression model has this R-squared?
n <- 200
set.seed(101)
sd.value <- 1
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
simdata <- data.frame(X1, X2, Y)
summary(lm(Y ~ X1 + X2, data=simdata))
4 Solutions
#1
6
Take a look at this code, it should be close enough to what you want:
simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
  stopifnot(length(beta) == 3)
  df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs))  # x1 and x2 are independent
  # noise variance that yields the desired population R-squared
  var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
  stopifnot(var.epsilon > 0)
  df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
  df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
  return(df)
}

# fit the model on freshly simulated data and return the R-squared achieved
get.R.sq <- function(desired) {
  model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
  return(summary(model)$r.squared)
}

# compare desired and achieved R-squared over a grid of targets
df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
plot(df)
abline(a=0, b=1, col="red", lty=2)  # reference line: actual = desired
Basically, your question comes down to finding the expression for var.epsilon. Since y = b1 + b2*x1 + b3*x2 + epsilon, and x1, x2, and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where var[x1] = var[x2] = 1 by assumption. Because R-squared = 1 - var[eps] / var[y], you can then solve for var[eps] as a function of R-squared, which gives the var.epsilon line in the code.
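As a worked version of the derivation above, here is the algebra written out for the question's coefficients (b2 = 3, b3 = -2) and target R-squared of 0.2. This is a population-level result, so any single finite sample will only come close to it.

```latex
\begin{aligned}
R^2 &= 1 - \frac{\operatorname{Var}[\varepsilon]}{\operatorname{Var}[y]}
     = \frac{b_2^2 + b_3^2}{b_2^2 + b_3^2 + \operatorname{Var}[\varepsilon]} \\
\operatorname{Var}[\varepsilon] &= \left(b_2^2 + b_3^2\right)\frac{1 - R^2}{R^2} \\
 &= \left(3^2 + (-2)^2\right)\frac{1 - 0.2}{0.2} = 52,
 \qquad \text{sd.value} = \sqrt{52} \approx 7.21
\end{aligned}
```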
#2
2
So the formula for R^2 is 1 - var(residual)/var(total).
In this case, the variance of Y is going to be 3^2 + 2^2 + sd.value^2, since we are adding three independent random variables. And, asymptotically, the residual variance is simply sd.value^2.
So you can compute R-squared explicitly with this function:
rsq <- function(x) { 1 - x^2 / (9 + 4 + x^2) }
With a little algebra, you can compute the inverse of this function:
rsqi <- function(x) { sqrt(13) * sqrt((1 - x) / x) }
So setting sd.value <- rsqi(rsquared) should give you what you want.
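For the R-squared of 0.2 asked about in the question, the inverse function can be applied directly. The snippet below is a minimal sketch: it repeats rsqi so that it runs on its own, and the large sample size and the seed are arbitrary choices made here so that the sample R-squared lands close to the population value.

```r
# Inverse of the population R-squared formula for Y = 5 + 3*X1 - 2*X2 + noise
rsqi <- function(x) sqrt(13) * sqrt((1 - x) / x)

sd.value <- rsqi(0.2)   # equals sqrt(52), about 7.21

# Check on one large simulated sample (n and the seed are arbitrary choices)
set.seed(101)
n <- 1e5
X1 <- rnorm(n)
X2 <- rnorm(n)
Y <- rnorm(n, 5 + 3 * X1 - 2 * X2, sd.value)
r2 <- summary(lm(Y ~ X1 + X2))$r.squared
print(c(sd.value = sd.value, r.squared = r2))
```

With n = 200 as in the question, the same sd.value gives an R-squared that scatters more widely around 0.2 from one sample to the next.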
We can test this as follows:
# uses n, X1 and X2 as defined in the question's code
simrsq <- function(x) {
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), rsqi(x))
  simdata <- data.frame(X1, X2, Y)
  summary(lm(Y ~ X1 + X2, data=simdata))$r.squared
}
> meanrsq <- rep(0, 9)
> for (i in 1:50)
+   meanrsq <- meanrsq + Vectorize(simrsq)((1:9)/10)
> meanrsq/50
[1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
[8] 0.7999349 0.8977187
So it looks to be correct.
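The averages above match the targets only approximately, because each sample's R-squared fluctuates. If the goal is one dataset whose fitted model has exactly the target R-squared, one possible variation (not part of the answer above, and offered only as a sketch) is to make the noise orthogonal to the predictors and rescale it. The trade-off is that the noise is then no longer an independent normal draw with a fixed sd.value.

```r
set.seed(101)
n <- 200
target <- 0.2
X1 <- rnorm(n)
X2 <- rnorm(n)
signal <- 5 + 3 * X1 - 2 * X2

# Raw noise with its mean and its linear dependence on X1 and X2 removed
e <- resid(lm(rnorm(n) ~ X1 + X2))

# Rescale the noise so that SS_residual / SS_total = 1 - target
ss.signal <- sum((signal - mean(signal))^2)
e <- e * sqrt(ss.signal * (1 - target) / target / sum(e^2))

Y <- signal + e
fit <- lm(Y ~ X1 + X2)
print(summary(fit)$r.squared)
print(coef(fit))
```

Because the rescaled noise is orthogonal to the intercept and both predictors, the fitted coefficients reproduce 5, 3 and -2 up to floating-point error, and the R-squared equals the target in this particular sample.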
#3
2
This is how I would do it (a blind iterative algorithm that assumes no prior knowledge, for when you are purely interested in "how to simulate this"):
simulate.sd <- function(nsim=10, n=200, seed=101, tol=0.01) {
  set.seed(seed)
  sd.value <- 1
  rsquare <- 1:nsim   # placeholders; every value lies outside 0.2 +/- tol
  results <- 1:nsim
  for (i in 1:nsim) {
    # tracking iteration: if we keep missing the value, abort after 3000 steps
    iter <- 0
    while (rsquare[i] > (0.20 + tol) || rsquare[i] < (0.2 - tol)) {
      sd.value <- sd.value + 0.01
      rsquare[i] <- simulate.sd.iter(sd.value, n)
      iter <- iter + 1
      if (iter > 3000) { break }
    }
    results[i] <- sd.value  # store the current sd.value that is OK!
    sd.value <- 1
  }
  cbind(results, rsquare)
}

simulate.sd.iter <- function(sd.value, n=200) {  # helper function
  # Takes the sd.value, creates data, and returns the r-squared
  X1 <- rnorm(n, 0, 1)
  X2 <- rnorm(n, 0, 1)
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
  simdata <- data.frame(X1, X2, Y)
  return(summary(lm(Y ~ X1 + X2, data=simdata))$r.squared)
}

simulate.sd()
A few things to note:
- I let X1 and X2 vary between iterations, since this affects the sd.value being sought.
- The tolerance is how exact you want this estimate to be. Are you fine with an r-squared of ~0.19 or ~0.21? Then set the tolerance to 0.01.
- Note that too tight a tolerance might prevent you from finding a result.
- The value of 1 is quite a bad starting value, which makes this iterative algorithm quite slow.
The resulting vector of 10 results is:
[1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55
This takes roughly 13 seconds on my machine.
My next step would be to start from 4.5, add 0.001 per iteration instead of 0.01, and perhaps lower the tolerance. Good luck!
Alright, some summary statistics for nsim=100, taking 150 seconds, with a step size of 0.001 and the tolerance still at 0.01:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.513   4.913   5.036   5.018   5.157   5.393
Why are you interested in this though?
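A possible variation on the step-by-step search is to hand the problem to a root finder. The sketch below is an assumption-laden alternative, not the method used above: the helper rsq.at is a hypothetical name, and it resets the seed on every call so that the same underlying random draws are reused and the R-squared becomes a smooth function of sd.value, which is what uniroot needs. The search interval c(1, 20) is also just a guess that brackets the target.

```r
# Hypothetical helper: R-squared for a given sd.value. The seed is reset on
# each call, so the same underlying draws are reused for every sd.value.
rsq.at <- function(sd.value, n = 200, seed = 101) {
  set.seed(seed)
  X1 <- rnorm(n)
  X2 <- rnorm(n)
  Y <- rnorm(n, 5 + 3 * X1 - 2 * X2, sd.value)
  summary(lm(Y ~ X1 + X2))$r.squared
}

# Find the sd.value at which this particular sample has R-squared = 0.2
root <- uniroot(function(s) rsq.at(s) - 0.2, interval = c(1, 20), tol = 1e-6)
print(root$root)
print(rsq.at(root$root))
```

The value found is specific to the seed and sample size; it can be compared with the closed-form rsqi(0.2) from answer #2, which describes the population rather than one sample.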
#4
-1
Here is another piece of code to generate a multiple linear regression with normally distributed errors. Oops, sorry: this code only simulates a multiple regression; it does not target a particular R-squared.
sim.regression <- function(n.obs=10, coefficients=runif(10,-5,5), s.deviation=.1) {
  n.var <- length(coefficients)
  M <- matrix(0, ncol=n.var, nrow=n.obs)
  beta <- as.matrix(coefficients)
  for (i in 1:n.var) {
    M[,i] <- rnorm(n.obs, 0, 1)  # independent standard normal predictors
  }
  y <- M %*% beta + rnorm(n.obs, 0, s.deviation)  # no intercept term
  return(list(x=M, y=y, coeff=coefficients))
}
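To connect this kind of simulation back to the question, the variance argument from answer #1 extends to any number of independent, unit-variance predictors: the noise standard deviation becomes sqrt(sum(coefficients^2) * (1 - R.sq) / R.sq). The sketch below assumes exactly that setup; sd.for.rsq is a hypothetical helper name, and the coefficients, seed, and sample size are arbitrary example values.

```r
# Hypothetical helper: noise sd that gives a target population R-squared when
# the predictors are independent with variance 1
sd.for.rsq <- function(coefficients, R.sq) {
  sqrt(sum(coefficients^2) * (1 - R.sq) / R.sq)
}

set.seed(101)
n.obs <- 1e5
coefficients <- c(3, -2, 1.5)                 # arbitrary example values
s.deviation <- sd.for.rsq(coefficients, 0.2)

# Same data-generating scheme as sim.regression: standard normal predictors,
# no intercept, normal noise
M <- matrix(rnorm(n.obs * length(coefficients)), nrow = n.obs)
y <- as.vector(M %*% coefficients + rnorm(n.obs, 0, s.deviation))
r2 <- summary(lm(y ~ M))$r.squared
print(c(s.deviation = s.deviation, r.squared = r2))
```

The resulting s.deviation could be passed as the s.deviation argument of sim.regression; if the predictors were correlated or had other variances, the formula would need to be adjusted.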