引导分层/多级数据(重采样集群)

时间:2021-03-09 19:16:22

I am producing a script for creating bootstrap samples from the cats dataset (from the -MASS- package).

我正在生成一个脚本,用于从cats数据集(从- mass - package)创建引导示例。

Following the Davidson and Hinkley textbook [1] I ran a simple linear regression and adopted a fundamental non-parametric procedure for bootstrapping from iid observations, namely pairs resampling.

遵循戴维森和欣克利的教科书[1],我进行了一个简单的线性回归,并采用了一个基本的非参数过程来从iid观测中进行自举,即对重采样。

The original sample is in the form:

原始样品为:

Bwt   Hwt

2.0   7.0
2.1   7.2

...

1.9    6.8

Through an univariate linear model we want to explain cats hearth weight through their brain weight.

通过单变量线性模型,我们想通过他们的大脑重量来解释猫的体重。

The code is:

的代码是:

library(MASS)
library(boot)


##################
#   CATS MODEL   #
##################

cats.lm <- glm(Hwt ~ Bwt, data=cats)
cats.diag <- glm.diag.plots(cats.lm, ret=T)


#######################
#   CASE resampling   #
#######################

cats.fit <- function(data) coef(glm(data$Hwt ~ data$Bwt)) 
statistic.coef <- function(data, i) cats.fit(data[i,]) 

bootl <- boot(data=cats, statistic=statistic.coef, R=999)

Suppose now that there exists a clustering variable cluster = 1, 2,..., 24 (for instance, each cat belongs to a given litter). For simplicity, suppose that data are balanced: we have 6 observations for each cluster. Hence, each of the 24 litters is made up of 6 cats (i.e. n_cluster = 6 and n = 144).

假设有一个聚类变量cluster = 1,2,…例如,每只猫都属于一窝猫。为了简单起见,假设数据是平衡的:我们对每个集群有6个观察结果。因此,24只幼崽中的每一只都由6只猫组成(即n_cluster = 6, n = 144)。

It is possible to create a fake cluster variable through:

可以通过以下方式创建一个伪群集变量:

q <- rep(1:24, times=6)
cluster <- sample(q)
c.data <- cbind(cats, cluster)

I have two related questions:

我有两个相关的问题:

How to simulate samples in accordance with the (clustered) dataset strucure? That is, how to resample at the cluster level? I would like to sample the clusters with replacement and to set the observations within each selected cluster as in the original dataset (i.e. sampling with replacenment the clusters and without replacement the observations within each cluster).

如何根据(群集)数据集结构来模拟样本?也就是说,如何在集群级别重新采样?我想用替换的方法对集群进行采样,并在每个选定的集群中设置原始数据集中的观察值(例如,对集群进行替换采样,而不替换每个集群中的观察值)。

This is the strategy proposed by Davidson (p. 100). Suppose we draw B = 100 samples. Each of them should be composed by 24 possibly recurrent clusters (e.g. cluster = 3, 3, 1, 4, 12, 11, 12, 5, 6, 8, 17, 19, 10, 9, 7, 7, 16, 18, 24, 23, 11, 15, 20, 1), and each cluster should contain the same 6 observations of the original dataset. How to do that in R? (either with or without the -boot- package.) Do you have alternative suggestions for proceeding?

这是戴维森提出的战略(第100页)。假设B = 100个样本值。他们每个人应该由24个可能复发集群(如集群= 3、3、1,4,12日,11日,12日,5日,6日8日,17日,19日,10日,9日,7日,7日,16日,18日,24日,23日,11日,15日,20日1),和每个集群应该包含相同的6个观测原始数据集。怎么用R表示?(带或不带-boot- package。)你对下一步有什么建议吗?

The second question concerns the initial regression model. Suppose I adopt a fixed-effects model, with cluster-level intercepts. Does it change the resampling procedure adopted?

第二个问题涉及初始回归模型。假设我采用了一个固定效果模型,具有集群级的拦截。它是否改变了所采用的重新采样程序?

[1] Davidson, A. C., Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge University press.

[1]戴维森,a . C。(1997)。引导方法及其应用。剑桥大学出版社。

2 个解决方案

#1


7  

If I understand you correctly, this what you are trying to do with c.data as input:

如果我没理解错的话,这就是c的作用。数据作为输入:

  • Resample clusters with replacement
  • 重新取样集群替换
  • Maintain the association between each cluster in the random sample and its points from the original data set (i.e. c.data)
  • 维护随机样本中每个集群与其原始数据集(即c.data)点之间的关联
  • Create a bootstrap with the sampled clusters
  • 创建带采样集群的引导程序。

Here is a script that achieve this which you can wrap into a function to repeat it R times, where R is the number of bootstrap replicates

这是一个实现这个目标的脚本,您可以将它封装到一个函数中,以重复它R次,其中R是引导复制的数量

q <- rep(1:24, times=6)
cluster <- sample(q)
c.data <- cbind(cats, cluster)

# get a vector with all clusters
c <- sort(unique(c.data$cluster))

# group the data points per cluster
clust.group <- function(c) {
    c.data[c.data$cluster==c,]
}

clust.list <- lapply(c,clust.group)

# resample clusters with replacement
c.sample <- sample(c, replace=T)

clust.sample <- clust.list[c.sample]

clust.size <- 6

# combine the cluster list back to a single data matrix
clust.bind <- function(c) {
    matrix(unlist(c),nrow=clust.size)
}

c.boot <- do.call(rbind,lapply(clust.sample,clust.bind))

# Just to maintain columns name
colnames(c.boot) <- names(c.data)

# the new data set (single bootstrap replicate)
c.boot

#2


0  

I tried to solve the problem in the following way. Although it works, it can be probably improved in terms of speed and "elegance". Also, if possible I would have preferred to find a way for using the -boot- package, as it allows to automatically compute a number of bootstrapped confidence intervals through boot.ci...

我试着用以下的方法来解决这个问题。虽然它很有效,但在速度和“优雅”方面可能会有所改进。另外,如果可能的话,我希望找到一种使用-boot- package的方法,因为它允许通过boot.ci自动计算一些引导的置信区间。

For simplicity, the starting dataset consists in 18 cats (the "lower-level" observations) nested in 6 laboratories (the clustering variable). The dataset is balanced (n_cluster = 3 for each cluster). We have one regressor, x, for explaining y.

为简单起见,启动数据集包含在6个实验室(集群变量)中嵌套的18只猫(“低级”观察)。数据集是平衡的(每个集群的n_cluster = 3)。我们有一个遗憾,x,用来解释y。

The fake dataset and the matrix where to store results are:

假数据集和存储结果的矩阵为:

  # fake sample 
  dat <- expand.grid(cat=factor(1:3), lab=factor(1:6))
  dat <- cbind(dat, x=runif(18), y=runif(18, 2, 5))

  # empty matrix for storing coefficients estimates and standard errors of x
  B <- 50 # number of bootstrap samples
  b.sample <- matrix(nrow=B, ncol=3, dimnames=list(c(), c("sim", "b_x", "se_x")))
  b.sample[,1] <- rep(1:B)

At each of the B iterations, the following loop samples 6 clusters with replacement, each composed by 3 cats sampled without replacement (i.e. the clusters' internal composition is maintained unaltered). The estimates of the regressor coefficient and of its standard error are stored in the previously created matrix:

在每一个B迭代中,下面的循环都会对6个集群进行替换,每个集群由3个没有替换的猫科动物组成(即集群的内部组成保持不变)。回归系数及其标准误差的估计值存储在先前创建的矩阵中:

  ####################################
  #   loop through "b.sample" rows   #
  ####################################

  for (i in seq(1:B)) {

  ###   sampling with replacement from the clustering variable   

    # sampling with replacement from "cluster" 
    cls <- sample(unique(dat$lab), replace=TRUE)
    cls.col <- data.frame(lab=cls)

    # reconstructing the overall simulated sample
    cls.resample <- merge(cls.col, dat, by="lab")


  ###   fitting linear model to simulated data    

    # model fit
    mod.fit <- function(data) glm(data$y ~ data$x)

    # estimated coefficients and standard errors
    b_x <- summary(mod.fit(data=cls.resample))$coefficients[2,1]
    se_x <- summary(mod.fit(data=cls.resample))$coefficients[2,2]

    b.sample[i,2] <- b_x
    b.sample[i,3] <- se_x

  }

The final bootstrapped standard error is:

最后的引导标准错误是:

 boot_se_x <- sum(b.sample[,3])/(B-1) 
 boot_se_x

Is there any way for adopting the -boot- package for doing this?

有没有办法采用-boot- package来实现这一点?

Also, with respect to the adoption of fixed-effects at cluster level instead of the simple linear regression, my main doubt is that in some simulated samples it can happen that we have not selected some of the initial clusters, so that the related cluster-specific intercepts cannot be identified. If you have a look at the code I posted, it should not be a problem from a "mechanical" point of view (at each iteration we can fit a different FE model with just the sampled clusters' intercepts).

此外,对于采用集群级别的固定效果而不是简单的线性回归,我的主要疑问是,在一些模拟样本中,我们可能没有选择一些初始的集群,因此无法识别相关的特定于集群的拦截。如果您看过我所发布的代码,那么从“机械”的角度来看,它不应该是一个问题(在每次迭代中,我们可以使用抽样的集群的拦截来拟合不同的FE模型)。

I was wondering whether there is a statistical issue in all of this

我想知道这里面是否有统计问题。

#1


7  

If I understand you correctly, this what you are trying to do with c.data as input:

如果我没理解错的话,这就是c的作用。数据作为输入:

  • Resample clusters with replacement
  • 重新取样集群替换
  • Maintain the association between each cluster in the random sample and its points from the original data set (i.e. c.data)
  • 维护随机样本中每个集群与其原始数据集(即c.data)点之间的关联
  • Create a bootstrap with the sampled clusters
  • 创建带采样集群的引导程序。

Here is a script that achieve this which you can wrap into a function to repeat it R times, where R is the number of bootstrap replicates

这是一个实现这个目标的脚本,您可以将它封装到一个函数中,以重复它R次,其中R是引导复制的数量

q <- rep(1:24, times=6)
cluster <- sample(q)
c.data <- cbind(cats, cluster)

# get a vector with all clusters
c <- sort(unique(c.data$cluster))

# group the data points per cluster
clust.group <- function(c) {
    c.data[c.data$cluster==c,]
}

clust.list <- lapply(c,clust.group)

# resample clusters with replacement
c.sample <- sample(c, replace=T)

clust.sample <- clust.list[c.sample]

clust.size <- 6

# combine the cluster list back to a single data matrix
clust.bind <- function(c) {
    matrix(unlist(c),nrow=clust.size)
}

c.boot <- do.call(rbind,lapply(clust.sample,clust.bind))

# Just to maintain columns name
colnames(c.boot) <- names(c.data)

# the new data set (single bootstrap replicate)
c.boot

#2


0  

I tried to solve the problem in the following way. Although it works, it can be probably improved in terms of speed and "elegance". Also, if possible I would have preferred to find a way for using the -boot- package, as it allows to automatically compute a number of bootstrapped confidence intervals through boot.ci...

我试着用以下的方法来解决这个问题。虽然它很有效,但在速度和“优雅”方面可能会有所改进。另外,如果可能的话,我希望找到一种使用-boot- package的方法,因为它允许通过boot.ci自动计算一些引导的置信区间。

For simplicity, the starting dataset consists in 18 cats (the "lower-level" observations) nested in 6 laboratories (the clustering variable). The dataset is balanced (n_cluster = 3 for each cluster). We have one regressor, x, for explaining y.

为简单起见,启动数据集包含在6个实验室(集群变量)中嵌套的18只猫(“低级”观察)。数据集是平衡的(每个集群的n_cluster = 3)。我们有一个遗憾,x,用来解释y。

The fake dataset and the matrix where to store results are:

假数据集和存储结果的矩阵为:

  # fake sample 
  dat <- expand.grid(cat=factor(1:3), lab=factor(1:6))
  dat <- cbind(dat, x=runif(18), y=runif(18, 2, 5))

  # empty matrix for storing coefficients estimates and standard errors of x
  B <- 50 # number of bootstrap samples
  b.sample <- matrix(nrow=B, ncol=3, dimnames=list(c(), c("sim", "b_x", "se_x")))
  b.sample[,1] <- rep(1:B)

At each of the B iterations, the following loop samples 6 clusters with replacement, each composed by 3 cats sampled without replacement (i.e. the clusters' internal composition is maintained unaltered). The estimates of the regressor coefficient and of its standard error are stored in the previously created matrix:

在每一个B迭代中,下面的循环都会对6个集群进行替换,每个集群由3个没有替换的猫科动物组成(即集群的内部组成保持不变)。回归系数及其标准误差的估计值存储在先前创建的矩阵中:

  ####################################
  #   loop through "b.sample" rows   #
  ####################################

  for (i in seq(1:B)) {

  ###   sampling with replacement from the clustering variable   

    # sampling with replacement from "cluster" 
    cls <- sample(unique(dat$lab), replace=TRUE)
    cls.col <- data.frame(lab=cls)

    # reconstructing the overall simulated sample
    cls.resample <- merge(cls.col, dat, by="lab")


  ###   fitting linear model to simulated data    

    # model fit
    mod.fit <- function(data) glm(data$y ~ data$x)

    # estimated coefficients and standard errors
    b_x <- summary(mod.fit(data=cls.resample))$coefficients[2,1]
    se_x <- summary(mod.fit(data=cls.resample))$coefficients[2,2]

    b.sample[i,2] <- b_x
    b.sample[i,3] <- se_x

  }

The final bootstrapped standard error is:

最后的引导标准错误是:

 boot_se_x <- sum(b.sample[,3])/(B-1) 
 boot_se_x

Is there any way for adopting the -boot- package for doing this?

有没有办法采用-boot- package来实现这一点?

Also, with respect to the adoption of fixed-effects at cluster level instead of the simple linear regression, my main doubt is that in some simulated samples it can happen that we have not selected some of the initial clusters, so that the related cluster-specific intercepts cannot be identified. If you have a look at the code I posted, it should not be a problem from a "mechanical" point of view (at each iteration we can fit a different FE model with just the sampled clusters' intercepts).

此外,对于采用集群级别的固定效果而不是简单的线性回归,我的主要疑问是,在一些模拟样本中,我们可能没有选择一些初始的集群,因此无法识别相关的特定于集群的拦截。如果您看过我所发布的代码,那么从“机械”的角度来看,它不应该是一个问题(在每次迭代中,我们可以使用抽样的集群的拦截来拟合不同的FE模型)。

I was wondering whether there is a statistical issue in all of this

我想知道这里面是否有统计问题。