How to use the sweep function

Time: 2022-08-02 07:39:50

When I look at the source of R packages, I see the function sweep used quite often. Sometimes it's used when a simpler function would have sufficed (e.g., apply); other times, it's impossible to know exactly what it's doing without spending a fair amount of time stepping through the code block it's in.

The fact that I can reproduce sweep's effect using a simpler function suggests that I don't understand sweep's core use cases, and the fact that this function is used so often suggests that it's quite useful.

The context:

sweep is a function in R's standard library; its arguments are:

    sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    # x is the data
    # STATS refers to the summary statistics which you wish to 'sweep out'
    # FUN is the function used to carry out the sweep, "-" is the default

As you can see, the arguments are similar to those of apply, though sweep requires one more parameter, STATS.

Another key difference is that sweep returns an array of the same shape as the input array, whereas the result returned by apply depends on the function passed in.
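To make the shape difference concrete, here is a minimal sketch. The matrix and the variable names (col_means, centered) are illustrative choices, not taken from any package source.

```r
M <- matrix(1:12, ncol = 3)

# apply with a summary function collapses each column to a single value
col_means <- apply(M, 2, mean)
length(col_means)   # one value per column

# sweep returns an array with the same dimensions as its input
centered <- sweep(M, 2, col_means, FUN = "-")
dim(centered)       # same as dim(M)
```

In this sketch apply produces the statistic and sweep consumes it, which is one common way the two functions end up side by side.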

sweep in action:

    # e.g., use 'sweep' to express a given matrix in terms of distance from
    # the respective column mean

    # create some data:
    M = matrix(1:12, ncol = 3)

    # calculate column-wise mean for M
    dx = colMeans(M)

    # now 'sweep' that summary statistic from M
    sweep(M, 2, dx, FUN = "-")

         [,1] [,2] [,3]
    [1,] -1.5 -1.5 -1.5
    [2,] -0.5 -0.5 -0.5
    [3,]  0.5  0.5  0.5
    [4,]  1.5  1.5  1.5

So in sum, what I'm looking for is an exemplary use case or two for sweep.

Please, do not recite or link to the R Documentation, mailing lists, or any of the 'primary' R sources--assume I've read them. What I'm interested in is how experienced R programmers/analysts use sweep in their own code.

5 Solutions

#1


58  

sweep is typically used when you operate on a matrix by row or by column, and the other input of the operation is a different value for each row / column. Whether you operate by row or column is defined by MARGIN, as for apply. The values used for what I called "the other input" are defined by STATS. So, for each row (or column), you take a value from STATS and use it in the operation defined by FUN.

For instance, if you want to add 1 to the 1st row, 2 to the 2nd, etc. of the matrix you defined, you would do:

    sweep(M, 1, 1:4, "+")
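One way to check what that call does is to compare it with an explicit loop over the rows. This is only a sketch: M is rebuilt here so the snippet runs on its own, and by_row / by_hand are illustrative names.

```r
M <- matrix(1:12, ncol = 3)

# MARGIN = 1: the i-th element of STATS is combined with the i-th row
by_row <- sweep(M, 1, 1:4, "+")

# The same result spelled out row by row
by_hand <- M
for (i in seq_len(nrow(M))) {
  by_hand[i, ] <- M[i, ] + i
}

identical(by_row, by_hand)
```

With MARGIN = 2 the same pattern would apply per column, and STATS would then need one value per column instead.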

Frankly, I did not understand the definition in the R documentation either; I just learned by looking up examples.

#2


15  

sweep() can be great for systematically manipulating a large matrix either column by column or row by row, as shown below:

    > print(size)
         Weight Waist Height
    [1,]    130    26    140
    [2,]    110    24    155
    [3,]    118    25    142
    [4,]    112    25    175
    [5,]    128    26    170

    > sweep(size, 2, c(10, 20, 30), "+")
         Weight Waist Height
    [1,]    140    46    170
    [2,]    120    44    185
    [3,]    128    45    172
    [4,]    122    45    205
    [5,]    138    46    200

Granted, this example is simple, but by changing the STATS and FUN arguments, other manipulations are possible.
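As a sketch of such variations, the same data can be rescaled by each column's maximum, or capped at a per-column limit by passing pmin as FUN. The size matrix is re-entered from the printout above; the cap values c(120, 25, 160) and the names relative and capped are made up for illustration.

```r
size <- matrix(c(130, 110, 118, 112, 128,
                 26, 24, 25, 25, 26,
                 140, 155, 142, 175, 170),
               ncol = 3,
               dimnames = list(NULL, c("Weight", "Waist", "Height")))

# Divide every column by its own maximum
col_max  <- apply(size, 2, max)
relative <- sweep(size, 2, col_max, "/")

# FUN can be any vectorised two-argument function, e.g. pmin to cap values
capped <- sweep(size, 2, c(120, 25, 160), pmin)
```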

#3


7  

This question is a bit old, but I recently faced this problem myself. A typical use of sweep can be found in the source code for the stats function cov.wt, used for computing weighted covariance matrices. I'm looking at the code in R 3.0.1. Here sweep is used to subtract out column means before computing the covariance. On line 19 of the code the centering vector is derived:

    center <- if (center)
        colSums(wt * x)
    else 0

and on line 54 it is swept out of the matrix:

    x <- sqrt(wt) * sweep(x, 2, center, check.margin = FALSE)

The author of the code is using the default value FUN = "-", which confused me for a while.
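The two quoted lines can be tried outside of cov.wt. The sketch below assumes weights that already sum to 1 and the "unbiased" scaling factor 1 - sum(wt^2); the data, seed, and variable names are invented for illustration, and the final comparison against cov.wt is a sanity check rather than a restatement of its full source.

```r
set.seed(42)
x  <- matrix(rnorm(20), ncol = 2)
wt <- runif(10)
wt <- wt / sum(wt)                      # normalise weights to sum to 1

center <- colSums(wt * x)               # weighted column means
xc <- sqrt(wt) * sweep(x, 2, center)    # FUN defaults to "-"

# Weighted covariance under the assumed "unbiased" scaling
cov_manual <- crossprod(xc) / (1 - sum(wt^2))

all.equal(cov_manual, cov.wt(x, wt = wt)$cov, check.attributes = FALSE)
```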

#4


2  

One use is when you're computing weighted sums for an array. Where rowSums or colSums can be assumed to mean 'weights=1', sweep can be used prior to this to give a weighted result. This is particularly useful for arrays with >=3 dimensions.

This comes up e.g. when calculating a weighted covariance matrix as per @James King's example.

Here's another based on a current project:

    set.seed(1)
    ## 2x2x2 array
    a1 <- array(as.integer(rnorm(8, 10, 5)), dim=c(2, 2, 2))
    ## 'element-wise' sum of matrices
    ## weights = 1
    rowSums(a1, dims=2)
    ## weights
    w1 <- c(3, 4)
    ## a1[, , 1] * 3;  a1[, , 2] * 4
    a1 <- sweep(a1, MARGIN=3, STATS=w1, FUN="*")
    rowSums(a1, dims=2)
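To see that sweeping along the third margin really weights each slice, the result can be compared with the slices multiplied by hand. This sketch uses a deterministic array (1:8) instead of random data so the numbers are easy to follow; the names a, w, weighted and by_hand are illustrative.

```r
a <- array(1:8, dim = c(2, 2, 2))
w <- c(3, 4)

# Weight each 2x2 slice, then sum across the third dimension
weighted <- rowSums(sweep(a, MARGIN = 3, STATS = w, FUN = "*"), dims = 2)

# The same computation written out slice by slice
by_hand <- a[, , 1] * 3 + a[, , 2] * 4

all.equal(weighted, by_hand)
```

For arrays with more slices the hand-written version grows with the number of slices, while the sweep call stays the same.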

#5


1  

You could use the sweep function to scale and center data, as in the following code. Note that the means and sds are arbitrary here (you may have some reference values against which you want to standardize the data):

    # draw exactly 25 values so the data fills the 5x5 matrix
    df <- matrix(sample.int(150, size = 25, replace = FALSE), 5, 5)
    df_means <- colMeans(df)
    df_sds <- apply(df, 2, sd)
    df_T <- sweep(sweep(df, 2, df_means, "-"), 2, df_sds, "/") * 10 + 50

This code converts raw scores to T scores (with mean=50 and sd=10):

    > df
         [,1] [,2] [,3] [,4] [,5]
    [1,]  109    8   89   69   15
    [2,]   85   13   25  150   26
    [3,]   30   79   48    1  125
    [4,]   56   74   23  140  100
    [5,]  136  110  112   12   43

    > df_T
             [,1]     [,2]     [,3]     [,4]     [,5]
    [1,] 56.15561 39.03218 57.46965 49.22319 40.28305
    [2,] 50.42946 40.15594 41.31905 60.87539 42.56695
    [3,] 37.30704 54.98946 47.12317 39.44109 63.12203
    [4,] 43.51037 53.86571 40.81435 59.43685 57.93136
    [5,] 62.59752 61.95672 63.27377 41.02349 46.09661
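When the means and sds come from the data itself, as they do here, the nested sweep calls should agree with base R's scale(). The sketch below checks that; the seed and the names t_sweep and t_scale are illustrative additions, not part of the answer.

```r
set.seed(1)
df <- matrix(sample.int(150, size = 25), 5, 5)

# Center and scale with two sweeps, then map to the T-score scale
t_sweep <- sweep(sweep(df, 2, colMeans(df), "-"), 2, apply(df, 2, sd), "/") * 10 + 50

# scale() centers by column means and divides by column sds by default
t_scale <- scale(df) * 10 + 50

all.equal(t_sweep, t_scale, check.attributes = FALSE)
```

The sweep form keeps its value when the reference means and sds come from somewhere else, such as published norms, because they are passed in explicitly as STATS.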
