The easiest way to get a vector of element frequencies in R

Time: 2022-05-01 21:29:07

Suppose I have a vector of values v. What is the easiest way to get a vector f of length equal to v, where the ith element of f is the frequency of the ith element of v in v?

The only way I know to do it seems unnecessarily complicated:

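v = sample(1:10, 100, replace=TRUE)
D = data.frame( idx=seq_along(v), v=v )   # remember each element's original position
E = merge( D, data.frame(table(v)) )      # attach the count (Freq) of each value
E = E[ order(E$idx), ]                    # merge reorders rows, so restore the input order
f = E$Freq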

Surely there's a simpler way to do this, along the lines of "frequencies(v)"?

3 Answers

#1


2  

For a vector of small positive integers v, as in the question, the expression

tabulate(v)[v]

is particularly simple as well as speedy.
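
As a small illustration of why this works (the vector below is made up for the example): tabulate(v) counts how many times each integer from 1 to max(v) occurs, and indexing that count vector by v itself looks up the count belonging to each element.

```r
# Hypothetical example vector of small positive integers
v <- c(3L, 1L, 3L, 2L, 3L, 1L)

counts <- tabulate(v)   # counts[k] is the number of times k occurs in v
f <- counts[v]          # look up each element's count, in the original order

print(counts)  # 2 1 3
print(f)       # 3 2 3 1 3 2
```

Note that this relies on the values being positive integers, since they are used directly as indices.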

For more general numerical vectors v you can persuade ecdf to help you out, as in

w <- sapply(v, ecdf(v)) * length(v)
tabulate(w)[w]

It's probably better to code the underlying algorithm yourself, though, and doing so certainly avoids the floating point rounding error implicit in the preceding solution:

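frequencies <- function(x) {
  i <- order(x)                                  # permutation that sorts x
  v <- x[i]                                      # the sorted values
  w <- cumsum(c(TRUE, v[-1] != v[-length(x)]))   # sequential id for each run of equal values
  f <- tabulate(w)[w]                            # frequency of each element, in sorted order
  return(f[order(i)])                            # undo the sort to match the input order
}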

This algorithm sorts the data, assigns sequential identifiers 1, 2, 3, ... to the values as it encounters them (by summing a binary indicator of when the values change), uses the preceding tabulate()[] trick to obtain the frequencies efficiently, and then unsorts the results to make the output match the input, component by component.
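
To make those steps concrete, here is a trace of the intermediate values on a short made-up vector. The names changed and f_sorted are introduced only for this illustration; the logic is the same as in the function above.

```r
x <- c(2.5, 0.1, 2.5, 7, 0.1, 2.5)   # made-up data for the trace

i <- order(x)                         # permutation that sorts x
v <- x[i]                             # sorted values: 0.1 0.1 2.5 2.5 2.5 7
changed <- c(TRUE, v[-1] != v[-length(v)])  # TRUE wherever a new value starts
w <- cumsum(changed)                  # group ids: 1 1 2 2 2 3
f_sorted <- tabulate(w)[w]            # counts in sorted order: 2 2 3 3 3 1
f <- f_sorted[order(i)]               # back in input order: 3 2 3 1 2 3

print(f)
```

The final order(i) is the inverse of the sorting permutation, which is what puts each count back at the position of the element it describes.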

#2


1  

Something like this works for me:

sapply(v, function(elmt, vec) sum(vec == elmt), vec=v)
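
A quick sanity check, sketched on data like that in the question (the seed is arbitrary), suggests this gives the same result as tabulate(v)[v]. Keep in mind that it compares every element against the whole vector, so the work grows quadratically with length(v).

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
v <- sample(1:10, 100, replace = TRUE)

# Count matches for each element by scanning the whole vector
f_sapply <- sapply(v, function(elmt, vec) sum(vec == elmt), vec = v)

# Same frequencies via the tabulate trick (valid here because v holds small positive integers)
f_tab <- tabulate(v)[v]

stopifnot(identical(f_sapply, f_tab))
```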

#3


1  

I think the best solution here is:

ave(v,v,FUN=length)

ave() is designed to do exactly this: it calls FUN() once per group and copies the return value back to every position of the input vector whose element belongs to that group.
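
A minimal sketch of that behaviour on made-up data follows. One caveat worth noting: ave() writes the results back into a copy of its first argument, so with a character vector the counts would be coerced to strings. Passing seq_along(s) as the first argument is one way around that; the variable names here are only for illustration.

```r
# Numeric input: each element is replaced by the size of its group
v <- c(4, 9, 4, 4, 7, 9)
f <- ave(v, v, FUN = length)
print(f)   # 3 2 3 3 1 2

# Character input: use an integer vector as the first argument so the
# counts are not coerced to character, and group by the strings
s <- c("a", "b", "a")
g <- ave(seq_along(s), s, FUN = length)
print(g)   # 2 1 2
```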
