How to calculate the proportion of data inside and outside an interval in R?

Time: 2022-08-10 02:55:48

I have the following data:

Frequency = 260
[1] -9.326550e-03
[2] -4.422175e-03
[3]  9.003794e-03
[4] -1.778217e-03
[5] -4.676712e-03
[6]  1.242704e-02
[7]  5.759863e-03

I want to count how many of these values fall between the following bounds:

Frequency = 260
           [,1]         [,2]
[1]          NA           NA
[2] 0.010363147 -0.010363147
[3] 0.010072569 -0.010072569
[4] 0.010018997 -0.010018997
[5] 0.009700522 -0.009700522
[6] 0.009476024 -0.009476024
[7] 0.009748085 -0.009748085

I have to do this in R, but I'm a beginner. Thanks in advance!

3 answers

#1


Unless I misunderstand -- you want the number of times the j-th element of your first object is between the two elements of the j-th row of the second? If so,

sum((data1 > data2[,2]) & (data1 < data2[,1]), na.rm = TRUE) / length(data1)

Will do it.
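
As a quick check against the numbers posted in the question, here is a minimal sketch of the row-by-row comparison. It assumes the seven bound rows are listed in order and pair up with the seven values; the names x, bounds, lo, hi and the n_/prop_ variables are made up for illustration. Because the question's matrix has the upper bound in column 1 and the lower bound in column 2, the sketch uses pmin/pmax so the column order does not matter.

```r
# Values from the question (7 observations).
x <- c(-9.326550e-03, -4.422175e-03, 9.003794e-03, -1.778217e-03,
       -4.676712e-03, 1.242704e-02, 5.759863e-03)

# Bounds from the question: column 1 is the upper bound, column 2 the lower.
upper  <- c(NA, 0.010363147, 0.010072569, 0.010018997,
            0.009700522, 0.009476024, 0.009748085)
bounds <- cbind(upper, -upper)

# Row-wise lower and upper bound, whichever column they sit in.
lo <- pmin(bounds[, 1], bounds[, 2])
hi <- pmax(bounds[, 1], bounds[, 2])

# TRUE/FALSE per row; NA where the bounds are missing (row 1).
inside <- x > lo & x < hi

n_inside <- sum(inside, na.rm = TRUE)  # rows strictly between their bounds
n_valid  <- sum(!is.na(inside))        # rows that could be compared at all

prop_inside  <- n_inside / n_valid
prop_outside <- 1 - prop_inside
```

With these numbers, 5 of the 6 comparable values are inside their bounds (the first row has no bounds), so prop_inside is 5/6. Dividing by length(x) instead would count the NA row as outside and give 5/7.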

#2


Here's one approach using foverlaps from the package data.table, with the following toy data sets:

library(data.table)
##
set.seed(123)
ts1 <- data.table(
  ts(rnorm(50, sd = .1), frequency = 260))[
    ,V2 := V1]
##
ts2 <- cbind(
  ts(rnorm(50,-0.1,.5), frequency=260)
  ,ts(rnorm(50,0.1,.5), frequency=260))
ts2 <- data.table(
  t(apply(ts2, 1, sort)))[
    1, c("V1", "V2") := NA]
setkeyv(ts2, c("V1","V2"))

Since foverlaps needs two columns from each of the input data.tables, we just duplicate the first column in ts1 (this is the convention, as far as I'm aware).

fts <- foverlaps(
  x = ts1, y = na.omit(ts2)
  ,type = "within")[
    ,list(Freq = .N)
    ,by = "V1,V2"]

This joins ts1 on ts2 for every occurrence of a ts1 value that falls within each of ts2's [V1, V2] intervals - and then aggregates to get a count by interval. Since it is feasible that some of ts2's intervals will contain zero ts1 values (which is the case with this sample data), you can left join the aggregate data back on the original ts2 object, and derive the corresponding proportions:

(merge(x = ts2, y = fts, all.x=TRUE)[
  is.na(Freq), Freq := 0][
    ,Inside := Freq/nrow(ts1)][
      ,Outside := 1 - Inside])[1:10,]
##
#            V1          V2 Freq Inside Outside
# 1:         NA          NA    0   0.00    1.00
# 2: -1.2545844 -0.37373731    0   0.00    1.00
# 3: -0.9266236 -0.21024328    1   0.02    0.98
# 4: -0.8743764 -0.29245223    0   0.00    1.00
# 5: -0.7339710  0.19230687   50   1.00    0.00
# 6: -0.7103589  0.13898042   50   1.00    0.00
# 7: -0.7089414 -0.26660369    0   0.00    1.00
# 8: -0.7007681  0.58032622   50   1.00    0.00
# 9: -0.6860721  0.01936587   35   0.70    0.30
# 10: -0.6573338 -0.41395304    0   0.00    1.00
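
For a case small enough to check by hand, here is a minimal sketch of the same join, count and zero-fill steps. The tables pts and ivs and their column names are hypothetical and not part of the answer above. Note that foverlaps treats intervals as closed, so a value exactly equal to a bound would count as inside.

```r
library(data.table)

# Four points, each stored as a zero-width interval [start, end].
pts <- data.table(start = c(0.5, 1.5, 2.5, 9))
pts[, end := start]

# Three intervals; foverlaps needs y keyed on its interval columns.
ivs <- data.table(lo = c(0, 1, 2), hi = c(1, 3, 2.2))
setkey(ivs, lo, hi)

# One row per (point, interval) pair where the point lies within the interval.
# Points matching no interval come back with NA in lo/hi.
hits <- foverlaps(pts, ivs,
                  by.x = c("start", "end"), by.y = c("lo", "hi"),
                  type = "within")

# Count matches per interval, ignoring the unmatched points.
counts <- hits[!is.na(lo), .(Freq = .N), by = .(lo, hi)]

# Left join back so intervals with no points are kept, then fill in zeros.
res <- merge(ivs, counts, by = c("lo", "hi"), all.x = TRUE)
res[is.na(Freq), Freq := 0L]
res[, Inside := Freq / nrow(pts)]
res[, Outside := 1 - Inside]
```

Here the interval [0, 1] holds one point, [1, 3] holds two, and [2, 2.2] holds none, so Inside comes out as 0.25, 0.5 and 0.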

#3


I think @nrussell's answer is just fine, but you can accomplish this much more simply in base R, so I'll document it here since you said you're a beginner. I've commented the code as well, to hopefully help you learn what's going on:

##  Set a seed so simulated data can be duplicated:
set.seed(2001)

##  Simulate your data to be counted:
d <- rnorm(50)

##  Simulate your ranges:
r <- rnorm(10)
r <- cbind(r - 0.1, r + 0.1)

##  Count the values of d falling inside each row of ranges.  The apply
##    function takes each row of r and compares the values of d to the
##    bounds of that range (lower in the first column, upper in the second).
##    The resulting logical vector is then summed, where each TRUE counts
##    as 1, giving the number of values in d that fall between each
##    set of bounds:
sums <- apply(r, MARGIN=1, FUN=function(x) { sum( d > x[1] & d < x[2] ) })

##  Each item of the sums vector refers to the corresponding
##      row of ranges in the r object...
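
One way to get from these counts to the proportions asked about in the question, as a sketch only: divide each count by length(d). The names props_inside, props_outside and result are mine, not from the answer, and the first lines simply repeat the simulation so the block runs on its own.

```r
# Same simulated data and ranges as above.
set.seed(2001)
d <- rnorm(50)
r <- rnorm(10)
r <- cbind(r - 0.1, r + 0.1)

# Count of d values strictly inside each row of r.
sums <- apply(r, MARGIN = 1, FUN = function(x) sum(d > x[1] & d < x[2]))

# Proportion of all d values inside and outside each range.
props_inside  <- sums / length(d)
props_outside <- 1 - props_inside

# Collect everything in one table, one row per range.
result <- data.frame(lower = r[, 1], upper = r[, 2],
                     count = sums,
                     inside = props_inside,
                     outside = props_outside)
print(result)
```

Each row of result describes one range, so every value of d is compared against every range. That differs from the row-by-row pairing in answer #1, so pick whichever matches your data.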
