如何从数据帧中删除部分副本?

时间:2022-05-02 07:40:12

Data I'm importing describes numeric measurements taken at various locations for more or less evenly spread timestamps. sometimes this "evenly spread" is not really true and I have to discard some of the values, it's not that important which one, as long as I have one value for each timestamp for each location.

我正在导入的数据描述了在不同位置进行的数值测量,以获得或多或少均匀分布的时间戳。有时这种“均匀分布”不是真的,我必须放弃一些值,哪个值并不重要,只要每个位置的时间戳都有一个值。

what I do with the data? I add it to a result data.frame. There I have a timestamp column and the values in the timestamp column, they are definitely evenly spaced according to the step.

我怎么处理这些数据?我将它添加到结果数据。这里我有一个时间戳列和时间戳列中的值,它们之间的间隔显然是均匀的。

timestamps <- ceiling(as.numeric((timestamps-epoch)*24*60/step))*step*60 + epoch
result[result$timestamp %in% timestamps, columnName] <- values

This does NOT work when I have timestamps that fall in the same time step. This is an example:

当我有属于同一时间步骤的时间戳时,这将不起作用。这是一个例子:

> data.frame(ts=timestamps, v=values)
                   ts         v
1 2009-09-30 10:00:00 -2.081609
2 2009-09-30 10:04:18 -2.079778
3 2009-09-30 10:07:47 -2.113531
4 2009-09-30 10:09:01 -2.124716
5 2009-09-30 10:15:00 -2.102117
6 2009-09-30 10:27:56 -2.093542
7 2009-09-30 10:30:00 -2.092626
8 2009-09-30 10:45:00 -2.086339
9 2009-09-30 11:00:00 -2.080144
> data.frame(ts=ceiling(as.numeric((timestamps-epoch)*24*60/step))*step*60+epoch,
+ v=values)
                   ts         v
1 2009-09-30 10:00:00 -2.081609
2 2009-09-30 10:15:00 -2.079778
3 2009-09-30 10:15:00 -2.113531
4 2009-09-30 10:15:00 -2.124716
5 2009-09-30 10:15:00 -2.102117
6 2009-09-30 10:30:00 -2.093542
7 2009-09-30 10:30:00 -2.092626
8 2009-09-30 10:45:00 -2.086339
9 2009-09-30 11:00:00 -2.080144

in Python I would (mis)use a dictionary to achieve what I need:

在Python中,我将使用字典来实现我所需要的:

dict(zip(timestamps, values)).items()

returns a list of pairs where the first coordinate is unique.

返回第一个坐标为惟一的一对的列表。

in R I don't know how to do it in a compact and efficient way.

在R中,我不知道如何以一种紧凑而高效的方式去做。

2 个解决方案

#1


19  

I would use subset combined with duplicated to filter non-unique timestamps in the second data frame:

在第二个数据框中,我将使用与duplicate组合的子集来过滤非唯一的时间戳:

R> df_ <- read.table(textConnection('
                     ts         v
1 "2009-09-30 10:00:00" -2.081609
2 "2009-09-30 10:15:00" -2.079778
3 "2009-09-30 10:15:00" -2.113531
4 "2009-09-30 10:15:00" -2.124716
5 "2009-09-30 10:15:00" -2.102117
6 "2009-09-30 10:30:00" -2.093542
7 "2009-09-30 10:30:00" -2.092626
8 "2009-09-30 10:45:00" -2.086339
9 "2009-09-30 11:00:00" -2.080144
'), as.is=TRUE, header=TRUE)

R> subset(df_, !duplicated(ts))
                   ts      v
1 2009-09-30 10:00:00 -2.082
2 2009-09-30 10:15:00 -2.080
6 2009-09-30 10:30:00 -2.094
8 2009-09-30 10:45:00 -2.086
9 2009-09-30 11:00:00 -2.080

Update: To select a specific value you can use aggregate

更新:要选择可以使用聚合的特定值

aggregate(df_$v, by=list(df_$ts), function(x) x[1])  # first value
aggregate(df_$v, by=list(df_$ts), function(x) tail(x, n=1))  # last value
aggregate(df_$v, by=list(df_$ts), function(x) max(x))  # max value

#2


6  

I think you are looking at data structures for time-indexed objects, and not for a dictionary. For the former, look at the zoo and xts packages which offer much better time-pased subsetting:

我认为您是在研究时间索引对象的数据结构,而不是字典。对于前者,看看动物园和xts包,它们提供了更好的时间压缩设置:

R> library(xts)
R> X <- xts(data.frame(val=rnorm(10)), \
            order.by=Sys.time() + sort(runif(10,10,300)))
R> X
                        val
2009-11-20 07:06:17 -1.5564
2009-11-20 07:06:40 -0.2960
2009-11-20 07:07:50 -0.4123
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47  0.4550
2009-11-20 07:09:57  0.9598
2009-11-20 07:10:11  1.0018
2009-11-20 07:10:12  1.0747
2009-11-20 07:10:58  0.7062
R> X["2009-11-20 07:08::2009-11-20 07:09"]
                        val
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47  0.4550
2009-11-20 07:09:57  0.9598
R> 

The X object is ordered by a time sequence -- make sure it is of type POSIXct so you may need to parse your dates first. Then we can just index for '7:08 to 7:09 on the give day'.

X对象是按时间序列排序的——请确保它是POSIXct类型,因此您可能需要首先解析您的日期。然后我们可以在给定的日子里为7:08到7:09做索引。

#1


19  

I would use subset combined with duplicated to filter non-unique timestamps in the second data frame:

在第二个数据框中,我将使用与duplicate组合的子集来过滤非唯一的时间戳:

R> df_ <- read.table(textConnection('
                     ts         v
1 "2009-09-30 10:00:00" -2.081609
2 "2009-09-30 10:15:00" -2.079778
3 "2009-09-30 10:15:00" -2.113531
4 "2009-09-30 10:15:00" -2.124716
5 "2009-09-30 10:15:00" -2.102117
6 "2009-09-30 10:30:00" -2.093542
7 "2009-09-30 10:30:00" -2.092626
8 "2009-09-30 10:45:00" -2.086339
9 "2009-09-30 11:00:00" -2.080144
'), as.is=TRUE, header=TRUE)

R> subset(df_, !duplicated(ts))
                   ts      v
1 2009-09-30 10:00:00 -2.082
2 2009-09-30 10:15:00 -2.080
6 2009-09-30 10:30:00 -2.094
8 2009-09-30 10:45:00 -2.086
9 2009-09-30 11:00:00 -2.080

Update: To select a specific value you can use aggregate

更新:要选择可以使用聚合的特定值

aggregate(df_$v, by=list(df_$ts), function(x) x[1])  # first value
aggregate(df_$v, by=list(df_$ts), function(x) tail(x, n=1))  # last value
aggregate(df_$v, by=list(df_$ts), function(x) max(x))  # max value

#2


6  

I think you are looking at data structures for time-indexed objects, and not for a dictionary. For the former, look at the zoo and xts packages which offer much better time-pased subsetting:

我认为您是在研究时间索引对象的数据结构,而不是字典。对于前者,看看动物园和xts包,它们提供了更好的时间压缩设置:

R> library(xts)
R> X <- xts(data.frame(val=rnorm(10)), \
            order.by=Sys.time() + sort(runif(10,10,300)))
R> X
                        val
2009-11-20 07:06:17 -1.5564
2009-11-20 07:06:40 -0.2960
2009-11-20 07:07:50 -0.4123
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47  0.4550
2009-11-20 07:09:57  0.9598
2009-11-20 07:10:11  1.0018
2009-11-20 07:10:12  1.0747
2009-11-20 07:10:58  0.7062
R> X["2009-11-20 07:08::2009-11-20 07:09"]
                        val
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47  0.4550
2009-11-20 07:09:57  0.9598
R> 

The X object is ordered by a time sequence -- make sure it is of type POSIXct so you may need to parse your dates first. Then we can just index for '7:08 to 7:09 on the give day'.

X对象是按时间序列排序的——请确保它是POSIXct类型,因此您可能需要首先解析您的日期。然后我们可以在给定的日子里为7:08到7:09做索引。