I have one file (location) that has x,y coordinates and a date/time stamp. I want to pull information from a second table (weather) that has a "similar" date/time variable plus covariates (temperature and wind speed). The catch is that the date/times are not exactly the same in both tables. I want to select the weather data that is closest to the location data. I know I need to do some loops, and that's about it.
Example location:
x  y  date/time
1  3  01/02/2003 18:00
2  3  01/02/2003 19:00
3  4  01/03/2003 23:00
2  5  01/04/2003 02:00

Example weather:
date/time         temp  wind
01/01/2003 13:00  12    15
01/02/2003 16:34  10    16
01/02/2003 20:55  14    22
01/02/2003 21:33  14    22
01/03/2003 00:22  13    19
01/03/2003 14:55  12    12
01/03/2003 18:00  10    12
01/03/2003 23:44  2     33
01/04/2003 01:55  6     22
So the final output would be a table in which each location record is paired with its "best" matching weather record:
x y datetime datetime temp wind
1 3 01/02/2003 18:00 ---- 01/02/2003 16:34 10 16
2 3 01/02/2003 19:00 ---- 01/02/2003 20:55 14 22
3 4 01/03/2003 23:00 ---- 01/03/2003 00:22 13 19
2 5 01/04/2003 02:00 ---- 01/04/2003 01:55 6 22
Any suggestions on where to start? I am trying to do this in R.
2 solutions
#1
5
I needed to bring that data in with date and time as separate columns, and then paste and format them:
location$dt.time <- as.POSIXct(paste(location$date, location$time),
format="%m/%d/%Y %H:%M")
And the same for weather
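For anyone who wants to run this end to end, here is a minimal sketch of the setup. It assumes the question's sample data was read in with `date` and `time` as separate character columns, as described above. The `tz = "UTC"` argument is my own addition so the result does not depend on the local time zone.

```r
# Sample data from the question, with date and time kept as separate columns
location <- data.frame(
  x    = c(1, 2, 3, 2),
  y    = c(3, 3, 4, 5),
  date = c("01/02/2003", "01/02/2003", "01/03/2003", "01/04/2003"),
  time = c("18:00", "19:00", "23:00", "02:00"),
  stringsAsFactors = FALSE
)
weather <- data.frame(
  date = c("01/01/2003", "01/02/2003", "01/02/2003", "01/02/2003",
           "01/03/2003", "01/03/2003", "01/03/2003", "01/03/2003",
           "01/04/2003"),
  time = c("13:00", "16:34", "20:55", "21:33", "00:22",
           "14:55", "18:00", "23:44", "01:55"),
  temp = c(12, 10, 14, 14, 13, 12, 10, 2, 6),
  wind = c(15, 16, 22, 22, 19, 12, 12, 33, 22),
  stringsAsFactors = FALSE
)

# Paste date and time together, then parse as month/day/year hour:minute.
# tz = "UTC" is an assumption; use the zone your data was recorded in.
location$dt.time <- as.POSIXct(paste(location$date, location$time),
                               format = "%m/%d/%Y %H:%M", tz = "UTC")
weather$dt.time  <- as.POSIXct(paste(weather$date, weather$time),
                               format = "%m/%d/%Y %H:%M", tz = "UTC")
```

A parse that fails returns NA instead of raising an error, so it is worth checking `anyNA(weather$dt.time)` before going further.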
Then, for each value of dt.time in location, find the entry in weather that has the lowest absolute value for the time difference:
sapply(location$dt.time, function(x) which.min(abs(difftime(x, weather$dt.time))))
# [1] 2 3 8 9
cbind(location, weather[ sapply(location$dt.time,
function(x) which.min(abs(difftime(x, weather$dt.time)))), ])
x y date time dt.time date time temp wind dt.time
2 1 3 01/02/2003 18:00 2003-01-02 18:00:00 01/02/2003 16:34 10 16 2003-01-02 16:34:00
3 2 3 01/02/2003 19:00 2003-01-02 19:00:00 01/02/2003 20:55 14 22 2003-01-02 20:55:00
8 3 4 01/03/2003 23:00 2003-01-03 23:00:00 01/03/2003 23:44 2 33 2003-01-03 23:44:00
9 2 5 01/04/2003 02:00 2003-01-04 02:00:00 01/04/2003 01:55 6 22 2003-01-04 01:55:00
cbind(location, weather[
sapply(location$dt.time,
function(x) which.min(abs(difftime(x, weather$dt.time)))), ])[ #pick columns
c(1,2,5,8,9,10)]
x y dt.time temp wind dt.time.1
2 1 3 2003-01-02 18:00:00 10 16 2003-01-02 16:34:00
3 2 3 2003-01-02 19:00:00 14 22 2003-01-02 20:55:00
8 3 4 2003-01-03 23:00:00 2 33 2003-01-03 23:44:00
9 2 5 2003-01-04 02:00:00 6 22 2003-01-04 01:55:00
My results differ a bit from yours (the third row matches 01/03/2003 23:44 rather than 00:22), but another reader has already questioned your ability to do the matching properly by hand.
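Picking columns by position, as in `c(1,2,5,8,9,10)`, breaks as soon as a column is added or reordered. Below is a sketch of the same nearest-time match that builds the result by name instead. The helper `parse_dt` and the column names `loc.time` and `weather.time` are my own for illustration, and the time zone is assumed to be UTC.

```r
# Hypothetical helper: parse the question's month/day/year strings (UTC assumed)
parse_dt <- function(s) as.POSIXct(s, format = "%m/%d/%Y %H:%M", tz = "UTC")

location <- data.frame(
  x = c(1, 2, 3, 2),
  y = c(3, 3, 4, 5),
  dt.time = parse_dt(c("01/02/2003 18:00", "01/02/2003 19:00",
                       "01/03/2003 23:00", "01/04/2003 02:00"))
)
weather <- data.frame(
  dt.time = parse_dt(c("01/01/2003 13:00", "01/02/2003 16:34",
                       "01/02/2003 20:55", "01/02/2003 21:33",
                       "01/03/2003 00:22", "01/03/2003 14:55",
                       "01/03/2003 18:00", "01/03/2003 23:44",
                       "01/04/2003 01:55")),
  temp = c(12, 10, 14, 14, 13, 12, 10, 2, 6),
  wind = c(15, 16, 22, 22, 19, 12, 12, 33, 22)
)

# For each location row, index of the weather row with the smallest
# absolute time gap (in seconds)
nearest <- vapply(seq_len(nrow(location)), function(i) {
  gaps <- abs(as.numeric(difftime(location$dt.time[i], weather$dt.time,
                                  units = "secs")))
  which.min(gaps)
}, integer(1))

# Assemble the output by column name rather than by position
matched <- data.frame(
  x            = location$x,
  y            = location$y,
  loc.time     = location$dt.time,
  weather.time = weather$dt.time[nearest],
  temp         = weather$temp[nearest],
  wind         = weather$wind[nearest]
)
print(matched)
```

Fixing `units = "secs"` matters because `difftime` otherwise picks its units automatically, which is easy to misread when you inspect the gaps.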
#2
5
One fast and short way may be to use data.table. If you create two data.tables, X and Y, both with keys, then the syntax is:
X[Y,roll=TRUE]
We call that a rolling join because we roll the prevailing observation in X forward to match the row in Y. See the examples in ?data.table and the introduction vignette.
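As a rough sketch of what that looks like on the question's data, with weather playing the role of X and location the role of Y. It assumes the data.table package is installed; the helper `parse_dt` and the UTC time zone are my own choices for illustration.

```r
library(data.table)

# Hypothetical helper: parse the question's month/day/year strings (UTC assumed)
parse_dt <- function(s) as.POSIXct(s, format = "%m/%d/%Y %H:%M", tz = "UTC")

weather <- data.table(
  dt.time = parse_dt(c("01/01/2003 13:00", "01/02/2003 16:34",
                       "01/02/2003 20:55", "01/02/2003 21:33",
                       "01/03/2003 00:22", "01/03/2003 14:55",
                       "01/03/2003 18:00", "01/03/2003 23:44",
                       "01/04/2003 01:55")),
  temp = c(12, 10, 14, 14, 13, 12, 10, 2, 6),
  wind = c(15, 16, 22, 22, 19, 12, 12, 33, 22)
)
location <- data.table(
  x = c(1, 2, 3, 2),
  y = c(3, 3, 4, 5),
  dt.time = parse_dt(c("01/02/2003 18:00", "01/02/2003 19:00",
                       "01/03/2003 23:00", "01/04/2003 02:00"))
)

# Both tables are keyed on the time column
setkey(weather, dt.time)
setkey(location, dt.time)

# X[Y, roll = TRUE]: for each location time, take the most recent
# weather row at or before it (the prevailing observation)
prevailing <- weather[location, roll = TRUE]
print(prevailing)
```

In the result, `dt.time` holds the location times. Note that this gives the last reading at or before each location time, not necessarily the nearest one: the 01/03/2003 23:00 row picks up the 18:00 reading even though 23:44 is closer.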
Another way to do this is the zoo package, which has na.locf (last observation carried forward), and there are possibly other packages too.
I'm not sure if you mean closest in terms of location, or time. If location, and that location is x,y coordinates then you will need some distance measure in 2D space I guess. data.table only does univariate 'closest' e.g. by time. Reading your question for a 2nd time it does seem you mean closest in the prevailing sense though.
EDIT: Seen the example data now. data.table won't do this in one step because although it can roll forwards or backwards, it won't roll to the nearest. You could do it with an extra step using which=TRUE and then test whether the one after the prevailing was actually closer.
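The two-step idea can be sketched without data.table at all. The version below swaps in base R's `findInterval` for the `which=TRUE` lookup: it finds the prevailing weather row for each location time, then checks whether the row after it is actually closer. It assumes the weather times are sorted ascending; `parse_dt` and the UTC time zone are my own choices for illustration.

```r
# Hypothetical helper: parse the question's month/day/year strings (UTC assumed)
parse_dt <- function(s) as.POSIXct(s, format = "%m/%d/%Y %H:%M", tz = "UTC")

weather_time <- parse_dt(c("01/01/2003 13:00", "01/02/2003 16:34",
                           "01/02/2003 20:55", "01/02/2003 21:33",
                           "01/03/2003 00:22", "01/03/2003 14:55",
                           "01/03/2003 18:00", "01/03/2003 23:44",
                           "01/04/2003 01:55"))
location_time <- parse_dt(c("01/02/2003 18:00", "01/02/2003 19:00",
                            "01/03/2003 23:00", "01/04/2003 02:00"))

wt <- as.numeric(weather_time)   # seconds since epoch, must be sorted
lt <- as.numeric(location_time)

# Step 1: index of the prevailing row (last weather time <= location time).
# findInterval returns 0 when the location time precedes all weather times.
prev <- findInterval(lt, wt)
prev_clamped <- pmax(prev, 1L)

# Step 2: the candidate just after the prevailing row, clamped to the last row
nxt <- pmin(prev + 1L, length(wt))

# Keep the later row only when it is strictly closer; ties go to the earlier row
use_next <- abs(wt[nxt] - lt) < abs(lt - wt[prev_clamped])
idx <- ifelse(use_next, nxt, prev_clamped)
print(idx)
```

This does one binary search per location row instead of scanning every weather row, which should help when the weather table is large. Later data.table releases also document `roll="nearest"`, so it is worth checking `?data.table` for your installed version before writing the extra step by hand.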