I started a "free" open-sourced project to create a new data set for pH of the earth oceans.
我开始了一个“免费”的开源项目,为地球海洋的pH值创建一个新的数据集。
I started from the open data set from NOAA and created a 2.45-million-row data set with these columns:
colnames(NOAA_NODC_OSD_SUR_pH_7to9)
[1] "Year" "Month" "Day" "Hour" "Lat" "Long" "Depth" "pH"
Method document HERE.
Data-set HERE.
My goal now is to "qualify" each row (2.45m)... to do so, I need to calculate the distance from each point of Lat/Long to the nearest shore.
我现在的目标是“限定”每一行(2.45米)......为此,我需要计算从纬度/长度的每个点到最近岸的距离。
So I am looking for a method that would take In: Lat/Long and give Out: distance from shore (km).
With this, I can qualify whether a data point could be affected by shore contamination, such as effluent from a nearby city.
I have searched for a method to do this, but everything I found seems to need packages/software that I don't have.
If someone would be willing to help out, I would appreciate it. Or if you know of an easy (free) method to accomplish this, please let me know...
I can work in R programming and shell scripts, but I'm not an expert in either....
1 solution
#1
So there are several things going on here. First, your dataset seems to have pH vs. depth. So while there are ~2.5MM rows, there are only ~200,000 rows with Depth == 0 - still a lot.
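As a quick sanity check (a sketch, assuming the same CSV file read in the code below), you can count the surface rows directly:

df <- read.csv("OSD_All.csv")  # same file read in the code below
nrow(df)                       # ~2.45 million rows in total
sum(df$Depth == 0)             # ~200,000 surface (Depth == 0) rows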
Second, to get the distance to the nearest coast you need a shapefile of coastlines. Fortunately this is available here, at the excellent Natural Earth website.
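If you would rather script the download, a minimal sketch is below; the exact URL is an assumption and may need adjusting to Natural Earth's current layout.

# sketch: fetch and unpack the 1:10m coastline shapefile into the working directory
url <- "https://naturalearth.s3.amazonaws.com/10m_physical/ne_10m_coastline.zip"  # assumed URL
download.file(url, "ne_10m_coastline.zip", mode = "wb")
unzip("ne_10m_coastline.zip")  # produces ne_10m_coastline.shp / .shx / .dbf / .prj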
Third, your data is in long/lat (so, units = degrees), but you want distance in km, so you need to transform your data (the coastline data above is also in long/lat and also needs to be transformed). One problem with transformations is that your data is evidently global, and any global transformation will necessarily be non-planar, so the accuracy will depend on the actual location. The right way to do this is to grid your data and then use a set of planar transformations appropriate to whichever grid cell your points are in. This is beyond the scope of this question, though, so we'll use a global transformation (Mollweide) just to give you an idea of how it's done in R.
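As an aside, a minimal sketch of that per-point idea (purely illustrative, not used below) is to derive a UTM zone from each point's longitude and build the corresponding projection string:

# illustrative only: pick a UTM zone from a point's longitude (degrees east)
lon <- -70.5                                        # hypothetical example longitude
utm.zone <- as.integer(floor((lon + 180) / 6) + 1)  # zones 1..60, each 6 degrees wide
utm.crs <- sprintf("+proj=utm +zone=%d +datum=WGS84 +units=m +no_defs", utm.zone)

With that caveat noted, here is the global Mollweide version: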
library(rgdal) # for readOGR(...); loads package sp as well
library(rgeos) # for gDistance(...)
setwd(" < directory with all your files > ")
# WGS84 long/lat
wgs.84 <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
# ESRI:54009 world mollweide projection, units = meters
# see http://www.spatialreference.org/ref/esri/54009/
mollweide <- "+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"
df <- read.csv("OSD_All.csv")
sp.points <- SpatialPoints(df[df$Depth==0,c("Long","Lat")], proj4string=CRS(wgs.84))
coast <- readOGR(dsn=".",layer="ne_10m_coastline",p4s=wgs.84)
coast.moll <- spTransform(coast,CRS(mollweide))
point.moll <- spTransform(sp.points,CRS(mollweide))
set.seed(1) # for reproducible example
test <- sample(1:length(sp.points),10) # random sample of ten points
result <- sapply(test,function(i)gDistance(point.moll[i],coast.moll))
result/1000 # distance in km
# [1] 0.2185196 5.7132447 0.5302977 28.3381043 243.5410571 169.8712255 0.4182755 57.1516195 266.0498881 360.6789699
plot(coast)
points(sp.points[test],pch=20,col="red")
So this reads your dataset, extracts the rows where Depth == 0, and converts that to a SpatialPoints object. Then we read the coastlines database downloaded from the link above into a SpatialLines object, transform both to the Mollweide projection using spTransform(...), and finally use gDistance(...) in the rgeos package to calculate the minimum distance between each point and the nearest coast.
Again, it is important to remember that despite all the decimal places, these distances are just approximate.
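If you want a feel for the size of the error, one option (a sketch, assuming the geosphere package is installed) is to spot-check a few of the sampled points with an ellipsoidal calculation on the untransformed long/lat data and compare:

library(geosphere)  # for dist2Line(...)
# ellipsoidal distance (in meters) from a few of the sampled points to the nearest coastline segment
check <- dist2Line(sp.points[test[1:3]], coast)
check[, "distance"] / 1000  # km; compare against result[1:3]/1000 above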
One very big problem is speed: this process takes ~2 min per 1000 distances (on my system), so running all 200,000 distances would take about 6.7 hours. One option, theoretically, would be to find a coastline database with a lower resolution.
The code below will calculate all 201,000 distances.
## not run
## estimated run time ~ 7 hours
## use the projected layers so the result is in meters
result <- sapply(1:length(point.moll), function(i) gDistance(point.moll[i], coast.moll))
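Once you have the full result vector, a sketch of how you might attach the distances to the surface rows and save them (the output file name is just a placeholder):

surface <- df[df$Depth == 0, ]            # same rows used to build sp.points
surface$Dist.to.coast.km <- result / 1000 # distances in km
write.csv(surface, "OSD_SUR_with_dist.csv", row.names = FALSE)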
EDIT: OP's comment about the cores got me thinking that this could be an instance where the improvement from parallelization might be worth the effort. So here is how you would run this (on Windows) using parallel processing.
library(foreach) # for foreach(...)
library(snow) # for makeCluster(...)
library(doSNOW) # for registerDoSNOW(...)
cl <- makeCluster(4,type="SOCK") # create a 4-processor cluster
registerDoSNOW(cl) # register the cluster
get.dist.parallel <- function(n) {
foreach(i=1:n, .combine=c, .packages="rgeos", .inorder=TRUE,
.export=c("point.moll","coast.moll")) %dopar% gDistance(point.moll[i],coast.moll)
}
get.dist.seq <- function(n) sapply(1:n,function(i)gDistance(point.moll[i],coast.moll))
identical(get.dist.seq(10),get.dist.parallel(10)) # same result?
# [1] TRUE
library(microbenchmark) # run "benchmark"
microbenchmark(get.dist.seq(1000),get.dist.parallel(1000),times=1)
# Unit: seconds
# expr min lq mean median uq max neval
# get.dist.seq(1000) 140.19895 140.19895 140.19895 140.19895 140.19895 140.19895 1
# get.dist.parallel(1000) 50.71218 50.71218 50.71218 50.71218 50.71218 50.71218 1
Using 4 cores improves processing speed by about a factor of 3. So at roughly 50 seconds per 1000 distances, the full ~200,000 distances should take a little under 3 hours.
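To run the full set in parallel and then release the workers, the call would look something like this (not run):

## not run
result <- get.dist.parallel(length(point.moll)) # all surface points, in parallel
stopCluster(cl)                                 # shut down the worker processes when finished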
Note that using times=1 is really an abuse of microbenchmark(...), as the whole point is to run the process multiple times and average the results, but I just didn't have the patience.