I've searched a number of places (*, R-bloggers, etc.) but haven't found a good way to do this in R. Hopefully someone has some ideas.
I have a set of environmental sampling data. The data includes a variety of fields (visit date, region, location, sample medium, sample component, result, etc.).
Here's a subset of the pertinent fields. This is where I start...
visit_date  region  location  media  component  result
1990-08-20  LAKE    555723    water  Mg         *Nondetect
1999-07-01  HILL    432422    water  Ca         3.2
2010-09-12  LAKE    555723    water  pH         6.8
2010-09-12  LAKE    555723    water  Mg         2.1
2010-09-12  HILL    432423    water  pH         7.2
2010-09-12  HILL    432423    water  N          0.8
2010-09-12  HILL    432423    water  NH4        112
What I'd like to end up with is a table/data frame like this, where every row carries the pH measured on the same visit (same date, region, location and media):
visit_date  region  location  media  component  result      pH
1990-08-20  LAKE    555723    water  Mg         *Nondetect  *Not recorded
1999-07-01  HILL    432422    water  Ca         3.2         *Not recorded
2010-09-12  LAKE    555723    water  pH         6.8         6.8
2010-09-12  LAKE    555723    water  Mg         2.1         6.8
2010-09-12  HILL    432423    water  pH         7.2         7.2
2010-09-12  HILL    432423    water  N          0.8         7.2
2010-09-12  HILL    432423    water  NH4        112         7.2
I attempted to use the method from the question "R finding rows of a data frame where certain columns match those of another", but unfortunately didn't get the result I wanted. Instead, the pH column ended up as either my pre-populated value -999 or NA, rather than the pH value for that particular visit where one was collected. Since the result data set is around 500k records, I'm using unique(tResults$pH) to check which values the pH column contains.
Here's that attempt. res is the original result data.frame and component is the pH result subset (the pH sample results from the main results table).
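library(data.table)
keys <- c("region", "location", "visit_date", "media")
# key both tables on the columns that identify a visit
tResults <- data.table(res, key = keys)
tComponent <- data.table(component, key = keys)
# the attempted join (this is the step that did not give the pH values)
tResults[tComponent, pH > 0]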
I've attempted using match, merge, and within on the original data frame without success. Since then I've generated a subset for the component (pH in this example) in which I copied the result column to a new "pH" column, thinking I could match on the keys and update a new "pH" column in the main result set.
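For reference, the merge route described above can be sketched in base R roughly as follows. This is only an illustration on the seven sample rows, not the code that was actually tried: the dates are kept as plain strings instead of POSIXct for brevity, and ph and out are names made up for the example.

```r
# Sample rows from the question (dates as character strings for brevity)
res <- data.frame(
  visit_date = c("1990-08-20", "1999-07-01", "2010-09-12", "2010-09-12",
                 "2010-09-12", "2010-09-12", "2010-09-12"),
  region     = c("LAKE", "HILL", "LAKE", "LAKE", "HILL", "HILL", "HILL"),
  location   = c("555723", "432422", "555723", "555723",
                 "432423", "432423", "432423"),
  media      = "water",
  component  = c("Mg", "Ca", "pH", "Mg", "pH", "N", "NH4"),
  result     = c("*Nondetect", "3.2", "6.8", "2.1", "7.2", "0.8", "112"),
  stringsAsFactors = FALSE
)

keys <- c("region", "location", "visit_date", "media")

# pH subset: one row per visit, with result renamed to pH
ph <- res[res$component == "pH", c(keys, "result")]
names(ph)[names(ph) == "result"] <- "pH"
ph <- ph[!duplicated(ph[keys]), ]  # assumption: keep the first pH per visit

# Left join: every row of res is kept, pH is NA where no pH was sampled
out <- merge(res, ph, by = keys, all.x = TRUE)
```

Note that merge() may reorder the rows, and that visits without a pH sample get NA rather than a placeholder such as -999.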
Since not all result values are numeric (there are values like *Not recorded), I attempted to substitute numeric codes such as -888 so that I could force at least the result and pH columns to be numeric. Aside from the dates, which are POSIXct values, the remaining columns are character columns. The original data frame was created using stringsAsFactors=FALSE.
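One alternative to sentinel codes like -888 is to let the non-numeric entries become NA and keep a separate flag for them. A minimal base-R sketch, where result_num and is_flag are names invented for the example:

```r
# Result values as they appear in the sample data (all character)
result <- c("*Nondetect", "3.2", "6.8", "2.1", "7.2", "0.8", "112")

# as.numeric() returns NA (with a warning) for text it cannot parse;
# the warning is expected here, so it is suppressed
result_num <- suppressWarnings(as.numeric(result))

# TRUE where the original held a text flag rather than a number
is_flag <- is.na(result_num) & !is.na(result)
```

Keeping the original character column alongside result_num preserves the distinction between flags such as *Nondetect and values that are genuinely missing.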
Once I can do this, I'll be able to generate similar columns for other components that can be used to populate and calculate other values for a given sample. At least that's my goal.
So I'm stumped on this one. In my mind it should be easy but I'm certainly NOT seeing it!
Your help and ideas are certainly welcome and appreciated!
1 Answer
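library(data.table)
library(zoo)  # provides na.locf()
# df1 is your first data set, as a data.frame
# keep the result only on pH rows, NA everywhere else
df1$phtem <- with(df1, ifelse(component == "pH", result, NA))
# carry the last pH value forward; this relies on each visit's pH row coming first
setDT(df1)[, pH := na.locf(phtem, na.rm = FALSE)]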
   visit_date region location media component     result phtem   pH
1: 1990-08-20   LAKE   555723 water        Mg *Nondetect    NA   NA
2: 1999-07-01   HILL   432422 water        Ca        3.2    NA   NA
3: 2010-09-12   LAKE   555723 water        pH        6.8   6.8  6.8
4: 2010-09-12   LAKE   555723 water        Mg        2.1    NA  6.8
5: 2010-09-12   HILL   432423 water        pH        7.2   7.2  7.2
6: 2010-09-12   HILL   432423 water         N        0.8    NA  7.2
7: 2010-09-12   HILL   432423 water       NH4        112    NA  7.2
# You can delete phtem if you don't need it: df1[, phtem := NULL]
Edit:
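library(data.table)
# assign the visit's pH result to every row of the same visit;
# [1L] takes the first pH value and yields NA when a visit has none
setDT(df1)[, pH := result[component == "pH"][1L],
           by = c("region", "location", "visit_date", "media")]
df1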
   visit_date region location media component     result   pH
1: 1990-08-20   LAKE   555723 water        Mg *Nondetect   NA
2: 1999-07-01   HILL   432422 water        Ca        3.2   NA
3: 2010-09-12   LAKE   555723 water        pH        6.8  6.8
4: 2010-09-12   LAKE   555723 water        Mg        2.1  6.8
5: 2010-09-12   HILL   432423 water        pH        7.2  7.2
6: 2010-09-12   HILL   432423 water         N        0.8  7.2
7: 2010-09-12   HILL   432423 water       NH4        112  7.2
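The question also mentions wanting similar columns for other components. One way that might be done is a data.table update join run once per component. The sketch below assumes the four key columns identify a visit, keeps the first value when a component was sampled more than once on a visit, and uses Date rather than POSIXct for brevity; comp and sub are names invented for the example.

```r
library(data.table)

# Sample rows from the question
df1 <- data.table(
  visit_date = as.Date(c("1990-08-20", "1999-07-01", "2010-09-12", "2010-09-12",
                         "2010-09-12", "2010-09-12", "2010-09-12")),
  region     = c("LAKE", "HILL", "LAKE", "LAKE", "HILL", "HILL", "HILL"),
  location   = c("555723", "432422", "555723", "555723",
                 "432423", "432423", "432423"),
  media      = "water",
  component  = c("Mg", "Ca", "pH", "Mg", "pH", "N", "NH4"),
  result     = c("*Nondetect", "3.2", "6.8", "2.1", "7.2", "0.8", "112")
)

keys <- c("region", "location", "visit_date", "media")

for (comp in c("pH", "Mg")) {
  # one row per visit holding this component's result
  sub <- df1[component == comp, c(keys, "result"), with = FALSE]
  sub <- unique(sub, by = keys)
  # update join: add a column named after the component, by reference;
  # visits with no matching row are left as NA
  df1[sub, on = keys, (comp) := i.result]
}

df1
```

Because the join updates df1 by reference, the original row order is preserved and no copy of the 500k-row table is made.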