Here is some sample data for which I want to encode the gender of the names over time:
以下是一些示例数据,我想对其中的名称性别进行编码:
names_to_encode <- structure(list(names = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("names", "year"), row.names = c(NA, -6L), class = "data.frame")
Here is a minimal set of the Social Security data, limited to just those names from 1890 and 1990:
这是社会保障数据的最小集合,仅限于1890年和1990年的名称:
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
I've defined a function which subsets the Social Security data given a year or range of years. In other words, it calculates whether a name was male or female over a given time period by figuring out the proportion of male and female births with that name. Here is the function along with a helper function:
我已经定义了一个函数,该函数在给定一年或一年的范围内对社会保障数据进行子集。换句话说,它通过计算具有该名称的男性和女性出生比例来计算在给定时间段内姓名是男性还是女性。这是函数以及辅助函数:
require(plyr)
require(dplyr)
select_ssa <- function(years) {
# If we get only one year (1890) convert it to a range of years (1890-1890)
if (length(years) == 1) years <- c(years, years)
# Calculate the male and female proportions for the given range of years
ssa_select <- ssa_demo %.%
filter(year >= years[1], year <= years[2]) %.%
group_by(name) %.%
summarise(female = sum(female),
male = sum(male)) %.%
mutate(proportion_male = round((male / (male + female)), digits = 4),
proportion_female = round((female / (male + female)), digits = 4)) %.%
mutate(gender = sapply(proportion_female, male_or_female))
return(ssa_select)
}
# Helper function to determine whether a name is male or female in a given year
male_or_female <- function(proportion_female) {
if (proportion_female > 0.5) {
return("female")
} else if(proportion_female == 0.5000) {
return("either")
} else {
return("male")
}
}
Now what I want to do is use plyr, specifically ddply
, to subset the data to be encoded by year, and merge each of those pieces with the value returned by the select_ssa
function. This is the code I have.
现在我想要做的是使用plyr,特别是ddply来对要按年编码的数据进行子集化,并将每个部分与select_ssa函数返回的值合并。这是我的代码。
ddply(names_to_encode, .(year), merge, y = select_ssa(year), by.x = "names", by.y = "name", all.x = TRUE)
When calling select_ssa(year)
, this command works just fine if I hard code a value like 1890
as the argument to the function. But when I try to pass it the current value for year
that ddply
is working with, I get an error message:
当调用select_ssa(year)时,如果我将类似1890的值硬编码为函数的参数,则此命令可以正常工作。但是当我尝试传递ddply正在使用的当前值时,我收到一条错误消息:
Error in filter_impl(.data, dots(...), environment()) :
(list) object cannot be coerced to type 'integer'
How can I pass the current value of year
on to ddply
?
如何将当前的年份值传递给ddply?
1 个解决方案
#1
1
I think you're making things too complicated by trying to do a join inside ddply
. If I were to use dplyr
I would probably do something more like this:
我认为你试图在ddply中进行连接会使事情变得太复杂。如果我使用dplyr,我可能会做更像这样的事情:
names_to_encode <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("name", "year"), row.names = c(NA, -6L), class = "data.frame")
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
names_to_encode$name <- as.character(names_to_encode$name)
names_to_encode$year <- as.integer(names_to_encode$year)
tmp <- left_join(ssa_demo,names_to_encode) %.%
group_by(year,name) %.%
summarise(female = sum(female),
male = sum(male)) %.%
mutate(proportion_male = round((male / (male + female)), digits = 4),
proportion_female = round((female / (male + female)), digits = 4)) %.%
mutate(gender = ifelse(proportion_female == 0.5,"either",
ifelse(proportion_female > 0.5,"female","male")))
Note that 0.1.1 is still a little finicky about the types of join columns, so I had to convert them. I think I saw some activity on github that suggested that was either fixed in the dev version, or at least something they're working on.
请注意,0.1.1对于连接列的类型仍然有点挑剔,所以我不得不转换它们。我想我在github上看到了一些活动,这些活动建议在dev版本中修复,或者至少是他们正在研究的东西。
#1
1
I think you're making things too complicated by trying to do a join inside ddply
. If I were to use dplyr
I would probably do something more like this:
我认为你试图在ddply中进行连接会使事情变得太复杂。如果我使用dplyr,我可能会做更像这样的事情:
names_to_encode <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("name", "year"), row.names = c(NA, -6L), class = "data.frame")
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
names_to_encode$name <- as.character(names_to_encode$name)
names_to_encode$year <- as.integer(names_to_encode$year)
tmp <- left_join(ssa_demo,names_to_encode) %.%
group_by(year,name) %.%
summarise(female = sum(female),
male = sum(male)) %.%
mutate(proportion_male = round((male / (male + female)), digits = 4),
proportion_female = round((female / (male + female)), digits = 4)) %.%
mutate(gender = ifelse(proportion_female == 0.5,"either",
ifelse(proportion_female > 0.5,"female","male")))
Note that 0.1.1 is still a little finicky about the types of join columns, so I had to convert them. I think I saw some activity on github that suggested that was either fixed in the dev version, or at least something they're working on.
请注意,0.1.1对于连接列的类型仍然有点挑剔,所以我不得不转换它们。我想我在github上看到了一些活动,这些活动建议在dev版本中修复,或者至少是他们正在研究的东西。