How to aggregate a factor at each level of another factor, grouping by two other factors, in categorical data

Time: 2022-09-21 14:07:12

Say that there is descriptive data on candidates across election years, districts (or states), and parties. The data are currently disaggregated at the 'sub-district' level (say, voting precincts).

Currently, when I try to aggregate the data to the district level, the methods I have tried return inaccurate counts. The aggregation does not account for candidates appearing in the data multiple times per year and per district, once for each precinct. What I need is a count of the number of times a particular party appears in a particular district, regardless of the repeated information at the precinct level. In other words, I need a result that shows the party count for each district-year dyad, counting each unique candidate-year dyad once. (Note: candidates may be repeated across election years and/or districts, but may have different parties; see Henry Clay in 1836 and 1840.)

My question is: How do I aggregate data to obtain a count of a factor (party) at each level of another factor (district) by grouping two other factors (year and candidate-name [ID])?

Sample of Data Structure:

year<-rbind("1836", "1836", "1836", "1836", 
            "1840", "1840", "1840", "1840", 
            "1844", "1844", "1844", "1844", 
            "1848", "1848", "1848", "1848")

candidate<-rbind("Henry Clay", "Henry Clay", 
                 "Daniel Webster", 
                 "Daniel Webster", "Henry Clay", 
                 "Henry Clay", "Daniel Webster", 
                 "Daniel Webster", 
                 "Millard Fillmore", 
                 "Millard Fillmore", 
                 "Martin Van Buren", 
                 "Martin Van Buren", 
                 "Millard Fillmore", 
                 "Millard Fillmore", 
                 "Martin Van Buren", 
                 "Martin Van Buren")

party<-rbind("Democratic-Republican", 
             "Democratic-Republican", "Whig", 
             "Whig", "National Republican", 
             "National Republican", "Whig", 
             "Whig", "Know-Nothing", 
             "Know-Nothing", "Democrat", 
             "Democrat", "Know-Nothing", 
             "Know-Nothing", "Democrat", 
             "Democrat")

district<-rbind("Alaska", "Alaska", "Vermont", 
                "Vermont", "Alaska", "Alaska", 
                "Vermont", "Vermont", "Alaska", 
                "Alaska", "Vermont", "Vermont", 
                "Alaska", "Alaska", "Vermont", 
                "Vermont")

precinct<-rbind("Pre1", "Pre2", "Pre1", "Pre2", 
                "Pre1", "Pre2", "Pre1", "Pre2", 
                "Pre1", "Pre2", "Pre1", "Pre2", 
                "Pre1", "Pre2", "Pre1", "Pre2")

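# The columns are auto-named V1..V5:
# V1 = year, V2 = candidate, V3 = party, V4 = district, V5 = precinct
sample<-as.data.frame(cbind(year, candidate, party, district, 
              precinct))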

Examples of Different Methods of Aggregating Data:

table

party.counts1<-data.frame(table(sample$V3, sample$V1, sample$V4))

aggregate:

Attempt 2a is close to the final result needed, but it returns counts that do not specify the factor level (party), and it still over-counts party-district data because each party-candidate pair appears once per precinct in a given year.

party.counts2<-aggregate(sample$V3, by=list(sample$V4, sample$V1), FUN=length)

party.counts2a<-aggregate(sample$V3~sample$V1:sample$V4:sample$V2, data=sample, FUN=length)
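The over-counting described above can be made visible with a small comparison. This is only a sketch: `votes` is a hypothetical copy of the sample data with named columns (the original object uses V1..V5), and `n_rows` and `n_unique_candidates` are illustrative names, not part of the question.

```r
# Hypothetical named copy of the sample data (16 precinct-level rows).
votes <- data.frame(
  year      = rep(c("1836", "1840", "1844", "1848"), each = 4),
  candidate = c(rep(c("Henry Clay", "Daniel Webster"), each = 2, times = 2),
                rep(c("Millard Fillmore", "Martin Van Buren"), each = 2, times = 2)),
  party     = c(rep(c("Democratic-Republican", "Whig",
                      "National Republican", "Whig"), each = 2),
                rep(c("Know-Nothing", "Democrat"), each = 2, times = 2)),
  district  = rep(rep(c("Alaska", "Vermont"), each = 2), times = 4),
  precinct  = rep(c("Pre1", "Pre2"), times = 8),
  stringsAsFactors = FALSE
)

groups <- votes[c("party", "district", "year")]

# FUN = length counts rows, so each candidate is counted once per precinct.
raw_counts <- aggregate(x = list(n_rows = votes$candidate),
                        by = groups, FUN = length)

# Counting distinct candidates ignores the precinct-level repetition.
unique_counts <- aggregate(x = list(n_unique_candidates = votes$candidate),
                           by = groups,
                           FUN = function(x) length(unique(x)))

# Side-by-side view of the two counts for each party-district-year cell.
comparison <- merge(raw_counts, unique_counts)
print(comparison)
```

In this sample every party-district-year cell holds two precinct rows for a single candidate, so the row count is double the number of distinct candidates.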

reshape

The reshape example displays a similar problem to the previous aggregate attempt (2a).

library(reshape2)
mdata <- melt(sample, id.vars=c("V1", "V2", "V4", "V5"), measure.vars=c("V3"))

party.counts3<-dcast(mdata, value~V1:V2:V4, length)

Again, my question is: How do I aggregate data to obtain a count of a factor (party) at each level of another factor (district) by grouping two other factors (year and candidate-name [ID])?

1 solution

#1


So far, this is a solution, but it is not very tidy. For instance, the count variable that is constructed is mislabeled in the final object: it takes the name of the variable omitted from the aggregation formula (here, V2). Also, the result is stored in a separate object (party.counts) rather than merged with the original data (the object named sample, above).

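# Drop the precinct column (V5) and keep one row per party-district-year-candidate
cross.tab<-unique(sample[c("V3", "V4", "V1", "V2")])

# Count the remaining rows per party (V3), district (V4) and year (V1);
# the count ends up in the column named V2
party.counts<-aggregate(. ~ V3:V4:V1, cross.tab, length)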

Any assistance or advice for generalizability and/or vectorization as well as ease of incorporation into the prior (original) data structure is appreciated.
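One possible way to generalize this and attach the counts to the original rows is sketched below in base R. The helper `count_unique()`, its arguments, and the data frame `votes` (a named copy of the sample data) are all hypothetical names introduced for illustration. The sketch rests on the same assumption as the solution above: dropping the precinct column and de-duplicating leaves one row per candidate within each group.

```r
# Hypothetical named copy of the sample data (the original columns are V1..V5).
votes <- data.frame(
  year      = rep(c("1836", "1840", "1844", "1848"), each = 4),
  candidate = c(rep(c("Henry Clay", "Daniel Webster"), each = 2, times = 2),
                rep(c("Millard Fillmore", "Martin Van Buren"), each = 2, times = 2)),
  party     = c(rep(c("Democratic-Republican", "Whig",
                      "National Republican", "Whig"), each = 2),
                rep(c("Know-Nothing", "Democrat"), each = 2, times = 2)),
  district  = rep(rep(c("Alaska", "Vermont"), each = 2), times = 4),
  precinct  = rep(c("Pre1", "Pre2"), times = 8),
  stringsAsFactors = FALSE
)

# Illustrative helper (not from any package): count unique id_cols combinations
# within each group_cols combination, then merge the count back onto `data`.
count_unique <- function(data, group_cols, id_cols, count_name = "n") {
  # One row per unique group/id combination; finer-grained columns are dropped.
  dedup <- unique(data[c(group_cols, id_cols)])
  # Sum a column of ones per group; setNames() gives the count an explicit name.
  counts <- aggregate(x = setNames(list(rep(1L, nrow(dedup))), count_name),
                      by = dedup[group_cols], FUN = sum)
  # Left join, so every original row keeps its place and gains the count.
  merge(data, counts, by = group_cols, all.x = TRUE)
}

# Candidates per party within each district-year.
by_year <- count_unique(votes, c("party", "district", "year"),
                        "candidate", "n_candidates")

# Candidate-year pairs per party within each district, pooled across years.
by_district <- count_unique(votes, c("party", "district"),
                            c("candidate", "year"), "n_candidate_years")
```

Naming the count through `x = list(...)` avoids the mislabeled V2 column, and `merge()` with `all.x = TRUE` keeps all 16 precinct-level rows. Note that `merge()` reorders the rows by the grouping columns. Which grouping is appropriate depends on whether the party count should be per district-year or per district across years.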
