How to remove NA values from a data frame using ddply?

Date: 2021-11-18 01:38:03

Hopefully you guys can help me out. I've been looking all over the web, and I can't find an answer. Here's my data frame:

name    city    state   stars    main_category
A   Pittsburgh  PA       5.0     Soul Food
B   Houston     TX       3.0     Professional Services
C   Lafayette   IN       3.0     NA
D   Los Angeles CA       4.0     Local Services
E   Los Angeles CA       3.0     Local Services
F   Lafayette   IN       3.5     *n
G   Pittsburgh  PA       5.0     Doctors
H   Pittsburgh  PA       4.0     Soul Food
I   Houston     TX       4.0     Professional Services

What I would like is to group the rows by city (alphabetically) and state, and then rank the rows within each group by their number of stars. Here's what I was hoping for:

name    city    state   stars    main_category              rank
I   Houston     TX       4.0     Professional Services       1  
B   Houston     TX       3.0     Professional Services       2
F   Lafayette   IN       3.5     *n                          1
D   Los Angeles CA       4.0     Local Services              1
E   Los Angeles CA       3.0     Local Services              2
G   Pittsburgh  PA       5.0     Doctors                     1
A   Pittsburgh  PA       5.0     Soul Food                   1
H   Pittsburgh  PA       4.0     Soul Food                   2

Here's my line of code.

l <- ddply(d, c("city", "state", "main_category"), na.rm=T, transform, rank=rank(-stars, ties.method="max"))

This does not remove the Lafayette row that has NA. I don't know what to put instead. I also tried na.omit, but when I did, the rank column did not show up.

3 solutions

#1


1  

Here's a base R solution. Not sure if you're set on using dplyr, but this seems to work. I think the last row should be ranked 3, since two rows are tied at rank 1.

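# Drop every row that contains an NA in any column
no <- na.omit(dat)
# Sort by city, then state, then stars in descending order
new <- no[do.call(order, with(no, list(city, state, -stars))),]
# Rank within each city; tied rows share the lowest rank
within(new, {
    rank  <- Reduce(c, Map(rank, split(-stars, city), ties.method = "min"))
})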
#   name        city state stars         main_category rank
# 9    I     Houston    TX   4.0 Professional Services    1
# 2    B     Houston    TX   3.0 Professional Services    2
# 6    F   Lafayette    IN   3.5             *n    1
# 4    D Los Angeles    CA   4.0        Local Services    1
# 5    E Los Angeles    CA   3.0        Local Services    2
# 1    A  Pittsburgh    PA   5.0             Soul Food    1
# 7    G  Pittsburgh    PA   5.0               Doctors    1
# 8    H  Pittsburgh    PA   4.0             Soul Food    3
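
The tie handling mentioned above is easy to check in isolation. The following is a minimal sketch, using only the three Pittsburgh star ratings from the question, that compares two ties.method settings of base R's rank() with a dense ranking (the variable names are just for illustration):

```r
# The three Pittsburgh ratings from the question (rows A, G, H)
stars <- c(5, 5, 4)

# "min": tied rows share the lowest rank, the next rank is skipped
r_min <- rank(-stars, ties.method = "min")    # 1 1 3

# "max": tied rows share the highest rank
r_max <- rank(-stars, ties.method = "max")    # 2 2 3

# Dense ranking: no gaps after ties
r_dense <- as.integer(factor(-stars))         # 1 1 2
```

With "min", the 4.0 row gets rank 3, which matches the output shown above.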

#2


0  

Using dplyr

library(dplyr)
filter(dat, complete.cases(dat)) %>%
    group_by(city) %>%
    arrange(city, state, desc(stars)) %>%
    mutate(rank = min_rank(desc(stars)))
 #   name        city state stars         main_category rank
 #1    I     Houston    TX   4.0 Professional Services    1
 #2    B     Houston    TX   3.0 Professional Services    2
 #3    F   Lafayette    IN   3.5             *n    1
 #4    D Los Angeles    CA   4.0        Local Services    1
 #5    E Los Angeles    CA   3.0        Local Services    2
 #6    A  Pittsburgh    PA   5.0             Soul Food    1
 #7    G  Pittsburgh    PA   5.0               Doctors    1
 #8    H  Pittsburgh    PA   4.0             Soul Food    3

#3


0  

Extra arguments given to ddply are forwarded to .fun, so any NA handling has to go inside the function that does the work; in your case that would be inside rank.

Your approach to the NAs was as follows:

ddply(d, c("city", "state", "main_category"), na.rm=T, transform, rank=rank(-stars, ties.method="max"))
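
To see what happens to na.rm=T in that call: ddply hands it on to transform, and transform treats every named argument as a column definition. A minimal base R sketch with a made-up three-row data frame (toy and out are hypothetical names, not from the question):

```r
# Hypothetical toy data, not the data from the question
toy <- data.frame(stars = c(5, NA, 3))

# transform() turns na.rm = TRUE into a new column named "na.rm";
# no rows are dropped
out <- transform(toy, na.rm = TRUE, rank = rank(-stars, ties.method = "max"))
out
```

The result still has all three rows, plus an extra na.rm column.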

Passing the argument inside .fun should fix it. Note that rank has no na.rm argument; its NA handling is controlled by na.last. At least it works for me:

ddply(d, c("city", "state", "main_category"), transform,
      rank = rank(-stars, na.last = TRUE, ties.method = "max"))
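
One caveat: na.last only controls how missing values in stars are ranked; it does not drop rows. If the goal is to remove the row whose main_category is NA, one option is to filter the data frame before calling ddply. The sketch below rebuilds the data from the question and assumes the NA there is a real missing value rather than the string "NA"; clean and ranked are illustrative names:

```r
library(plyr)

# Data rebuilt from the question; "*n" is copied verbatim from the post
d <- data.frame(
  name = LETTERS[1:9],
  city = c("Pittsburgh", "Houston", "Lafayette", "Los Angeles", "Los Angeles",
           "Lafayette", "Pittsburgh", "Pittsburgh", "Houston"),
  state = c("PA", "TX", "IN", "CA", "CA", "IN", "PA", "PA", "TX"),
  stars = c(5, 3, 3, 4, 3, 3.5, 5, 4, 4),
  main_category = c("Soul Food", "Professional Services", NA, "Local Services",
                    "Local Services", "*n", "Doctors", "Soul Food",
                    "Professional Services"),
  stringsAsFactors = FALSE
)

# Drop incomplete rows first, then rank within each group
clean <- d[complete.cases(d), ]
ranked <- ddply(clean, c("city", "state", "main_category"), transform,
                rank = rank(-stars, ties.method = "max"))
ranked
```

Row C is gone and the rank column is present. Because the grouping includes main_category, row H is ranked 2 within Pittsburgh / Soul Food, as in the output the question asks for.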
