Running count based on a field in R

Time: 2020-12-09 08:49:11

I have a data set in this format:

User       
1 
2
3
2
3
1  
1      

Now I want to add a column named Count that holds a running count of how many times each user has occurred so far. I want the output in the format below.

User    Count
1       1
2       1 
3       1
2       2
3       2
1       2
1       3

I have found a few solutions, but all of them are somewhat slow.

Running count variable in R

My data.frame has 100,000 rows now and may soon grow to 1 million, so I need a solution that is also fast.

3 solutions

#1


4  

You can use getanID from my "splitstackshape" package:

library(splitstackshape)
getanID(mydf, "User")
##    User .id
## 1:    1   1
## 2:    2   1
## 3:    3   1
## 4:    2   2
## 5:    3   2
## 6:    1   2
## 7:    1   3

This is essentially a "data.table" approach that looks something like the following:

library(data.table)
as.data.table(mydf)[, count := seq_len(.N), by = "User"][]
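
For completeness, here is a self-contained sketch of the same grouped counter, with the question's sample data rebuilt by hand. It uses rowid(), a helper that recent versions of data.table provide for numbering rows within each distinct value. The names mydf, dt and Count are only illustrative, and no timing claim is made here: whether this is fast enough for a million rows is something to measure on your own data.

```r
library(data.table)

# Sample data from the question (rebuilt by hand for this sketch)
mydf <- data.frame(User = c(1, 2, 3, 2, 3, 1, 1))

# Convert once, then add the running count by reference
dt <- as.data.table(mydf)
dt[, Count := rowid(User)]  # 1, 2, 3, ... within each distinct User value

print(dt)
```

Because := modifies dt in place, no copy of the table is made when the column is added.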

#2


7  

An option using dplyr:

 library(dplyr)
 df1 %>%
      group_by(User) %>%
      mutate(Count=row_number())
 #    User Count
 #1    1     1
 #2    2     1
 #3    3     1
 #4    2     2
 #5    3     2
 #6    1     2
 #7    1     3

Using sqldf

library(sqldf)
sqldf('select a.*, 
           count(*) as Count
           from df1 a, df1 b
           where a.User = b.User and b.rowid <= a.rowid
           group by a.rowid')
#   User Count
#1    1     1
#2    2     1
#3    3     1
#4    2     2
#5    3     2
#6    1     2
#7    1     3

#3


6  

This is fairly easy with ave and seq.int:

> ave(User,User, FUN= seq.int)
[1] 1 1 1 2 2 2 3

This is a common strategy and is often used when the items are adjacent to each other, although ave does not require the rows of a group to be adjacent. The second argument is the grouping variable. The first argument is effectively a dummy here, since the only thing it contributes is its length.
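
A minimal base R sketch of the same idea, assuming the question's data sits in a data frame called mydf (the name is borrowed from the first answer). It swaps seq.int for seq_along; that swap is a suggestion, not part of the answer above. The reason is that seq.int applied to a single number n returns 1:n, so a user that occurs only once can trigger a length-mismatch warning, whereas seq_along always returns one index per element.

```r
# Sample data from the question (rebuilt by hand for this sketch)
mydf <- data.frame(User = c(1, 2, 3, 2, 3, 1, 1))

# Within each User group, number the rows 1, 2, 3, ... in order of appearance.
# The first argument only supplies the length and storage type of the result.
mydf$Count <- ave(mydf$User, mydf$User, FUN = seq_along)

print(mydf)
```

Note that the new column takes the storage type of the first argument, so Count is stored as double here; wrap the call in as.integer() if an integer column is needed.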
