[Background]
[背景]
I have some data of online activities from a group of users:
我有一些来自一群用户的在线活动数据:
-
userId
indicates the ID of the user. - userId表示用户的ID。
-
pageType
indicates the current page the user is on.home
indicates the homepage, whilecontent
indicates the content page. - pageType指示用户正在访问的当前页面。home表示主页,而content表示内容页面。
- The pages are already sorted by time, so row 1 happens before row 2, and row 2 happens before row 3, ...
- 页面已经按时间排序,所以第1行在第2行之前,第2行在第3行之前……
- Actual data has roughly 2 million rows, and 8 page types.
userId
is a 36-characteredjava.util.UUID
object. - 实际数据大约有200万行,8个页面类型。userId是一个36字符的java.util。对象UUID。
[Goal]
[目的]
I want to generate a new column for each pageType
and count the number of previous page views (not including current) of the the exact same type.
我想为每个pageType生成一个新的列,并计算相同类型的以前的页面视图(不包括当前视图)的数量。
[Sample Data]
(样本数据)
To generate a sample of the actual data:
生成实际数据的样本:
library(data.table)
DT <- data.table("userId"=rep(1:3, each=10),
"pageType"=c("home", "content", "home", "content", "home", "home", "content", "content", "home", "home",
"content", "content", "home", "home", "content", "home", "home", "content", "home", "content",
"home", "home", "content", "content", "home", "home", "content", "content", "home", "content"))
> DT
userId pageType
1: 1 home
2: 1 content
3: 1 home
4: 1 content
5: 1 home
6: 1 home
7: 1 content
8: 1 content
9: 1 home
10: 1 home
... ... ...
[My Attempts]
(我的尝试)
I have tried to solve this problem in two ways, but both of them are too slow. I also feel my solution didn't use data.table
the way it is designed for.
我试过用两种方法来解决这个问题,但这两种方法都太慢了。我也觉得我的解决方案没有使用数据。按设计的方式摆放。
Solution I
解我
- Filter by
pageType
and increment byuserId
. - 按页面类型筛选,按用户id递增。
- Set missing values for the other
pageType
. - 为其他pageType设置缺失值。
Below is the code:
下面是代码:
FixPageView <- function(data, type) {
val <- 0
for (i in 1:nrow(data)) {
if (is.na(data[[type]][i])) {
set(data, i, type, val)
} else {
val <- data[[type]][i]
}
}
}
DT[pageType=="home", numHomePagesViewed:=0:(.N-1), by=userId]
DT[pageType=="content", numContentPagesViewed:=0:(.N-1), by=userId]
FixPageView(DT, "numHomePagesViewed")
FixPageView(DT, "numContentPagesViewed")
> DT
userId pageType numHomePagesViewed numContentPagesViewed
1: 1 home 0 0
2: 1 content 0 0
3: 1 home 1 0
4: 1 content 1 1
5: 1 home 2 1
6: 1 home 3 1
7: 1 content 3 2
8: 1 content 3 3
9: 1 home 4 3
10: 1 home 5 3
... ... ... ... ...
Solution II
解决方案二
Double for
loop and set it row by row.
双for循环,并将它按行设置。
DT[, numHomePagesViewed := 0L][, numContentPagesViewed := 0L]
for (i in unique(DT$userId)) {
home_inc <- -1L
content_inc <- -1L
for (j in 1L:nrow(DT[userId==i])) {
if (DT$pageType[(i-1L)*10L + j] == "home") {
home_inc <- home_inc + 1L
set(DT, (i-1L)*10L + j, "numHomePagesViewed", home_inc)
} else {
set(DT, (i-1L)*10L + j, "numHomePagesViewed", max(0, home_inc))
}
if (DT$pageType[(i-1L)*10L + j] == "content") {
content_inc <- content_inc + 1L
set(DT, (i-1L)*10L + j, "numContentPagesViewed", content_inc)
} else {
set(DT, (i-1L)*10L + j, "numContentPagesViewed", max(0, content_inc))
}
}
}
> DT
userId pageType numHomePagesViewed numContentPagesViewed
1: 1 home 0 0
2: 1 content 0 0
3: 1 home 1 0
4: 1 content 1 1
5: 1 home 2 1
6: 1 home 3 1
7: 1 content 3 2
8: 1 content 3 3
9: 1 home 4 3
10: 1 home 5 3
... ... ... ... ...
[Question]
[问题]
- What can I do to improve the speed?
- 我能做些什么来提高速度呢?
- Is there a more "
data.table
" way to solve this problem? - 还有更多的“数据”吗?“解决这个问题的方法?”
1 个解决方案
#1
4
I'd try:
我试一试:
DT[,lapply(unique(pageType),
function(x) pmax(cumsum(pageType==x)-1,0)),by=userId]
Next, you have to rename the obtained columns.
接下来,您必须重命名已获得的列。
As suggested in the comments, you can assign the names with one line:
正如评论中提到的,你可以用一行来命名:
DT[, paste0("num",unique(DT$pageType),"PagesViewed") :=
lapply(unique(pageType), function(x) pmax(cumsum(pageType==x)-1,0)), by=userId]
#1
4
I'd try:
我试一试:
DT[,lapply(unique(pageType),
function(x) pmax(cumsum(pageType==x)-1,0)),by=userId]
Next, you have to rename the obtained columns.
接下来,您必须重命名已获得的列。
As suggested in the comments, you can assign the names with one line:
正如评论中提到的,你可以用一行来命名:
DT[, paste0("num",unique(DT$pageType),"PagesViewed") :=
lapply(unique(pageType), function(x) pmax(cumsum(pageType==x)-1,0)), by=userId]