I'm doing a cross-sell analysis for several products in R. I've already transformed the transactional data, and it looks like this:
df.articles <- cbind.data.frame(Art01,Art02,Art03)
Art01   Art02    Art03
bread   yoghurt  egg
butter  bread    yoghurt
cheese  butter   bread
egg     cheese   NA
potato  NA       NA
The actual data is a 'data.frame' with 69099 obs. of 33 variables.
I want the list of all distinct articles, with their counts, that were sold together with a given article (say bread or yoghurt in this case). The actual data contains 56 articles, and for each of them I need to check all the articles it was cross-sold with. So the result I would like to have looks like this:
Products sold with **bread**    Products sold with **Yoghurt**
yoghurt 2                       bread  2
egg     1                       egg    1
cheese  1                       butter 1
butter  1
... and the list goes on like this for, say, 52 different articles.
I've tried a couple of things, but they are too manual for a dataset this big. It would be great to solve this with library(data.table); if not, any other approach is fine too. Thank you very much in advance.
2 solutions
#1
3
There's...
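library(data.table)
setDT(DF)  # DF is the wide order table (df.articles in the question), converted by reference
# Add an order id r, then melt to long form: one row per (order, article), NAs dropped
dat = setorder(melt(DF[, r := .I], id="r", na.rm=TRUE)[, !"variable"])
# Cross-join each order's articles with themselves, drop self-pairs, count each ordered pair
res = dat[, CJ(art = value, other_art = value), by=r][art != other_art, .N, keyby=.(art, other_art)]
res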
art other_art N
1: bread butter 2
2: bread cheese 1
3: bread egg 1
4: bread yoghurt 2
5: butter bread 2
6: butter cheese 1
7: butter yoghurt 1
8: cheese bread 1
9: cheese butter 1
10: cheese egg 1
11: egg bread 1
12: egg cheese 1
13: egg yoghurt 1
14: yoghurt bread 2
15: yoghurt butter 1
16: yoghurt egg 1
Comment. The OP mentions having 56 distinct items, which means a single order (r above) could have as many as 3136 = 56^2 rows after CJ. With a few thousand orders, this rapidly becomes problematic. This is typical of combinatorial computations, so hopefully this task is just for browsing the data and not for analysing it.
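If the per-order cross join does become too heavy, one possible alternative (not part of the answer above, just a sketch) is to build an order-by-article incidence matrix and take its cross-product, so the result is a single articles-by-articles matrix. The names df, long, inc and co are made up for this illustration, and the sketch assumes an article should count at most once per order, which differs from CJ if an article can repeat within one order.

```r
# Toy data copied from the question.
df <- data.frame(
  Art01 = c("bread", "butter", "cheese", "egg", "potato"),
  Art02 = c("yoghurt", "bread", "butter", "cheese", NA),
  Art03 = c("egg", "yoghurt", "bread", NA, NA),
  stringsAsFactors = FALSE
)

# Long form: one (order, article) pair per non-missing cell.
m <- as.matrix(df)
long <- data.frame(order = as.vector(row(m)), art = as.vector(m))
# Assumption: an article counts once per order, so duplicates are dropped.
long <- unique(long[!is.na(long$art), ])

# Order-by-article incidence matrix (1 if the article is in the order).
inc <- unclass(table(long$order, long$art))

# t(inc) %*% inc: for each pair of articles, the number of orders containing both.
co <- crossprod(inc)
diag(co) <- 0  # drop self-pairs

# Articles sold with bread, with their counts.
co["bread", co["bread", ] > 0]
```

On the toy data this gives the same pair counts as res above, and the result never has more cells than the square of the number of distinct articles.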
Another idea, when browsing, would be to use split and lapply to customize the display:
library(magrittr)
split(res, by="art", keep.by = FALSE) %>% lapply(. %$% setNames(N, other_art))
$bread
butter cheese egg yoghurt
2 1 1 2
$butter
bread cheese yoghurt
2 1 1
$cheese
bread butter egg
1 1 1
$egg
bread cheese yoghurt
1 1 1
$yoghurt
bread butter egg
2 1 1
I usually just explore with res[art == "bread"], res[art == "bread" & other_art == "butter"], etc., though, as @ycw suggested in a comment.
Magrittr isn't really needed here; it just allows for different syntax.
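For reference, the same per-article display can be sketched without magrittr, using only base R. The small res below is a plain data.frame stand-in holding a few rows of the result shown earlier, and by_art is a name made up for this example; the with() call should work the same way on the real data.table.

```r
# Stand-in for part of res (values copied from the output above).
res <- data.frame(
  art       = c("bread", "bread", "bread", "bread", "butter", "butter", "butter"),
  other_art = c("butter", "cheese", "egg", "yoghurt", "bread", "cheese", "yoghurt"),
  N         = c(2L, 1L, 1L, 2L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Name each count by its partner article, then split the named vector by article.
by_art <- with(res, split(setNames(N, other_art), art))

by_art$bread
```

Each list element is a named integer vector, so by_art$bread[["yoghurt"]] looks up a single pair.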
#2
1
Here is an option. We can use some functions from tidyverse to create a list with the results. a_list4 is the final output. Each element corresponds to one article and holds the counts of the articles sold with it.
# Prepare the data frame "dt"
dt <- read.table(text = "Art01 Art02 Art03
bread yoghurt egg
butter bread yoghurt
cheese butter bread
egg cheese NA
potato NA NA",
header = TRUE, stringsAsFactors = FALSE)
# Load package
library(tidyverse)
# A vector with articles
articles <- unique(unlist(dt))
# Remove NA
articles <- articles[!is.na(articles)]
# A function to keep only the rows (orders) that contain the article
# na.rm = TRUE matters: without it rowSums() is NA for rows with NA cells and filter() drops them
filter_fun <- function(article, dt){
  dt2 <- dt %>% filter(rowSums(. == article, na.rm = TRUE) > 0)
  return(dt2)
}
# Apply the filter_fun
a_list <- map(articles, filter_fun, dt = dt)
names(a_list) <- articles
# Get articles in each element of the list
a_list2 <- map(a_list, function(dt) unlist(dt))
# Remove the articles based on the name of that article
a_list3 <- map2(a_list2, names(a_list2), function(vec, article){
  vec[!(vec %in% article)]
})
# Count the number (table() ignores NA by default)
a_list4 <- map(a_list3, table)
# See the results
a_list4
$bread
butter cheese egg yoghurt
2 1 1 2
$butter
bread cheese yoghurt
2 1 1
$cheese
bread butter egg
1 1 1
$egg
bread cheese yoghurt
1 1 1
$potato
< table of extent 0 >
$yoghurt
bread butter egg
2 1 1