Subsetting data from one table into different tables based on a condition

I'm doing cross-sell analysis for several products with R. I've already transformed the transactional data and it looks like this -

  df.articles <- cbind.data.frame(Art01,Art02,Art03)

  Art01         Art02      Art03
  bread         yoghurt    egg
  butter        bread      yoghurt
  cheese        butter     bread
  egg           cheese     NA
  potato        NA         NA

 Actual data is 'data.frame': 69099 obs. of  33 variables.

I want the list of all distinct articles, with their counts, that were sold together with a given article (say bread or yoghurt in this case). The actual data contains 56 articles, and for each of them I need to check all the articles it was cross-sold with. So the result I want should look like this -

     Products sold with bread           Products sold with yoghurt

     yoghurt         2                  bread           2
     butter          2                  egg             1
     egg             1                  butter          1
     cheese          1

     .... and the list goes on like this for, say, 52 different articles.

I've tried a couple of things, but they are too manual for a dataset this big. It would be great to solve this with library(data.table), but other approaches are fine too. Thank you very much in advance.

2 Solutions

#1

There's...

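library(data.table)
setDT(DF)  # DF is the articles table (df.articles in the question)
# one row per (order r, article); NAs and the column label are dropped
dat = setorder(melt(DF[, r := .I], id="r", na.rm=TRUE)[, !"variable"])
# all ordered pairs of articles within each order, then count each pair
res = dat[, CJ(art = value, other_art = value), by=r][art != other_art, .N, keyby=.(art, other_art)]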

        art other_art N
 1:   bread    butter 2
 2:   bread    cheese 1
 3:   bread       egg 1
 4:   bread   yoghurt 2
 5:  butter     bread 2
 6:  butter    cheese 1
 7:  butter   yoghurt 1
 8:  cheese     bread 1
 9:  cheese    butter 1
10:  cheese       egg 1
11:     egg     bread 1
12:     egg    cheese 1
13:     egg   yoghurt 1
14: yoghurt     bread 2
15: yoghurt    butter 1
16: yoghurt       egg 1

Comment. The OP mentions having 56 distinct items, which means a single order (r above) could have as many as 3136 = 56^2 rows after CJ. With a few thousand orders, this rapidly becomes problematic. This is typical when doing combinatorial computations, so hopefully this task is just for browsing the data and not analysing it.
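If the per-order cross join does get too large, one alternative is to count co-occurrences with a matrix product instead. This is a different technique from the CJ approach above, shown only as a minimal base R sketch on the sample data from the question. The names order_id, item, inc and co are made up for illustration, and it assumes an article should count once per order even if it is listed twice.

```r
# Sample data copied from the question; the real table has 33 columns.
df.articles <- data.frame(
  Art01 = c("bread", "butter", "cheese", "egg", "potato"),
  Art02 = c("yoghurt", "bread", "butter", "cheese", NA),
  Art03 = c("egg", "yoghurt", "bread", NA, NA),
  stringsAsFactors = FALSE
)

# Long form: unlist() stacks the columns, so order ids repeat once per column.
order_id <- rep(seq_len(nrow(df.articles)), times = ncol(df.articles))
item     <- unlist(df.articles, use.names = FALSE)
keep     <- !is.na(item)

# Orders x articles incidence matrix: 1 if the article appears in the order.
inc <- 1 * (unclass(table(order_id[keep], item[keep])) > 0)

# t(inc) %*% inc counts, for every pair of articles, the orders containing both.
co <- crossprod(inc)
diag(co) <- 0  # an article is not cross-sold with itself

# Counts of everything sold with bread
co["bread", ]
```

The result is an articles x articles matrix (56 x 56 for the real data), so its size depends on the number of distinct articles rather than on the number of orders.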

Another idea when browsing would be to use split and lapply to customize the display:

library(magrittr)
split(res, by="art", keep.by = FALSE) %>% lapply(. %$% setNames(N, other_art))

$bread
 butter  cheese     egg yoghurt 
      2       1       1       2 

$butter
  bread  cheese yoghurt 
      2       1       1 

$cheese
 bread butter    egg 
     1      1      1 

$egg
  bread  cheese yoghurt 
      1       1       1 

$yoghurt
 bread butter    egg 
     2      1      1

I usually just explore with res[art == "bread"], res[art == "bread" & other_art == "butter"], etc, though, as @ycw suggested in a comment.

Magrittr isn't really needed here; it just allows for different syntax.
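For reference, the same display can be sketched without magrittr. The block below is only an illustration: it rebuilds a small stand-in for res as a plain data.frame (just the bread and butter rows from the output above) so that it runs on its own, and by_art is a made-up name. With the real data.table, split(res, by = "art", keep.by = FALSE) would replace split(res, res$art).

```r
# Hypothetical stand-in for `res`: the bread and butter rows of the output above.
res <- data.frame(
  art       = c("bread", "bread", "bread", "bread", "butter", "butter", "butter"),
  other_art = c("butter", "cheese", "egg", "yoghurt", "bread", "cheese", "yoghurt"),
  N         = c(2L, 1L, 1L, 2L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Split by article, then turn each piece into a named vector of counts.
by_art <- lapply(split(res, res$art), function(d) setNames(d$N, d$other_art))

by_art$bread
```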

#2

Here is an option that uses a few functions from tidyverse to build a list of results. a_list4 is the final output: each element is named after an article and holds the counts of the articles sold with it.

# Prepare the data frame "dt"
dt <- read.table(text = "Art01         Art02      Art03
                 bread         yoghurt    egg
                 butter        bread      yoghurt
                 cheese        butter     bread
                 egg           cheese     NA
                 potato        NA         NA",
                 header = TRUE, stringsAsFactors = FALSE)

# Load package
library(tidyverse)

# A vector with articles
articles <- unique(unlist(dt))

# Remove NA
articles <- articles[!is.na(articles)]

# A function to filter the data frame by articles
# na.rm = TRUE keeps rows that contain NA; without it rowSums() returns NA
# for those rows and filter() silently drops them
filter_fun <- function(article, dt){
  dt2 <- dt %>% filter(rowSums(. == article, na.rm = TRUE) > 0)
  return(dt2)
}

# Apply the filter_fun
a_list <- map(articles, filter_fun, dt = dt)
names(a_list) <- articles

# Get articles in each element of the list
a_list2 <- map(a_list, function(dt) unlist(dt))

# Remove the articles based on the name of that article
a_list3 <- map2(a_list2, names(a_list2), function(vec, article){
  vec[!(vec %in% article)]
})

# Count the number (table() ignores the remaining NA values)
a_list4 <- map(a_list3, table)

# See the results
a_list4

$bread

 butter  cheese     egg yoghurt 
      2       1       1       2 

$butter

  bread  cheese yoghurt 
      2       1       1 

$cheese

 bread butter    egg 
     1      1      1 

$egg

  bread  cheese yoghurt 
      1       1       1 

$potato
< table of extent 0 >

$yoghurt

 bread butter    egg 
     2      1      1
