I have a SQL table that maps, say, authors and books. I would like to group linked authors and books (books written by the same author, and authors who co-wrote a book) together and ascertain how big these groups get. For example, if J.K. Rowling co-wrote with Junot Diaz, and Junot Diaz co-wrote a book with Zadie Smith, then I would want all three authors in the same group.
Here's a toy data set (h/t Matthew Dowle) with some of the relationships I am talking about:
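library(data.table)
set.seed(1)
# number of authors (1 to 3) for each of 100 books
authors <- replicate(100, sample(1:3, 1))
book_id <- rep(1:100, times = authors)
# draw that many distinct author ids (from 1:100) for each book
author_id <- c(lapply(authors, sample, x = 1:100, replace = FALSE), recursive = TRUE)
aubk <- data.table(author_id = author_id, book_id = book_id)
aubk[order(book_id, author_id), ]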
Here one sees that authors 27 and 36 co-wrote book 2, so they should be in the same group. The same goes for authors 63 and 100 on book 3, and so on for every other co-authored book.
I can't think of a good way to do this other than a for-loop, which (as you can guess) is slow. I tried a bit of data.table to avoid unnecessary copying. Is there a better way of doing it?
library(data.table)
aubk <- data.table(aubk)
aubk[, group := 0L]
#system.time({
for (x in seq_len(nrow(aubk))) {
  if (x == 1L) {
    # the first row starts group 1
    value <- 1L
  } else {
    # rows already processed
    sb <- aubk[1:(x-1),]
    # reuse the group of an earlier row with the same author...
    index <- match(aubk[x,author_id],sb[,author_id])
    if (is.na(index)) {
      # ...or, failing that, with the same book
      index <- match(aubk[x,book_id],sb[,book_id])
      if (is.na(index)) {
        # otherwise start a new group
        value <- x
      } else {
        value <- aubk[index,group]
      }
    } else {
      value <- aubk[index,group]
    }
  }
  aubk[x,group:=value]
}
#})
EDIT: As mentioned by @Josh O'Brien and @thelatemail, my problem can also be worded as looking for the connected components of a graph from a two-column list where every edge is a row, and the two columns are the nodes connected.
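The connected-components framing can be sketched in base R with a union-find (disjoint-set) structure, which works directly on the two-column edge list and never builds an adjacency matrix. This is only an illustration under assumptions: the function name find_groups, the "au"/"bk" prefixes and the small data frame are made up for the example and do not come from the question or its answers.

```r
# Hypothetical helper: label each row of an author-book edge list with the
# connected component it belongs to, using union-find.
find_groups <- function(author_id, book_id) {
  # Prefix the ids so that author 1 and book 1 are different nodes
  a_key <- paste0("au", author_id)
  b_key <- paste0("bk", book_id)
  nodes <- unique(c(a_key, b_key))
  a_idx <- match(a_key, nodes)
  b_idx <- match(b_key, nodes)

  # parent[i] == i means node i is the root of its set
  parent <- seq_along(nodes)

  find_root <- function(i) {
    while (parent[i] != i) {
      parent[i] <<- parent[parent[i]]  # path halving keeps the trees shallow
      i <- parent[i]
    }
    i
  }

  # Every row is an edge: merge the sets of its author and its book
  for (k in seq_along(a_idx)) {
    ra <- find_root(a_idx[k])
    rb <- find_root(b_idx[k])
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }

  # Relabel the roots as consecutive group numbers 1, 2, ...
  roots <- vapply(a_idx, find_root, integer(1))
  match(roots, unique(roots))
}

# Authors 1, 2, 3 are chained through books 1 and 2; author 4 is alone
aubk <- data.frame(author_id = c(1, 2, 2, 3, 4),
                   book_id   = c(1, 1, 2, 2, 3))
aubk$group <- find_groups(aubk$author_id, aubk$book_id)

# Number of distinct authors in each group
group_sizes <- tapply(aubk$author_id, aubk$group, function(a) length(unique(a)))
```

The loop is plain R, so it will not be fast on very large tables, but its memory use grows with the number of rows rather than with the square of the number of nodes.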
3 Answers
#1
3
Converting 500K nodes into an adjacency matrix was too much for my computer's memory, so I couldn't use igraph. The RBGL package isn't updated for R version 2.15.1, so that was out as well.
After writing a lot of dumb code that doesn't seem to work, I think the following gets me to the right answer.
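# start with every row labelled by its author
aubk[,grp := author_id]
iterations <- 0
repeat {
  grp.old <- copy(aubk$grp)
  # spread the smallest label over each author's rows, then over each book's rows
  aubk[,grp := min(grp),by=author_id]
  aubk[,grp := min(grp), by=book_id]
  # stop once a full pass changes no label; comparing only the number of
  # groups can stop early while an author still carries two labels
  if(identical(grp.old, aubk$grp)) {break}
  iterations <- iterations + 1
}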
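To see the min-label propagation on a case small enough to follow by hand, here is a base-R sketch in which ave() stands in for data.table's by= grouping. The five-row data frame and the names passes and group_sizes are invented for the illustration. The loop stops when a full pass leaves every label unchanged. On this particular data, a test that only compares the number of distinct labels between passes would stop after the second pass, with author 2 still split across two labels.

```r
# Chain: author 2 - book 1 - author 3 - book 2 - author 1,
# and author 2 also wrote book 3 alone
aubk <- data.frame(author_id = c(2, 3, 3, 1, 2),
                   book_id   = c(1, 1, 2, 2, 3))

# Start with every row labelled by its author
aubk$grp <- aubk$author_id
passes <- 0
repeat {
  grp_old <- aubk$grp
  # Smallest label within each author, then within each book
  aubk$grp <- ave(aubk$grp, aubk$author_id, FUN = min)
  aubk$grp <- ave(aubk$grp, aubk$book_id, FUN = min)
  # Converged when no row changed its label in this pass
  if (identical(grp_old, aubk$grp)) break
  passes <- passes + 1
}

# How big each group gets, counted in distinct authors
group_sizes <- tapply(aubk$author_id, aubk$grp, function(a) length(unique(a)))
```

All five rows end up with label 1, so group_sizes reports a single group of three authors.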
#2
1
Here's a go re-hashing my answer to an old question of mine that Josh O'Brien linked in the comments (identify groups of linked episodes which chain together). This answer uses the igraph library.
# Dummy data that might be easier to interpret to show it worked
# Authors 1,2 and 3,4 should group. author 5 is a group to themselves
aubk <- data.frame(author_id=c(1,2,3,4,5),book_id=c(1,1,2,2,5))
# identify authors with a bit of leading text to prevent clashes
# with the book ids
aubk$author_id2 <- paste0("au",aubk$author_id)
library(igraph)
#create a graph - this needs to be matrix input
au_graph <- graph.edgelist(as.matrix(aubk[c("author_id2","book_id")]))
# get the ids of the authors
result <- data.frame(author_id=names(au_graph[1]),stringsAsFactors=FALSE)
# get the corresponding group membership of the authors
result$group <- clusters(au_graph)$membership
# subset to only the authors data
result <- result[substr(result$author_id,1,2)=="au",]
# make the author_id variable numeric again
result$author_id <- as.numeric(substr(result$author_id,3,nchar(result$author_id)))
> result
  author_id group
1         1     1
3         2     1
4         3     2
6         4     2
7         5     3
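For readers on a more recent igraph release, the same idea might be written with graph_from_data_frame() and components(), which are the newer names for building a graph from an edge list and finding its connected components. This is a sketch rather than a drop-in replacement: the "bk" prefix on the book ids and the object names edges, g, comp and group_sizes are assumptions made for the example. The graph is built straight from the edge list, so no dense adjacency matrix is created along the way.

```r
library(igraph)

# Same toy data: authors 1,2 share book 1; authors 3,4 share book 2; author 5 is alone
aubk <- data.frame(author_id = c(1, 2, 3, 4, 5), book_id = c(1, 1, 2, 2, 5))

# Prefix both columns so an author id can never collide with a book id
edges <- data.frame(from = paste0("au", aubk$author_id),
                    to   = paste0("bk", aubk$book_id))

# One edge per row; direction is irrelevant for grouping
g <- graph_from_data_frame(edges, directed = FALSE)
comp <- components(g)

# Membership is returned in vertex order, so pair it with the vertex names
v_names <- V(g)$name
is_author <- startsWith(v_names, "au")
result <- data.frame(
  author_id = as.numeric(sub("^au", "", v_names[is_author])),
  group     = as.vector(comp$membership)[is_author]
)

# Number of authors in each group
group_sizes <- table(result$group)
```

The group numbers themselves are arbitrary labels; only which authors share a label matters.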
#3
0
A couple of suggestions:
aubk[,list(author_list = list(sort(author_id))), by = book_id]
will give a list of author groups (the sorted authors of each book).
The following will create a unique identifier for each group of authors and then return a list with
- the number of books
- a list of the book ids
- a unique identifier of the book_ids
- the number of authors
for each unique group of authors.
aubk[, list(author_list = list(sort(author_id)),
group_id = paste0(sort(author_id), collapse=','),
n_authors = .N),by = book_id][,
list(n_books = .N,
n_authors = unique(n_authors),
book_list = list(book_id),
book_ids = paste0(book_id, collapse = ', ')) ,by = group_id]
If the author order matters, just remove the sort in the definitions of author_list and group_id.
EDIT
Note that the above, while useful, does not do the appropriate grouping: it only groups books that share exactly the same set of authors.
Perhaps the following will:
# the unique groups of authors by book
unique_authors <- aubk[, list(sort(author_id)), by = book_id]
# some helper functions
# a filter function that allows arguments to be passed
.Filter <- function (f, x, ...)
{
  ind <- as.logical(sapply(x, f, ...))
  x[!is.na(ind) & ind]
}
# any(x in y)?
`%%in%%` <- function(x, table) {any(unlist(x) %in% table)}
# function to filter a list and return the unique elements from
# flattened values
FilterList <- function(.list, table) {
  unique(unlist(.Filter(`%%in%%`, .list, table = table)))
}
# all the authors
all_authors <- unique(unlist(unique_authors))
# with names!
setattr(all_authors, 'names', all_authors)
# get for each author, the authors with whom they have
# collaborated in at least 1 book
lapply(all_authors, FilterList, .list = unique_authors)