I have a very large CSV file of similarities between keywords (about 91 million rows, so a for loop in R takes too long). When I read it into a data.frame it looks like:
> df
kwd1 kwd2 similarity
a b 1
b a 1
c a 2
a c 2
It is a sparse list and I would like to convert it into a sparse matrix:
> myMatrix
a b c
a . 1 2
b 1 . .
c 2 . .
I tried using sparseMatrix(), but converting the keyword names to integer indexes takes too much time.
Thanks for any help!
1 solution
#1
1
acast from the reshape2 package will do this nicely (the result is an ordinary dense matrix). There are base R solutions, but I find their syntax much more difficult.
library(reshape2)
df <- structure(list(kwd1 = structure(c(1L, 2L, 3L, 1L), .Label = c("a",
"b", "c"), class = "factor"), kwd2 = structure(c(2L, 1L, 1L,
3L), .Label = c("a", "b", "c"), class = "factor"), similarity = c(1L,
1L, 2L, 2L)), .Names = c("kwd1", "kwd2", "similarity"), class = "data.frame", row.names = c(NA,
-4L))
acast(df, kwd1 ~ kwd2, value.var='similarity', fill=0)
a b c
a 0 1 2
b 1 0 0
c 2 0 0
Using sparseMatrix from the Matrix package:
library(Matrix)
df$kwd1 <- factor(df$kwd1)
df$kwd2 <- factor(df$kwd2)
foo <- sparseMatrix(as.integer(df$kwd1), as.integer(df$kwd2), x=df$similarity)
> foo
3 x 3 sparse Matrix of class "dgCMatrix"
foo <- sparseMatrix(as.integer(df$kwd1), as.integer(df$kwd2), x=df$similarity, dimnames=list(levels(df$kwd1), levels(df$kwd2)))
> foo
3 x 3 sparse Matrix of class "dgCMatrix"
a b c
a . 1 2
b 1 . .
c 2 . .
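The factor-based conversion above gives each column its own set of levels, so rows and columns only line up when both columns happen to contain exactly the same keywords. A minimal sketch of one way around that, assuming the keywords can be held in memory as character vectors: build a single shared vocabulary and look up both columns in it with match(), which is vectorised and needs no explicit loop. The names keywords, i, j and sim are illustrative and not from the original post.

```r
library(Matrix)

# Toy data standing in for the real 91-million-row file.
df <- data.frame(
  kwd1 = c("a", "b", "c", "a"),
  kwd2 = c("b", "a", "a", "c"),
  similarity = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# One shared vocabulary, so row k and column k refer to the same keyword.
keywords <- sort(unique(c(df$kwd1, df$kwd2)))

# Vectorised lookup of each keyword's position in the shared vocabulary.
i <- match(df$kwd1, keywords)
j <- match(df$kwd2, keywords)

# Square sparse matrix with keyword names on both dimensions.
sim <- sparseMatrix(i = i, j = j, x = df$similarity,
                    dims = c(length(keywords), length(keywords)),
                    dimnames = list(keywords, keywords))
```

Individual entries can then be read by name, for example sim["a", "c"]. Note that sparseMatrix() adds up the values of repeated (i, j) pairs, so duplicated rows in the input would need to be removed first if that is not what you want.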