I am creating a document-term matrix (DTM for short) for a Naive Bayes implementation. (I know there is a function for this, but I have to code it myself for homework.) I wrote a function that successfully creates the DTM; the problem is that the resulting matrix takes up too much memory. For example, a 100 x 32000 matrix (of 0s and 1s) is 24 MB in size! This leads to crashes in R when I try to work with the full 10k documents. The functions follow, and a toy example is in the last 3 lines. Can anyone spot why the "sparser" function in particular returns such memory-intensive results?
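#function to build the dictionary of unique words across all documents
#note: stopWords is not defined in this post; it is assumed to exist in the workspace
listAllWords <- function(docs)
{
  str1 <- strsplit(x=docs, split="\\s", fixed=FALSE)
  dictDupl <- unlist(str1)[!(unlist(str1) %in% stopWords)]
  dictionary <- unique(dictDupl)
}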
#function to create the sparse matrix of words as they appear in each article segment
sparser <- function (docs, dictionary)
{
  num.docs <- length(docs) #dtm rows
  num.words <- length(dictionary) #dtm columns
  dtm <- mat.or.vec(num.docs,num.words) # Instantiate dtm of zeroes
  for (i in 1:num.docs)
  {
    doc.temp <- unlist(strsplit(x=docs[i], split="\\s", fixed=FALSE)) #vectorize words
    num.words.doc <- length(doc.temp)
    for (j in 1:num.words.doc)
    {
      ind <- which(dictionary == doc.temp[j]) #loop over words and find index in dict.
      dtm[i,ind] <- 1 #indicate this word is in this document
    }
  }
  return(dtm)
}
docs <- c("the first document contains words", "the second document is also made of words", "the third document is words and a number 4")
dictionary <- listAllWords(docs)
dtm <- sparser(docs,dictionary)
If it makes any difference, I am running this in RStudio on Mac OS X, 64-bit.
3 Solutions
#1
1
Surely part of your problem is that you are not actually storing integers, but doubles. Note:
m <- mat.or.vec(100,32000)
m1 <- matrix(0L,100,32000)
> object.size(m)
25600200 bytes
> object.size(m1)
12800200 bytes
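The two sizes follow from the element width: a double takes 8 bytes and an integer takes 4, plus a small fixed header (200 bytes in the output above; the exact overhead may differ between R versions). A quick arithmetic check, as a sketch with illustrative variable names:

```r
n <- 100 * 32000          # number of cells in the matrix

bytes_double  <- n * 8    # doubles are 8 bytes each
bytes_integer <- n * 4    # integers are 4 bytes each

# Payload only; object.size() reports a small header on top of this
bytes_double              # 25600000
bytes_integer             # 12800000

# Header overhead actually reported by this R installation
overhead_double  <- as.numeric(object.size(matrix(0,  100, 32000))) - bytes_double
overhead_integer <- as.numeric(object.size(matrix(0L, 100, 32000))) - bytes_integer
```

The 25.6 million bytes of the double matrix are about 24.4 MiB, which matches the 24 MB reported in the question.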
And note the lack of the "L" in the code for mat.or.vec:
> mat.or.vec
function (nr, nc)
if (nc == 1L) numeric(nr) else matrix(0, nr, nc)
<bytecode: 0x1089984d8>
<environment: namespace:base>
You will also want to explicitly assign 1L; otherwise R will convert the whole matrix to doubles upon the first assignment of a plain 1, I think. You can verify that by assigning the value 1 to a single element of m1 above and rechecking the object size.
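A minimal sketch of that check, using a smaller matrix so it runs quickly (the names `m_int`, `type_after_int`, and so on are only for illustration):

```r
m_int <- matrix(0L, 100, 320)        # integer matrix
size_before <- object.size(m_int)

m_int[1, 1] <- 1L                    # integer literal: storage stays integer
type_after_int <- typeof(m_int)      # "integer"

m_int[1, 2] <- 1                     # double literal: whole matrix is coerced
type_after_dbl <- typeof(m_int)      # "double"
size_after <- object.size(m_int)

# Payload grows from 4 to 8 bytes per cell, so the size roughly doubles
as.numeric(size_after) / as.numeric(size_before)
```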
I should probably also mention the function storage.mode, which can help you verify that you're using integers.
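Putting these points together, here is one way the question's function could be adapted. This is only a sketch: `sparser_int` is a hypothetical name, the dictionary is built without the stop-word filtering from the question, and `match()` is swapped in for the inner `which()` loop.

```r
# Hypothetical variant of sparser() that keeps the matrix in integer storage
sparser_int <- function(docs, dictionary) {
  dtm <- matrix(0L, nrow = length(docs), ncol = length(dictionary))
  for (i in seq_along(docs)) {
    words <- unlist(strsplit(docs[i], split = "\\s"))
    ind <- match(words, dictionary)   # NA for words not in the dictionary
    ind <- ind[!is.na(ind)]
    dtm[i, ind] <- 1L                 # 1L, not 1, so no coercion to double
  }
  dtm
}

docs <- c("the first document contains words",
          "the second document is also made of words")
dictionary <- unique(unlist(strsplit(docs, split = "\\s")))
dtm <- sparser_int(docs, dictionary)

storage.mode(dtm)   # "integer"
dim(dtm)            # 2 rows (documents), 10 columns (distinct words)
```

Checking `storage.mode(dtm)` after building the matrix is a cheap way to catch an accidental `<- 1` somewhere in the loop.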
#2
0
If you want to store 0/1 values economically, I would suggest the raw type.
m8 <- matrix(0,100,32000)
m4 <- matrix(0L,100,32000)
m1 <- matrix(raw(1),100,32000)
The raw type takes just 1 byte per value:
> object.size(m8)
25600200 bytes
> object.size(m4)
12800200 bytes
> object.size(m1)
3200200 bytes
Here is how to work with it:
> m1[2,2] = as.raw(1)
> m1[2,2]
[1] 01
> as.integer(m1[2,2])
[1] 1
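Raw values are bytes rather than numbers, so for counting (for example, the per-word document counts Naive Bayes needs) it is probably simplest to convert to integer first. A small sketch, with `m_raw` and `m_back` as illustrative names:

```r
m_raw <- matrix(raw(1), 3, 4)   # 3 x 4 matrix of 00 bytes
m_raw[2, 2] <- as.raw(1)
m_raw[3, 2] <- as.raw(1)

# as.integer() drops the dim attribute, so restore the shape explicitly
m_back <- matrix(as.integer(m_raw), nrow = nrow(m_raw), ncol = ncol(m_raw))

typeof(m_raw)                   # "raw"
typeof(m_back)                  # "integer"
col_counts <- colSums(m_back)   # number of 1s in each column
```

Converting the whole matrix at once temporarily needs the integer-sized copy in memory, so for a large matrix it may be preferable to convert and sum one column (or a block of columns) at a time.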