tm_map with parallel::mclapply error in R 3.0.1 on Mac

Date: 2021-02-02 13:50:32

I am using R 3.0.1 on Platform: x86_64-apple-darwin10.8.0 (64-bit)


I am trying to use tm_map from the tm library. But when I execute this code

library(tm)
data('crude')
tm_map(crude, stemDocument)

I get this error:


Warning message:
In parallel::mclapply(x, FUN, ...) :
  all scheduled cores encountered errors in user code

Does anyone know a solution for this?


7 Answers

#1


29  

I suspect you don't have the SnowballC package installed, which seems to be required. tm_map is supposed to run stemDocument on all the documents using mclapply. Try just running the stemDocument function on one document, so you can extract the error:


stemDocument(crude[[1]])

For me, I got this error:

Error in loadNamespace(name) : there is no package called ‘SnowballC’

So I just went ahead and installed SnowballC and it worked. Clearly, SnowballC should be a dependency.

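To verify this on your own machine before stemming, a quick base-R check (nothing assumed beyond base R's requireNamespace) tells you whether SnowballC can be loaded:

```r
# Check whether the SnowballC package can be loaded; if this prints FALSE,
# install.packages("SnowballC") should resolve the stemDocument error
has_snowball <- requireNamespace("SnowballC", quietly = TRUE)
print(has_snowball)
```

Unlike library(), requireNamespace returns a logical instead of throwing an error, so it is safe to run even when the package is missing.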

#2


17  

I just ran into this. It took me a bit of digging but I found out what was happening.


  1. I had a line of code 'rdevel <- tm_map(rdevel, asPlainTextDocument)'

  2. Running this produced the error

        In parallel::mclapply(x, FUN, ...) :
          all scheduled cores encountered errors in user code

  3. It turns out that 'tm_map' calls some code in 'parallel' which attempts to figure out how many cores you have. To see what it's thinking, type

    > getOption("mc.cores", 2L)
    [1] 2
    >

  4. Aha moment! Tell the 'tm_map' call to only use one core!

    > rdevel <- tm_map(rdevel, asPlainTextDocument, mc.cores=1)
    Error in match.fun(FUN) : object 'asPlainTextDocument' not found
    > rdevel <- tm_map(rdevel, asPlainTextDocument, mc.cores=4)
    Warning message:
    In parallel::mclapply(x, FUN, ...) :
      all scheduled cores encountered errors in user code
    > 

So ... with more than one core, rather than give you the error message, 'parallel' just tells you there was an error in each core. Not helpful, parallel! It turned out I had forgotten the dot - the function name is supposed to be 'as.PlainTextDocument'!

So, if you get this error, add 'mc.cores=1' to the 'tm_map' call and run it again.
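The getOption behaviour described above is plain base R, so you can sketch what 'parallel' sees without tm at all; the second argument is just the default used when the option is unset:

```r
# getOption returns the stored option, or the supplied default when unset
options(mc.cores = NULL)          # simulate a session with no option set
print(getOption("mc.cores", 2L))  # falls back to the default: 2

options(mc.cores = 1L)            # what passing mc.cores=1 effectively requests
print(getOption("mc.cores", 2L))  # now returns the stored value: 1
```

This is why mclapply defaults to two workers even on a machine where you never set anything.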

#3


11  

I found an answer to this that worked for me in this question: Charles Copley, in his answer, indicates that he thinks the new tm package requires lazy = TRUE to be explicitly defined.

So your code would look like this:

library(tm)
data('crude')
tm_map(crude, stemDocument, lazy = TRUE)

I also tried it without SnowballC to see if it was a combination of those two answers. It did not appear to affect the result either way.


#4


3  

I had been facing the same issue but finally got it fixed. My guess is that if I name the corpus "longName" or "companyNewsCorpus" I get the issue, but if I use "a" as the corpus name, it works well. Really weird.

The code below gives the same error message mentioned in this thread:

companyNewsCorpus <- Corpus(DirSource("SourceDirectory"),
                            readerControl = list(language = "english"))
companyNewsCorpus <- tm_map(companyNewsCorpus, 
                            removeWords, stopwords("english")) 

But if I change it to the following, it works without issues.

a <- Corpus(DirSource("SourceDirectory"),
            readerControl = list(language = "english"))
a <- tm_map(a, removeWords, stopwords("english"))

#5


3  

I ran into the same problem in tm using an Intel quad core I7 running on Mac OS X 10.10.5, and got the following warning:


In mclapply(content(x), FUN, ...) scheduled core 1 encountered error in user code, all values of the job will be affected


I was creating a corpus after downloading Twitter data.


Charles Copley's solution worked for me as well. I used tm_map(*filename*, stemDocument, lazy = TRUE) after creating my corpus, and then tm worked correctly.

#6


1  

I also ran into this same issue while using the tm library's removeWords function. Some of the other answers, such as setting the number of cores to 1, did work for removing the set of English stop words, but I also wanted to remove a custom list of first names and surnames from my corpus, and these lists were upwards of 100,000 words long each.

None of the other suggestions helped with this issue, and it turned out through some trial and error that removeWords seemed to have a limit of 1,000 words per vector. So I wrote this function, which solved the issue for me:

# Let x be a corpus
# Let y be a character vector of words to remove
removeManyWords <- function (x, y) {

      # removeWords chokes on very long word vectors,
      # so pass y to it in chunks of at most 1000 words
      n <- ceiling(length(y) / 1000)

      for (i in 1:n) {

            s <- (i - 1) * 1000 + 1
            e <- min(i * 1000, length(y))  # don't index past the end of y
            x <- tm_map(x, content_transformer(removeWords), y[s:e])

      }

      x

}

This function essentially counts how many words are in the vector of words I want to remove, divides that by 1000, and rounds up to the nearest whole number, n. We then loop over the word vector in n chunks, removing each chunk in turn. With this method I didn't need to use lazy = TRUE or change the number of cores, as can be seen from the actual removeWords call in the function. Hope this helps!
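The chunking arithmetic can be checked in base R without tm at all; here is a minimal sketch (the 2,500-word vector is hypothetical) showing how a word list splits into at-most-1,000-word chunks:

```r
# Hypothetical list of 2500 words to remove
y <- paste0("word", seq_len(2500))

# Same chunking as removeManyWords: ceiling(length(y)/1000) chunks
chunks <- split(y, ceiling(seq_along(y) / 1000))

print(length(chunks))    # number of chunks
print(lengths(chunks))   # sizes of each chunk
```

For 2,500 words this yields three chunks of 1000, 1000, and 500 words, so the last tm_map call receives only the words that actually exist.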

#7


0  

I was working on Twitter data and got the same error as in the original question while I was trying to convert all text to lower case with the tm_map() function:

Warning message: In parallel::mclapply(x, FUN, ...) :   
all scheduled cores encountered errors in user code

Installing and loading package SnowballC resolved the problem completely. Hope this helps.

