How to do text classification with less memory

Time: 2022-03-20 00:11:51

(1) My goal: I am trying to use an SVM to classify 10000 documents (each with 400 words) into 10 evenly distributed classes. The features explored in my work include word n-grams (n = 1~4) and character n-grams (n = 1~6).

(2) My approach: I represent each document as a vector of frequency values, one per feature, and use TF-IDF to normalize the vectors. Parts of my code are below:

import numpy as np
from sklearn import svm
from sklearn.feature_extraction.text import TfidfTransformer
import ngram  # my own module providing characterNgrams() and wordNgrams()

def commonVec(dicts, count1, count2):
    '''Put features whose document frequency lies between count1 and count2
    into a common vector used for SVM training.'''
    global_vector = []
    master = {}
    for i, d in enumerate(dicts):
        for k in d:
            master.setdefault(k, []).append(i)
    for key in master:
        if len(master[key]) >= count1 and len(master[key]) <= count2:
            global_vector.append(key)
    return sorted(global_vector)

def featureComb(mix, count1, count2, res1):
    '''Combine word n-gram and character n-gram features into one vector.'''
    if mix[0]:
        common_vector1 = []
        for i in mix[0]:
            dicts1 = []
            for res in res1:  # I stored all documents in a database. res1 is the document result set and res is each document.
                dicts1.append(ngram.characterNgrams(res[1], i))  # characterNgrams() returns a dictionary with feature name as the key, frequency as the value.
            common_vector1.extend(commonVec(dicts1, count1, count2))
    else:
        common_vector1 = []
    if mix[1]:
        common_vector2 = []
        for j in mix[1]:
            dicts2 = []
            for res in res1:
                dicts2.append(ngram.wordNgrams(res[1], j))
            common_vector2.extend(commonVec(dicts2, count1, count2))
    else:
        common_vector2 = []
    return common_vector1 + common_vector2

def svmCombineVector(mix, global_combine, label, X, y, res1):
    '''Construct the X matrix used to train the SVM.'''
    lstm = []
    for res in res1:
        y.append(label[res[0]])  # insert the class label into y

        dici1 = {}
        dici2 = {}
        freq_term_vector = []
        for i in mix[0]:
            dici1.update(ngram.characterNgrams(res[1], i))
        freq_term_vector.extend(dici1.get(gram, 0) for gram in global_combine)
        for j in mix[1]:
            dici2.update(ngram.wordNgrams(res[1], j))
        freq_term_vector.extend(dici2.get(gram, 0) for gram in global_combine)
        lstm.append(freq_term_vector)
    freq_term_matrix = np.matrix(lstm)
    transformer = TfidfTransformer(norm="l2")
    tfidf = transformer.fit_transform(freq_term_matrix)
    X.extend(tfidf.toarray())

X = []
y = []
character = [1, 2, 3, 4, 5, 6]
word = [1, 2, 3, 4]
mix = [character, word]
global_vector_combine = featureComb(mix, 2, 5000, res1)
print len(global_vector_combine)  # 542401
svmCombineVector(mix, global_vector_combine, label, X, y, res1)
clf1 = svm.LinearSVC()
clf1.fit(X, y)

(3) My problem: However, when I run the code, a MemoryError occurs:

Traceback (most recent call last):
  File "svm.py", line 110, in <module>
    functions.svmCombineVector(mix,global_vector_combine,label,X,y,res1)
  File "/home/work/functions.py", line 201, in svmCombineVector
    X.extend(tfidf.toarray())
  File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 901, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 269, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 789, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

I'm really having a hard time with this and need help from *.

  1. Could anyone explain some details and give me some idea of how to solve it?
  2. Could anyone check my source code and show me some other ways to use memory more effectively?

1 Answer

#1


The main problem you're facing is that you're using far too many features. It's actually quite extraordinary that you've managed to generate 542401 features from documents that contain just 400 words! I've seen SVM classifiers separate spam from non-spam with high accuracy using just 150 features -- word counts of selected words that say a lot about whether the document is spam. These use stemming and other normalization tricks to make the features more effective.

You need to spend some time thinning out your features. Think about which features are most likely to contain information useful for this task. Experiment with different features. As long as you keep throwing everything but the kitchen sink in, you'll get memory errors. Right now you're trying to pass 10000 data points with 542401 dimensions each to your SVM. That's 542401 * 10000 * 4 bytes ≈ 21.7 gigabytes (conservatively) of data. My computer only has 4 gigabytes of RAM. You've got to pare this way down.[1]
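
As a rough sanity check on that arithmetic, here is a minimal sketch of the memory estimate; the 4 bytes per value assumes float32 (a dense float64 matrix would need twice as much), and it ignores the SVM's own internal copy:

n_docs = 10000
n_features = 542401
bytes_per_value = 4                      # assumed float32; float64 would double this
total_bytes = n_docs * n_features * bytes_per_value
print(total_bytes / 1e9)                 # ~21.7 GB for a single dense copy of the matrix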

A first step towards doing so would be to think about how big your total vocabulary size is. Each document has only 400 words, but let's say those 400 words are taken from a vocabulary of 5000 words. That means there will be 5000 ** 4 = 6.25 * 10 ** 14 possible 4-grams. That's half a quadrillion possible 4-grams. Of course not all those 4-grams will appear in your documents, but this goes a long way towards explaining why you're running out of memory. Do you really need these 4-grams? Could you get away with 2-grams only? There are a measly 5000 ** 2 = 25 million possible 2-grams. That will fit much more easily in memory, even if all possible 2-grams appear (unlikely).
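
A minimal sketch of that combinatorics, using the same hypothetical 5000-word vocabulary (the real vocabulary size would have to be measured from your corpus):

vocab_size = 5000                        # hypothetical vocabulary size, for illustration only
for n in range(1, 5):
    print((n, vocab_size ** n))          # (1, 5000) ... (4, 625000000000000), i.e. 6.25 * 10 ** 14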

Also keep in mind that even if the SVM could handle quadrillions of datapoints, it would probably give bad results, because when you give any learning algorithm too many features, it will tend to overfit, picking up on irrelevant patterns and overgeneralizing from them. There are ways of dealing with this, but it's best not to deal with it at all if you can help it.

I will also mention that these are not "newbie" problems. These are problems that machine learning specialists with PhDs have to deal with. They come up with lots of clever solutions, but we're not so clever that way, so we have to be clever a different way.

Although I can't offer you specific suggestions for cleverness without knowing more, I would say that, first, stemming is a good idea in at least some cases. Stemming simply removes grammatical inflection, so that different forms of the same word ("swim" and "swimming") are treated as identical. This will probably reduce your vocabulary size significantly, at least if you're dealing with English text. A common choice is the porter stemmer, which is included in nltk, as well as in a number of other packages. Also, if you aren't already, you should probably strip punctuation and reduce all words to lower-case. From there, it really depends. Stylometry (identifying authors) sometimes requires only particles ("a", "an", "the"), conjunctions ("and", "but") and other very common words; spam, on the other hand, has its own oddball vocabularies of interest. At this level, it is very difficult to say in advance what will work; you'll almost certainly need to try different approaches to see which is most effective. As always, testing is crucial!
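
For example, a minimal normalization pass along those lines might look like the sketch below. It assumes NLTK is installed and uses its PorterStemmer; the normalize helper and the simple whitespace tokenization are illustrative choices, not a prescribed pipeline:

import string
from nltk.stem import PorterStemmer      # assumes the nltk package is available

stemmer = PorterStemmer()

def normalize(text):
    '''Lower-case the text, strip punctuation, and stem each whitespace-separated token.'''
    cleaned = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return [stemmer.stem(token) for token in cleaned.split()]

print(normalize("Swimming swims, swim!"))   # ['swim', 'swim', 'swim']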

[1] Well, possibly you have a huge amount of RAM at your disposal. For example, I have access to a machine with 48G of RAM at my current workplace. But I doubt it could handle this either, because the SVM will have its own internal representation of the data, which means there will be at least one copy at some point; if a second copy is needed at any point -- kaboom.
