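The RSS Feed Import Example in the Bayesian Classification Chapter of Machine Learning in Action (《机器学习实战》)

I got stuck here while typing in the book's code. Other readers may well run into the same problems, so I am writing down what I found.

A quick gripe first: I suspect the main reason most people get stuck here is the GFW. Those who have already gotten around it, whether through software or by physically moving abroad, will probably never see this post anyway; if you would rather not read on, please use your own circumvention skills.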

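How do I install feedparser?

Installing feedparser directly from the URL given in the book fails with an error saying that setuptools is missing. The official setuptools advice is that on Windows it is best installed with ez_setup.py, but I could not download ez_setup.py from the official site. This post gives a workaround: http://adesquared.wordpress.com/2013/07/07/setting-up-python-and-easy_install-on-windows-7/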

ez_setup.py

Copy this file into the C:\python27 folder and run on the command line: python ez_setup.py install

Then change to the folder that holds the feedparser installation files and run: python setup.py install
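
Once both installs finish, a quick way to confirm that Python can actually see the packages is to try importing them. The helper below is only a sketch; `is_importable` is a name made up for this post and is not part of feedparser or setuptools.

```python
import importlib


def is_importable(module_name):
    """Return True if `module_name` can be imported in this interpreter.

    Hypothetical helper for illustration; not part of any library.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Both names are the import names of the packages installed above.
    for name in ("setuptools", "feedparser"):
        print("%s importable: %s" % (name, is_importable(name)))
```

If either line reports False, the corresponding install step most likely did not complete for the interpreter you are running.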

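What if the RSS feed link given by the author, "http://newyork.craigslist.org/stp/index.rss", is not accessible?

The author's idea in the book is to treat articles from the feed http://newyork.craigslist.org/stp/index.rss as class 1 and articles from the feed http://sfbay.craigslist.org/stp/index.rss as class 0.

To get the sample code to run, you can substitute any two RSS feeds that are reachable.

These are the two I used: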

NASA Image of the Day: http://www.nasa.gov/rss/dyn/image_of_the_day.rss

Yahoo Sports - NBA - Houston Rockets News: http://sports.yahoo.com/nba/teams/hou/rss.xml

In other words, if the algorithm works correctly, every article from NASA should be classified as 1 and every Houston Rockets story from Yahoo Sports as 0.
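
Before wiring replacement feeds into the book's code, it can help to check how many entries each feed actually returns, and how many of them carry the 'summary' text that the listing reads. This is a minimal sketch, assuming the parsed feed behaves like a dict with an 'entries' list (as the value returned by feedparser.parse does); `count_usable_entries` is a hypothetical helper name, and the data is a stand-in so that the example runs offline.

```python
def count_usable_entries(feed):
    """Count entries that have a non-empty 'summary' field.

    `feed` is anything shaped like feedparser's parse result:
    a mapping with an 'entries' list of mappings.
    """
    usable = 0
    for entry in feed.get("entries", []):
        summary = entry.get("summary", "")
        if summary and summary.strip():
            usable += 1
    return usable


# Stand-in data shaped like a parsed feed, so the sketch runs offline.
fake_feed = {
    "entries": [
        {"summary": "Hubble captures a distant galaxy"},
        {"summary": "   "},            # whitespace only: not usable
        {"title": "no summary here"},  # missing field: not usable
    ]
}
print(count_usable_entries(fake_feed))  # prints 1
```

With real feeds the call would look like count_usable_entries(feedparser.parse(url)). The smaller of the two counts limits how many documents per class are available.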

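With my own RSS feeds, the program raises an error when it reaches trainNB0(array(trainMat),array(trainClasses)). What now?

Judging from the book's example, the author's feeds contain plenty of articles: len(ny['entries']) is 100. The feeds I found have only about 10 to 20 entries each.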

>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> ny['entries']
>>> len(ny['entries'])
100

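Because each of the author's feeds has 100 articles, his code can afford to drop the 30 most frequent words as "stop words" and randomly pick 20 articles as the test set. With a replacement feed we may have only 10 articles, yet the code still tries to pull out 20 for testing, which obviously fails. With 10 entries per feed there are just 20 documents in all, so taking 20 for testing leaves nothing to train on, which would explain why trainNB0 fails. Adjusting the test set size yourself is enough to make the code run. If the articles contain few words, removing fewer "stop words" can also improve accuracy.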
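
One way to avoid hard-coding the test set size is to derive it from the number of documents actually available. The sketch below is not the book's code: `split_indices` and the 20% default are assumptions chosen for illustration.

```python
import random


def split_indices(num_docs, test_fraction=0.2, rng=None):
    """Randomly split range(num_docs) into training and test indices.

    At least one document goes to each side, so the training set can
    never end up empty. Hypothetical helper, not from the book.
    """
    if num_docs < 2:
        raise ValueError("need at least 2 documents to split")
    rng = rng or random.Random()
    num_test = int(round(num_docs * test_fraction))
    num_test = max(1, min(num_test, num_docs - 1))
    test_set = rng.sample(range(num_docs), num_test)
    held_out = set(test_set)
    training_set = [i for i in range(num_docs) if i not in held_out]
    return training_set, test_set


# 10 entries per feed -> 20 documents in total
train, test = split_indices(20, rng=random.Random(0))
print("%d training, %d test" % (len(train), len(test)))  # 16 training, 4 test
```

In localWords, the two returned lists would take the place of trainingSet and testSet.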

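If I don't want to remove the 30 most frequent words, how else can I get rid of "stop words"?

You can keep the stop words to remove in a txt file and read them in when needed (replacing the code that removes the high-frequency words). For which words to treat as stop words, see http://www.ranks.nl/stopwords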
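
Here is a minimal sketch of that idea, assuming a plain-text file with one stop word per line. The helper names are made up for this post. The file is split on non-word characters, the same way the book's textParse tokenizes text, so an entry such as "aren't" becomes "aren" and "t" and can match tokens taken from the feeds.

```python
import os
import re
import tempfile


def load_stop_words(path):
    """Read stop words from a text file into a set of lowercase tokens."""
    with open(path) as handle:
        text = handle.read()
    # Split on runs of non-word characters; drop empty strings.
    return set(tok.lower() for tok in re.split(r"\W+", text) if tok)


def remove_stop_words(vocab_list, stop_words):
    """Return a copy of vocab_list without any stop words."""
    return [word for word in vocab_list if word not in stop_words]


if __name__ == "__main__":
    # Write a tiny stop word file so the example runs on its own.
    handle = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
    handle.write("a\nabout\naren't\nthe\n")
    handle.close()
    try:
        stop_words = load_stop_words(handle.name)
        vocab = ["the", "rockets", "about", "nasa", "aren"]
        print(remove_stop_words(vocab, stop_words))
        # prints ['rockets', 'nasa']
    finally:
        os.remove(handle.name)
```

Using a set keeps each membership test cheap, which matters little for a few hundred stop words but costs nothing either.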

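The full listing below reads the stop words from stopword.txt, so save them under that name for it to run.

My txt file contains the following words, and the results are decent: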

a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
below
between
both
but
by
can't
cannot
could
couldn't
did
didn't
do
does
doesn't
doing
don't
down
during
each
few
for
from
further
had
hadn't
has
hasn't
have
haven't
having
he
he'd
he'll
he's
her
here
here's
hers
herself
him
himself
his
how
how's
i
i'd
i'll
i'm
i've
if
in
into
is
isn't
it
it's
its
itself
let's
me
more
most
mustn't
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan't
she
she'd
she'll
she's
should
shouldn't
so
some
such
than
that
that's
the
their
theirs
them
themselves
then
there
there's
these
they
they'd
they'll
they're
they've
this
those
through
to
too
under
until
up
very
was
wasn't
we
we'd
we'll
we're
we've
were
weren't
what
what's
when
when's
where
where's
which
while
who
who's
whom
why
why's
with
won't
would
wouldn't
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves

'''
Created on Oct 19, 2010
@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help','my','dog', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones()
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts,listClasses = loadDataSet()
    print 'listOPost:\n',listOPosts
    print 'listClasses:\n',listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n',myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:',trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

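def textParse(bigString):    #input is big string, #output is word list
    import re
    # \W+ rather than \W*: same result on Python 2, and it avoids splitting
    # between every character on Python 3.7+, where empty matches also split
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]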

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = range(50); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def stopWords():
    import re
    with open('stopword.txt') as f:  # see http://www.ranks.nl/stopwords
        wordList = f.read()
    # split like textParse does, so "aren't" yields "aren" and "t";
    # skip the empty strings that split() can leave at either end
    listOfTokens = re.split(r'\W+', wordList)
    return [tok.lower() for tok in listOfTokens if tok]

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    print '\nVocabList is ',vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ',stopWord
##    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ',pairW[0]
    trainingSet = range(2*minLen); testSet=[]           #create test set
    print '\n\nBegin to create a test set: \ntrainingSet:',trainingSet,'\ntestSet',testSet
    for i in range(5):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:',trainingSet,'\ntestSet',testSet
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:',len(trainMat)
    print '\ntrainClasses',trainClasses
    print '\n\ntrainNB0:'
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    #print '\np0V:',p0V,'\np1V',p1V,'\npSpam',pSpam
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector),p0V,p1V,pSpam)
        originalClass = classList[docIndex]
        result =  classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n',docList[docIndex],'\nis classified as: ',classifiedClass,', while the original class is: ',originalClass,'. --',not result
    print '\nthe error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

def testRSS():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList,pSF,pNY = localWords(ny,sf)

def testTopWords():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny,sf)

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts,listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses
    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList,postinDoc))
    print 'Train Matrix:\n', trainMat
    print '\n*** Train ***'
    p0V,p1V,pAb = trainNB0(trainMat,listClasses)
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
