C1W2.Assignment: Naive Bayes

Lecture notes: C1W2.Sentiment Analysis with Naïve Bayes

Table of contents

Loading packages and data
Part 1: Process the Data
- Part 1.1 Implementing your helper functions
- - Instructions
Part 2: Train your model using Naive Bayes
- Steps
- Prior and Logprior
- Positive and Negative Probability of a Word
- Log likelihood
Part 3: Test your naive bayes
- Implementing test_naive_bayes
Part 4: Filter words by Ratio of positive to negative counts
- Implementing get_ratio
- Implementing get_words_by_threshold(freqs,label,threshold)
Part 5: Error Analysis
Part 6: Predict with your own tweet

Loading packages and data

from utils import process_tweet, lookup
import pdb
from nltk.corpus import stopwords, twitter_samples
import numpy as np
import pandas as pd
import nltk
import string
from nltk.tokenize import TweetTokenizer
from os import getcwd
import w2_unittest

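Note: the NLTK Twitter samples and the stop-word list must be downloaded beforehand, for example with nltk.download('twitter_samples') and nltk.download('stopwords').
Load the data and split it into training and test sets: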

# get the sets of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# avoid assumptions about the length of all_positive_tweets
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

Part 1: Process the Data

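In any machine learning project, once the data has been collected, the first step is to process it so that it gives the model useful input.

Remove noise: first strip the noise from the data, that is, the words that tell you little about the content (stop words). These include common words such as "I, you, are, is, ...", which carry too little sentiment information.
Remove stock market tickers, retweet symbols, hyperlinks, and hashtag symbols, because they carry little sentiment information.
Remove all punctuation from the tweet. We want a word to count as the same word with or without punctuation, rather than treating "happy", "happy?", "happy!", "happy,", and "happy." as different words.
Apply stemming and keep only the stem of each word. In other words, "motivation", "motivated", and "motivate" are all reduced to the same stem "motiv-" and are therefore handled alike.

The process_tweet function from C1W1 performs all of these steps.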

custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"

# print cleaned tweet
print(process_tweet(custom_tweet))

Output: ['hello', 'great', 'day', ':)', 'good', 'morn']
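
To make the cleaning steps listed above concrete, here is a minimal sketch using only the standard library. It is not the course's process_tweet: the name simple_process_tweet and the tiny STOPWORDS set are assumptions made for illustration, stemming is left out, and emoticons such as ":)" are dropped along with the punctuation, so its output differs from the real function.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); the real function uses NLTK's list.
STOPWORDS = {"i", "you", "are", "is", "am", "a", "the", "have", "there"}

def simple_process_tweet(tweet):
    """Simplified stand-in for process_tweet: cleans and tokenizes, but does not stem."""
    tweet = re.sub(r"\$\w*", "", tweet)         # stock market tickers such as $GE
    tweet = re.sub(r"^RT\s+", "", tweet)        # retweet marker at the start
    tweet = re.sub(r"https?://\S+", "", tweet)  # hyperlinks
    tweet = re.sub(r"@\w+", "", tweet)          # user handles (dropped in the output above)
    tweet = tweet.replace("#", "")              # drop the hash sign, keep the tagged word

    cleaned = []
    for token in tweet.lower().split():
        # Strip leading/trailing punctuation so "happy?" and "happy." both become "happy".
        word = token.strip(string.punctuation)
        if word and word not in STOPWORDS:
            cleaned.append(word)
    return cleaned

custom_tweet = ("RT @Twitter @chapagain Hello There! Have a great day. :) "
                "#good #morning http://chapagain.com.np")
print(simple_process_tweet(custom_tweet))
# ['hello', 'great', 'day', 'good', 'morning']
```

Compared with the real output, "morning" is not stemmed to "morn" and the emoticon is missing, which shows what the stemmer and the tweet-aware tokenizer contribute.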

Part 1.1 Implementing your helper functions

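The goal here is to build the sentiment frequency dictionary (freqs), whose keys are (word, label) tuples and whose values are the corresponding frequencies.
utils.py also provides a lookup helper function (lookup), which takes the freqs dictionary, a word, and a label (1 or 0), and returns the number of times that (word, label) tuple appears in the collection of tweets.
For example, given the following tweets:
["i am rather excited", "you are rather happy"]
with both labels equal to 1, the function returns a dictionary containing these key-value pairs:
{
("rather", 1): 2,
("happi", 1) : 1,
("excit", 1) : 1
}

Note that every word in a given string is assigned the same label, 1.
Note that the words "i" and "am" are not saved, because they are stop words and have been removed by process_tweet.
Note that the word "rather" appears twice in the list of tweets, so its count is 2.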
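
The lookup helper itself is not shown in this post. A minimal sketch that matches the description above could look like the following; the name lookup_sketch is hypothetical, and the actual code in utils.py may differ.

```python
def lookup_sketch(freqs, word, label):
    """Return how often (word, label) was counted, or 0 if the pair never occurred."""
    return freqs.get((word, label), 0)

# The example dictionary from the text above.
freqs = {("rather", 1): 2, ("happi", 1): 1, ("excit", 1): 1}

print(lookup_sketch(freqs, "rather", 1))  # 2
print(lookup_sketch(freqs, "rather", 0))  # 0
```

Returning 0 for unseen pairs matters later on: a word that never appears with a given label still needs a count so that it can be smoothed.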

Instructions

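Create a function count_tweets that takes a list of tweets as input, preprocesses every tweet, and returns a dictionary.

Each key in the dictionary is a tuple containing a stemmed word and its class label, e.g. ("happi", 1).
Each value is the number of times that word appears in the given tweets (an integer).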

# UNQ_C1 GRADED FUNCTION: count_tweets

def count_tweets(result, tweets, ys):
    '''
    Input:
        result: a dictionary that will be used to map each pair to its frequency
        tweets: a list of tweets
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each pair to its frequency
    '''
    ### START CODE HERE ###
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            # define the key, which is the word and label tuple
            pair = (word,y)
            
            # if the key exists in the dictionary, increment the count
            if pair in result:
                result[pair] += 1

            # else, if the key is new, add it to the dictionary and set the count to 1
            else:
                result[pair] = 1
    ### END CODE HERE ###

    return result

# Testing your function

result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
count_tweets(result, tweets, ys)

Result:
{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}

Part 2: Train your model using Naive Bayes

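Steps

The first step in training a naive Bayes classifier is to determine the number of classes.
Then compute a probability for each class.
$P(D_{pos})$ is the probability that a document is positive.
$P(D_{neg})$ is the probability that a document is negative.
Use the following formulas and store the values in a dictionary:
$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$

$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$

where $D$ is the total number of tweets, $D_{pos}$ is the number of positive tweets, and $D_{neg}$ is the number of negative tweets.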
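
As a small worked instance of equations (1) and (2), the sketch below computes both class probabilities from a list of 0/1 labels and stores them in a dictionary. The function name class_probabilities and the dictionary keys are illustrative assumptions, not part of the assignment.

```python
def class_probabilities(labels):
    """Return P(D_pos) and P(D_neg) for a sequence of 0/1 labels."""
    d = len(labels)                            # D: total number of tweets
    d_pos = sum(1 for y in labels if y == 1)   # D_pos: positive tweets
    d_neg = d - d_pos                          # D_neg: negative tweets
    return {"pos": d_pos / d, "neg": d_neg / d}

toy_labels = [1, 0, 0, 0, 0]  # the labels used in the count_tweets test above
probs = class_probabilities(toy_labels)
print(probs)  # {'pos': 0.2, 'neg': 0.8}
```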

Prior and Logprior

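The prior probability represents the underlying probability, in the target population, that a tweet is positive versus negative. In other words, if we had no specific information and blindly picked a tweet out of the population, what is the probability that it is positive versus negative? The prior is expressed as the ratio:
$\cfrac{P(D_{pos})}{P(D_{neg})}$
We can take the log of the prior to rescale it; we call this the logprior:
$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$
Note that $\log(\frac{A}{B})$ equals $\log(A) - \log(B)$. The logprior can therefore also be computed as the difference of two logs:
$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$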
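
Equation (3) can be checked numerically with a few lines of Python. This is only a sketch; the function name logprior_from_counts is an assumption. With the balanced split used in this post (4000 positive and 4000 negative training tweets) the logprior is exactly 0, so the prior does not favor either class.

```python
import math

def logprior_from_counts(d_pos, d_neg):
    """logprior = log(D_pos) - log(D_neg), as in equation (3)."""
    return math.log(d_pos) - math.log(d_neg)

balanced = logprior_from_counts(4000, 4000)  # same number of tweets per class
skewed = logprior_from_counts(1, 4)          # one positive tweet, four negative

print(balanced)  # 0.0
print(skewed)    # negative, because negative tweets dominate
```

A negative logprior shifts every prediction toward the negative class, and a positive one shifts it toward the positive class.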

Positive and Negative Probability of a Word

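To compute the positive and negative probability of a specific word in the vocabulary, use the following inputs:

$freq_{pos}$ and $freq_{neg}$ are the frequencies of that word in the positive and negative class, respectively. In other words, the positive frequency of a word is the number of times the word is counted with label 1.
$N_{pos}$ and $N_{neg}$ are the total numbers of positive and negative words across all documents (all tweets), respectively.
$V$ is the number of unique words in the entire set of documents, across all classes (positive or negative).

The positive and negative probabilities of a specific word can be computed with the following formulas: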
$P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4}$
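
The sketch below applies equation (4) to the small frequency dictionary returned by the count_tweets test earlier. The helper name word_probability is hypothetical, and applying the same formula to the negative class (with $freq_{neg}$ and $N_{neg}$) is an assumption made by symmetry.

```python
def word_probability(freq, n_class, v):
    """Smoothed probability (freq + 1) / (N_class + V), as in equation (4)."""
    return (freq + 1) / (n_class + v)

# Frequencies from the count_tweets test above.
freqs = {("happi", 1): 1, ("trick", 0): 1, ("sad", 0): 1, ("tire", 0): 2}

vocab = {word for word, _ in freqs}
v = len(vocab)                                                            # V = 4
n_pos = sum(count for (_, label), count in freqs.items() if label == 1)   # N_pos = 1
n_neg = sum(count for (_, label), count in freqs.items() if label == 0)   # N_neg = 4

# "happi" occurs once with label 1 and never with label 0.
p_happi_pos = word_probability(freqs.get(("happi", 1), 0), n_pos, v)  # (1 + 1) / (1 + 4)
p_happi_neg = word_probability(freqs.get(("happi", 0), 0), n_neg, v)  # (0 + 1) / (4 + 4)

print(p_happi_pos, p_happi_neg)  # 0.4 0.125
```

The added 1 in the numerator keeps the probability above zero for a word that was never seen in a class, which avoids taking the log of 0 in a later step.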
