Before We Classify
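- Given a movie review (plain text), we want to know whether its tone is positive (+1) or negative (-1). This post uses a naive Bayes classifier to solve that problem. Naive Bayes works by computing the probability that a sample belongs to each class. The computation is based on Bayes' theorem: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, which gives the probability of A given B.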
# Here is a running history covering seven days.
# For each day, it contains whether or not the person ran, and whether or not they were tired.
days = [["ran", "was tired"], ["ran", "was not tired"], ["didn't run", "was tired"], ["ran", "was tired"], ["didn't run", "was not tired"], ["ran", "was not tired"], ["ran", "was tired"]]
# This is P(A):the probability of being tired
prob_tired = len([d for d in days if d[1] == "was tired"]) / len(days)
# This is P(B):the probability of running
prob_ran = len([d for d in days if d[0] == "ran"]) / len(days)
# This is P(B|A):the probability of running given that you are tired
prob_ran_given_tired = len([d for d in days if d[0] == "ran" and d[1] == "was tired"]) / len([d for d in days if d[1] == "was tired"])
# Now we can calculate P(A|B).
prob_tired_given_ran = (prob_ran_given_tired * prob_tired) / prob_ran
print("Probability of being tired given that you ran: {0}".format(prob_tired_given_ran))
'''
Probability of being tired given that you ran: 0.6
'''
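As a sanity check on the arithmetic above, the same conditional probability can be read straight off the data by counting, without going through Bayes' theorem. The sketch below repeats the days list so that it runs on its own; the names ran_days and direct_estimate are illustrative and not part of the original code.

```python
# Same seven-day history as above, repeated so this snippet is self-contained.
days = [["ran", "was tired"], ["ran", "was not tired"], ["didn't run", "was tired"],
        ["ran", "was tired"], ["didn't run", "was not tired"], ["ran", "was not tired"],
        ["ran", "was tired"]]

# Keep only the days on which the person ran (the condition B).
ran_days = [d for d in days if d[0] == "ran"]

# P(tired | ran) estimated directly: the fraction of "ran" days that were also tiring.
direct_estimate = len([d for d in ran_days if d[1] == "was tired"]) / len(ran_days)

print("Direct estimate of P(tired | ran): {0}".format(direct_estimate))
```

Three of the five running days were tiring, which agrees with the 0.6 obtained from Bayes' theorem.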
Naive Bayes Intro
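- The previous example has a single attribute (whether the person ran), and tiredness is the variable being predicted, so Bayes' theorem $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$ applies directly. With more than one attribute, $P(B \mid A)$ becomes hard to estimate, and this is where naive Bayes comes in. Naive Bayes assumes the attributes are conditionally independent given the class, which gives:
$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$
- The next example has two attributes, whether the person ran and whether they woke up early. Given the sample ["ran", "didn't wake up early"], we predict whether the person was tired: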
days = [["ran", "was tired", "woke up early"], ["ran", "was not tired", "didn't wake up early"], ["didn't run", "was tired", "woke up early"], ["ran", "was tired", "didn't wake up early"], ["didn't run", "was tired", "woke up early"], ["ran", "was not tired", "didn't wake up early"], ["ran", "was tired", "woke up early"]]
new_day = ["ran", "didn't wake up early"]
def calc_y_probability(y_label, days):
    # P(y): fraction of all days that carry this label
    return len([d for d in days if d[1] == y_label]) / len(days)

def calc_ran_probability_given_y(ran_label, y_label, days):
    # P(ran_label | y): restrict to days with label y, then count matches
    y_days = [d for d in days if d[1] == y_label]
    return len([d for d in y_days if d[0] == ran_label]) / len(y_days)

def calc_woke_early_probability_given_y(woke_label, y_label, days):
    # P(woke_label | y): restrict to days with label y, then count matches
    y_days = [d for d in days if d[1] == y_label]
    return len([d for d in y_days if d[2] == woke_label]) / len(y_days)

# P(x_1, x_2): fraction of days that match both attributes of the new day
denominator = len([d for d in days if d[0] == new_day[0] and d[2] == new_day[1]]) / len(days)
prob_tired = (calc_y_probability("was tired", days) * calc_ran_probability_given_y(new_day[0], "was tired", days) * calc_woke_early_probability_given_y(new_day[1], "was tired", days)) / denominator
prob_not_tired = (calc_y_probability("was not tired", days) * calc_ran_probability_given_y(new_day[0], "was not tired", days) * calc_woke_early_probability_given_y(new_day[1], "was not tired", days)) / denominator
classification = "was tired"
if prob_not_tired > prob_tired:
    classification = "was not tired"
# The two values need not sum to 1: the numerator assumes independence, the denominator is counted from the data.
print("Final classification for new day: {0}. Tired probability: {1:.3f}. Not tired probability: {2:.3f}.".format(classification, prob_tired, prob_not_tired))
'''
Final classification for new day: was not tired.
Tired probability: 0.200.
Not tired probability: 0.667.
'''
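The same calculation extends to any number of attributes. The following is a minimal sketch rather than the original author's code: the names LABEL_INDEX, FEATURE_INDEXES, prior, conditional and score are made up for illustration. Each conditional is estimated as count(attribute and label) / count(label), and only the numerator of the formula is computed.

```python
# Each row: [ran?, tired? (the label), woke early?], as in the example above.
days = [["ran", "was tired", "woke up early"],
        ["ran", "was not tired", "didn't wake up early"],
        ["didn't run", "was tired", "woke up early"],
        ["ran", "was tired", "didn't wake up early"],
        ["didn't run", "was tired", "woke up early"],
        ["ran", "was not tired", "didn't wake up early"],
        ["ran", "was tired", "woke up early"]]

LABEL_INDEX = 1           # column holding the value to predict (illustrative name)
FEATURE_INDEXES = [0, 2]  # columns holding the attributes (illustrative name)

def prior(label, rows):
    """P(y): fraction of rows carrying this label."""
    return len([r for r in rows if r[LABEL_INDEX] == label]) / len(rows)

def conditional(index, value, label, rows):
    """P(x_i | y): among rows with this label, the fraction whose attribute matches."""
    with_label = [r for r in rows if r[LABEL_INDEX] == label]
    return len([r for r in with_label if r[index] == value]) / len(with_label)

def score(label, features, rows):
    """Numerator of the naive Bayes formula: P(y) times the product of P(x_i | y)."""
    result = prior(label, rows)
    for index, value in zip(FEATURE_INDEXES, features):
        result *= conditional(index, value, label, rows)
    return result

new_day = ["ran", "didn't wake up early"]
scores = {label: score(label, new_day, days) for label in ("was tired", "was not tired")}
best = max(scores, key=scores.get)
print(scores, best)
```

Adding a third attribute would only require a new column in each row and one more entry in FEATURE_INDEXES.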
Finding Word Counts
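- The formula above can be simplified slightly. When we compute the probability that a sample belongs to the positive or the negative class, the denominator is the probability of the sample itself, so it is identical in both expressions. Since we only need to compare the two class scores, we can skip the denominator entirely. In text classification, the feature values are usually word counts: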
'''
Sample review
[['plot : two teen couples go to a church party drink and then drive . they get into an accident . one of the guys dies but his girlfriend continues to see him in her life and has nightmares . what\'s the deal ? watch the movie and " sorta " find out . . . critique : a mind-fuck movie for the teen generation that touches on a very cool idea but presents it in a very bad package . which is what makes this review an even harder one to write since i generally applaud films which attempt',
'-1'],...
'''
from collections import Counter
import csv
import re

# Each row is [review_text, score], where score is the string "-1" or "1"
with open("train.csv", "r") as file:
    reviews = list(csv.reader(file))

def get_text(reviews, score):
    # Join the lowercased text of every review with the given score
    return " ".join([r[0].lower() for r in reviews if r[1] == str(score)])

def count_text(text):
    # Split on whitespace and count how often each word occurs
    words = re.split(r"\s+", text)
    return Counter(words)

negative_text = get_text(reviews, -1)
positive_text = get_text(reviews, 1)
negative_counts = count_text(negative_text)
positive_counts = count_text(positive_text)
print("Negative text sample: {0}".format(negative_text[:100]))
print("Positive text sample: {0}".format(positive_text[:100]))
'''
Negative text sample: plot : two teen couples go to a church party drink and then drive . they get into an accident . one
Positive text sample: films adapted from comic books have had plenty of success whether they're about superheroes ( batman
'''
- Counter takes a list of elements and returns how many times each one occurs, in a dictionary-like object.
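A small illustration of that behaviour, using a made-up word list rather than the review data:

```python
from collections import Counter

# A made-up sentence, split into words for illustration.
words = "a very cool idea in a very bad package".split()
counts = Counter(words)

print(counts["very"])              # 2
print(counts["missing"])           # 0: indexing an unseen key gives zero, not a KeyError
print(counts.get("missing"))       # None: .get() without a default behaves like a dict
print(counts.get("missing", 0))    # 0
```

Because .get() without a default returns None for unseen words, lookups that feed into arithmetic should pass 0 as the default or use indexing.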
Making Predictions
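- To classify a review, we compute a score for each class: $P(A \mid B) \propto P(B \mid A)\,P(A) = P(w_1, w_2, \dots \mid A)\,P(A) = P(w_1 \mid A)\,P(w_2 \mid A) \cdots P(A)$, where B is the review, A is the class, and the last step uses the independence assumption. $P(w_i \mid A)$ is the probability of word $w_i$ appearing in class A. To keep an unseen word from forcing $P(w_i \mid A)$ to 0, we apply Laplace smoothing: add 1 to the numerator and a constant to the denominator. The code below adds the number of reviews in the class (class_count); textbook Laplace smoothing adds the vocabulary size instead.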
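To see why the +1 matters, the sketch below compares the raw and the smoothed estimate for a word that never occurs in a class. The toy counts and the helper name smoothed_probability are made up for illustration, and the amount added to the denominator is left as a parameter because it depends on the smoothing variant chosen.

```python
from collections import Counter

def smoothed_probability(word, counts, denominator_increment):
    """Add-one estimate of P(word | class) from that class's word counts."""
    return (counts.get(word, 0) + 1) / (sum(counts.values()) + denominator_increment)

# Toy word counts for one class (made up for illustration).
toy_counts = Counter({"good": 3, "film": 5})
total = sum(toy_counts.values())  # 8 words in total

# Without smoothing, an unseen word has probability 0 and wipes out the whole product.
raw = toy_counts.get("awful", 0) / total

# With smoothing, the same word gets a small but non-zero probability: 1 / (8 + 2).
smoothed = smoothed_probability("awful", toy_counts, 2)

print(raw, smoothed)
```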
import re
from collections import Counter

def get_y_count(score):
    # Number of reviews with the given score (-1 or 1)
    return len([r for r in reviews if r[1] == str(score)])

positive_review_count = get_y_count(1)
negative_review_count = get_y_count(-1)

# Class priors P(A)
prob_positive = positive_review_count / len(reviews)
prob_negative = negative_review_count / len(reviews)

def make_class_prediction(text, counts, class_prob, class_count):
    prediction = 1
    text_counts = Counter(re.split(r"\s+", text))
    # The smoothing denominator is the same for every word, so compute it once
    smoothed_total = sum(counts.values()) + class_count
    for word in text_counts:
        # Occurrences of the word in the text times its smoothed frequency in the class
        prediction *= text_counts.get(word) * ((counts.get(word, 0) + 1) / smoothed_total)
    # Multiply by the class prior
    return prediction * class_prob

print("Review: {0}".format(reviews[0][0]))
print("Negative prediction: {0}".format(make_class_prediction(reviews[0][0], negative_counts, prob_negative, negative_review_count)))
print("Positive prediction: {0}".format(make_class_prediction(reviews[0][0], positive_counts, prob_positive, positive_review_count)))
'''
Review: plot : two teen couples go to a church party drink and then drive . they get into an accident . one of the guys dies but his girlfriend continues to see him in her life and has nightmares . what's the deal ? watch the movie and " sorta " find out . . . critique : a mind-fuck movie for the teen generation that touches on a very cool idea but presents it in a very bad package . which is what makes this review an even harder one to write since i generally applaud films which attempt
Negative prediction: 3.0050530362356515e-221
Positive prediction: 1.3071705466906787e-226
'''
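The scores printed above are on the order of 1e-221, close to the smallest values a float can hold, so a longer review could underflow to 0.0 for both classes. A common remedy is to add logarithms instead of multiplying probabilities. The sketch below is one way to do that and is not the code used in this post: log_class_score, the toy counts and vocabulary_size are illustrative names, it smooths with the vocabulary size, and it raises each word probability to the power of its count, so its scores are not comparable with the numbers above.

```python
import math
import re
from collections import Counter

def log_class_score(text, counts, class_prob, vocabulary_size):
    """log( P(class) * product over words of P(word | class) ** occurrences )."""
    total_words = sum(counts.values())
    score = math.log(class_prob)
    for word, occurrences in Counter(re.split(r"\s+", text)).items():
        # Add-one smoothing with the vocabulary size in the denominator
        word_prob = (counts.get(word, 0) + 1) / (total_words + vocabulary_size)
        score += occurrences * math.log(word_prob)
    return score

# Toy training text for each class (made up for illustration).
toy_negative = Counter("bad bad boring plot".split())
toy_positive = Counter("good fun good plot".split())
vocabulary_size = len(set(toy_negative) | set(toy_positive))

review = "bad plot"
neg = log_class_score(review, toy_negative, 0.5, vocabulary_size)
pos = log_class_score(review, toy_positive, 0.5, vocabulary_size)
print(neg, pos, "negative" if neg > pos else "positive")
```

Since the logarithm is monotonic, comparing log scores picks the same class as comparing the underlying products would.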
Predicting The Test Set
import csv

def make_decision(text, make_class_prediction):
    # Score the text under each class and return the label with the higher score
    negative_prediction = make_class_prediction(text, negative_counts, prob_negative, negative_review_count)
    positive_prediction = make_class_prediction(text, positive_counts, prob_positive, positive_review_count)
    if negative_prediction > positive_prediction:
        return -1
    return 1

with open("test.csv", "r") as file:
    test = list(csv.reader(file))

predictions = [make_decision(r[0], make_class_prediction) for r in test]
'''
predictions : list (<class 'list'>)
[-1,
-1,
-1,
1,
'''
Computing Error
actual = [int(r[1]) for r in test]
from sklearn import metrics
# Generate the ROC curve using scikit-learn.
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
# Measure the area under the curve. The closer to 1, the "better" the predictions.
print("AUC of the predictions: {0}".format(metrics.auc(fpr, tpr)))
'''
AUC of the predictions: 0.680701754385965
'''
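Because the predictions here are hard -1/+1 labels rather than probabilities, the ROC curve has a single interior point (FPR, TPR). Under the trapezoidal rule that metrics.auc applies, the area then reduces to (TPR + 1 - FPR) / 2. The sketch below checks that relationship by hand on made-up labels; the function rates and the toy lists are illustrative and are not the test set from this post.

```python
def rates(actual, predicted, positive=1):
    """True-positive and false-positive rates for hard (label-only) predictions."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    positives = sum(1 for a in actual if a == positive)
    negatives = len(actual) - positives
    return tp / positives, fp / negatives

# Made-up labels for illustration: four positive and four negative examples.
toy_actual = [1, 1, 1, -1, -1, -1, -1, 1]
toy_predicted = [1, -1, 1, -1, 1, -1, -1, 1]

tpr, fpr = rates(toy_actual, toy_predicted)

# Area under the curve through (0, 0), (fpr, tpr) and (1, 1).
auc = (tpr + 1 - fpr) / 2
print("TPR: {0}, FPR: {1}, AUC: {2}".format(tpr, fpr, auc))
```

Read this way, the AUC reported above is the average of the recall on positive reviews and the recall on negative reviews.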
A Faster Way To Predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in reviews])
test_features = vectorizer.transform([r[0] for r in test])
nb = MultinomialNB()
nb.fit(train_features, [int(r[1]) for r in reviews])
predictions = nb.predict(test_features)
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
print("Multinomial naive bayes AUC: {0}".format(metrics.auc(fpr, tpr)))
'''
Multinomial naive bayes AUC: 0.6509287925696594
'''