第0节:导论
欢迎来到“用Python进行贝叶斯模型建模”——用 Python 学习贝叶斯统计的教程,你可以在 项目首页(https://github.com/markdregan/Hangout-with-PyMC3) 中找到本教程的目录。
统计学是我大学时期从未认同过的一门学科。我们被传授的频率论方法(如 p 值等)看起来很牵强。最终,我对统计学失去兴趣并放弃了它。
直到我偶然发现贝叶斯统计——统计学的一个分支,与大多数大学教授的传统频率论统计有很大不同。许多出版物、博客和视频给了我指引,我强烈推荐这些作为贝叶斯统计的入门教材。它们包括:
Doing Bayesian Data Analysis by John Kruschke
Python port of John Kruschke’s examples by Osvaldo Martin
John Kruschke 的例子中的 Python 接口
黑客的贝叶斯方法 在我学习贝叶斯统计中给了我很多启示。为了显示这种影响,我采用了和 BMH 相同的视觉风格。
While My MCMC Gently Samples blog by Thomas Wiecki
Healthy Algorithms blog by Abraham Flaxman
Scipy Tutorial 2014 by Chris Fonnesbeck
希望这篇教程对别人有用,能够帮助他们学习贝叶斯方法,就像上面那些资源帮助过我一样。欢迎来自社区的联系/评论/贡献。
载入你的 Google Hangout 聊天数据
在这篇教程中,我们将使用我的 Google Hangout 聊天数据作为数据集。我已经删除了消息内容并且对我朋友的名字采用了化名,数据集的剩下部分都是未改变的。
如果你想使用自己的Hangout聊天数据同时按照这篇教程学习,你可以从 Google Takeout 下载自己的 Google Hangout 数据。Hangout 数据的 JSON 格式是可以下载的。下载完成后,替换掉数据文件夹中的 hangouts.json 文件。
json 文件的结构深度嵌套,包含大量冗余信息。
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn.apionly as sns
from datetime import datetime
%matplotlib inline
plt.style.use('bmh')
colors = ['#348ABD', '#A60628', '#7A68A6', '#467821', '#D55E00',
'#CC79A7', '#56B4E9', '#009E73', '#F0E442', '#0072B2']
下面的代码用于载入 json 数据,把每一条信息解析成 pandas DataFrame 的一行。
注意:data/ 路径下缺少 hangouts.json 文件,你必须按照上述方法下载并添加自己的 JSON 文件。或者,你可以跳过进入下一节,在下一节,我会载入匿名处理后的数据集。
# Import json data
with open('data/hangouts.json') as json_file:
json_data = json.load(json_file)
# Generate map from gaia_id to real name
def user_name_mapping(data):
user_map = {'gaia_id': ''}
for state in data['conversation_state']:
participants = state['conversation_state']['conversation']['participant_data']
for participant in participants:
if 'fallback_name' in participant:
user_map[participant['id']['gaia_id']] = participant['fallback_name']
return user_map
user_dict = user_name_mapping(json_data)
# Parse data into flat list
def fetch_messages(data):
messages = []
for state in data['conversation_state']:
conversation_state = state['conversation_state']
conversation = conversation_state['conversation']
conversation_id = conversation_state['conversation']['id']['id']
participants = conversation['participant_data']
all_participants = []
for participant in participants:
if 'fallback_name' in participant:
user = participant['fallback_name']
else:
# Scope to call G+ API to get name
user = participant['id']['gaia_id']
all_participants.append(user)
num_participants = len(all_participants)
for event in conversation_state['event']:
try:
sender = user_dict[event['sender_id']['gaia_id']]
except:
sender = event['sender_id']['gaia_id']
timestamp = datetime.fromtimestamp(float(float(event['timestamp'])/10**6.))
event_id = event['event_id']
if 'chat_message' in event:
content = event['chat_message']['message_content']
if 'segment' in content:
segments = content['segment']
for segment in segments:
if 'text' in segment:
message = segment['text']
message_length = len(message)
message_type = segment['type']
if len(message) > 0:
messages.append((conversation_id,
event_id,
timestamp,
sender,
message,
message_length,
all_participants,
', '.join(all_participants),
num_participants,
message_type))
messages.sort(key=lambda x: x[0])
return messages
# Parse data into data frame
cols = ['conversation_id', 'event_id', 'timestamp', 'sender',
'message', 'message_length', 'participants', 'participants_str',
'num_participants', 'message_type']
messages = pd.DataFrame(fetch_messages(json_data), columns=cols).sort(['conversation_id', 'timestamp'])
# Engineer features
messages['prev_timestamp'] = messages.groupby(['conversation_id'])['timestamp'].shift(1)
messages['prev_sender'] = messages.groupby(['conversation_id'])['sender'].shift(1)
# Exclude messages are are replies to oneself (not first reply)
messages = messages[messages['sender'] != messages['prev_sender']]
# Time delay
messages['time_delay_seconds'] = (messages['timestamp'] - messages['prev_timestamp']).astype('timedelta64[s]')
messages = messages[messages['time_delay_seconds'].notnull()]
messages['time_delay_mins'] = np.ceil(messages['time_delay_seconds'].astype(int)/60.0)
# Time attributes
messages['day_of_week'] = messages['timestamp'].apply(lambda x: x.dayofweek)
messages['year_month'] = messages['timestamp'].apply(lambda x: x.strftime("%Y-%m"))
messages['is_weekend'] = messages['day_of_week'].isin([5,6]).apply(lambda x: 1 if x == True else 0)
# Limit to messages sent by me and exclude all messages between me and Alison
messages = messages[(messages['sender'] == 'Mark Regan') & (messages['participants_str'] != 'Alison Darcy, Mark Regan')]
# Remove messages not responded within 60 seconds
# This introduces an issue by right censoring the data (might return to address)
messages = messages[messages['time_delay_seconds'] < 60]
messages.head(1)
此处原本有一个宽度超大、列数很多的表格,这里放不下。请到英文原文查看。
现在,我们有了数据模型,使用更加方便。上面的表格显示了 pandas DataFrame 的一行。我对我多久才回复消息感兴趣,我们通过一些图来描述我的一些有代表性的回复时间。
fig = plt.figure(figsize=(10,7))
ax = fig.add_subplot(211)
order = np.sort(messages['year_month'].unique())
sns.boxplot(x=messages['year_month'], y=messages['time_delay_seconds'], order=order, orient="v", color=colors[5],linewidth=1, ax=ax)
_ = ax.set_title('Response time distribution by month')
_ = ax.set_xlabel('Month-Year')
_ = ax.set_ylabel('Response time')
_ = plt.xticks(rotation=30)
ax = fig.add_subplot(212)
plt.hist(messages['time_delay_seconds'].values, range=[0, 60], bins=60, histtype='stepfilled', color=colors[0])
_ = ax.set_title('Response time distribution')
_ = ax.set_xlabel('Response time (seconds)')
_ = ax.set_ylabel('Number of messages')
plt.tight_layout()