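1. Choosing the feature vector
Looking at the parts that make up an email, we can identify five: sender, recipient, send time, subject, and body.
Which of these five features help decide whether an email is spam? To answer that, we compare how spam and normal mail are distributed on each feature. If the two classes are distributed in much the same way on a feature, that feature does little to separate spam from normal mail and can be dropped.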
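As a rough illustration of this check, the sketch below compares how often each value of one feature occurs in spam versus normal mail. The toy records, the `has_date` feature, and the helper name `value_distribution` are hypothetical and not taken from the dataset.

```python
from collections import Counter, defaultdict

def value_distribution(records, feature):
    """Return {label: {value: share}} for one feature, computed per label."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["label"]][rec[feature]] += 1
    dist = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        dist[label] = {value: n / total for value, n in counter.items()}
    return dist

# Hypothetical toy records: label 1 = spam, label 0 = ham.
records = [
    {"label": 1, "has_date": 0},
    {"label": 1, "has_date": 0},
    {"label": 1, "has_date": 1},
    {"label": 0, "has_date": 1},
    {"label": 0, "has_date": 1},
]
dist = value_distribution(records, "has_date")
print(dist)
```

If the shares of each value are close in both classes, the feature probably carries little signal; a large gap suggests it is worth keeping.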
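The following shows how to extract the key information from each email and combine it into one dataset.
Step 1: Convert the label values from the English words spam and ham to 1 and 0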
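import pandas as pd

def read_label2():
    convert_dict = {"spam": 1, "ham": 0}
    # Read the index file with pandas. header=None means the file has no header row,
    # so each column is given a name explicitly.
    df = pd.read_csv("G:/data/trec06c/trec06c/full/index", sep=" ", header=None, names=["label", "path"])
    labelList = df["label"].to_list()  # convert the Series to a list
    print(type(labelList))
    pathList = df["path"].to_list()
    length = len(labelList)  # number of entries in the list
    label_dict = {}  # initialize the dictionary
    for i in range(0, length, 1):
        # Key: the last 8 characters of the path; value: 1 for spam, 0 for ham
        label_dict[pathList[i][-8:]] = convert_dict[labelList[i]]
    return label_dict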
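The same mapping can be sketched in plain Python, which makes the key construction easier to see. The sample lines below are made up to mimic the `label path` layout that the code above reads, and `build_label_dict` is a hypothetical helper name.

```python
def build_label_dict(lines):
    """Map the last 8 characters of each path to 1 (spam) or 0 (ham)."""
    convert_dict = {"spam": 1, "ham": 0}
    label_dict = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        label, path = parts
        label_dict[path[-8:]] = convert_dict[label]
    return label_dict

# Made-up sample lines in the assumed "label path" layout.
sample_lines = ["spam ../data/000/000", "ham ../data/000/001"]
labels = build_label_dict(sample_lines)
print(labels)
```

Taking the last 8 characters only works while every directory and file name is exactly three characters long, so the slice is worth revisiting if the layout differs.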
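Step 2: Combine the email contents
The key information of one email is combined into one record; doing this for every email yields a dataset.
The code below stores all emails in a list of dicts, one dict per email.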
import os

def textConvertToDic(pathroot):
    dirNames = os.listdir(pathroot)
    result = []
    index_dict = read_label2()
    for dirName in dirNames:
        pathlevel2 = pathroot + "/" + dirName
        emailDir = os.listdir(pathlevel2)
        for emailName in emailDir:
            emailpath = pathlevel2 + "/" + emailName
            email_dict = {}
            isContent = False
            content = ""
            # Open one email file; the with block closes it when done
            with open(emailpath, encoding="gb2312", errors="ignore") as email_file:
                # Collect the sender, recipient, date, and body into a dict
                for line in email_file:
                    # Strip leading and trailing whitespace from each line
                    line = line.strip()
                    if line.startswith("From"):
                        email_dict["From"] = line[5:]
                    if line.startswith("To"):
                        email_dict["To"] = line[3:]
                    if line.startswith("Date"):
                        email_dict["Date"] = line[5:]
                    # The first empty line separates the headers from the body
                    if len(line) == 0:
                        isContent = True
                    if isContent:
                        content = content + line
            email_dict["Content"] = content
            key = "/" + dirName + "/" + emailName
            label = index_dict.get(key)
            email_dict["Label"] = label
            result.append(email_dict)
    return result
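As an alternative sketch, Python's standard `email` module can split the headers from the body instead of scanning line prefixes by hand. This is a swapped-in technique, not the approach used above, and the raw message is a made-up example.

```python
from email import message_from_string

# Made-up raw message: headers, one blank line, then the body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Date: Tue, 30 Aug 2005 10:08:15 +0800\n"
    "Subject: hello\n"
    "\n"
    "First line of the body.\n"
    "Second line.\n"
)

msg = message_from_string(raw)
email_dict = {
    "From": msg.get("From", "unknown"),
    "To": msg.get("To", "unknown"),
    "Date": msg.get("Date", "unknown"),
    # For a simple, non-multipart message get_payload() returns the body text.
    "Content": msg.get_payload(),
}
print(email_dict)
```

Header lookup this way is case-insensitive and is not confused by body lines that happen to start with "From" or "To".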
The code below saves the preliminarily processed data to disk.
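def dictConvertToline(pathroot):
    emailList = textConvertToDic(pathroot)
    result = []
    # Join each dict into one comma-separated line, ready to be written out
    for emailDict in emailList:
        # Commas inside a field are removed because the comma is the field separator
        fromLabel = emailDict.get("From", "noknow").replace(",", "").strip() + ","
        toLabel = emailDict.get("To", "noknow").replace(",", "").strip() + ","
        dateLabel = emailDict.get("Date", "noknow").replace(",", "").strip() + ","
        contentLabel = emailDict.get("Content", "noknow").replace(",", "").strip() + ","
        # The label is an int (or None), so convert it to str before concatenating
        label = str(emailDict.get("Label", "noknow"))
        lineText = fromLabel + toLabel + dateLabel + contentLabel + label
        result.append(lineText)
    return result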
# Write the processed data to a file
pathroot = "G:/data/trec06c/trec06c/data"
data = dictConvertToline(pathroot)
targetText = "G:/data/trec06c/trec06c/allData3"
with open(targetText, "w", encoding="UTF-8", errors="ignore") as writer:
    for text in data:
        writer.write(text + "\n")
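Stripping commas from every field discards part of the text. One possible alternative, sketched here with the standard `csv` module, is to let the writer quote any field that contains a comma. The records are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hypothetical records shaped like the dicts built above.
email_list = [
    {"From": "a@example.com", "To": "b@example.com",
     "Date": "Tue, 30 Aug 2005 10:08:15 +0800",
     "Content": "hello, world", "Label": 1},
    {"From": "c@example.com", "Content": "no recipient", "Label": 0},
]
fields = ["From", "To", "Date", "Content", "Label"]

buffer = io.StringIO()  # stands in for open(target_path, "w", newline="")
# restval fills in fields that are missing from a record.
writer = csv.DictWriter(buffer, fieldnames=fields, restval="noknow")
for email_dict in email_list:
    writer.writerow(email_dict)

# Reading the text back recovers the commas inside quoted fields.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

A file written this way can later be loaded with `pd.read_csv` without the commas in dates or bodies shifting the columns.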
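The following describes how each feature is processed.
1. Sender and recipient: use a regular expression to extract the part that looks like an email address; if nothing matches, set the value to unknown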
import re

def getEmailFromAddress(strl):
    rule = r'[0-9a-zA-Z_]{0,19}@[0-9a-zA-Z_]{0,19}.*[0-9a-zA-Z_]'
    # findall returns every non-overlapping match as a list of strings
    it = re.findall(rule, str(strl))
    if len(it) > 0:
        result = it[0]
    else:
        result = "unknown"
    return result
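The pattern above ends in a greedy `.*`, so it can run past the address when other text follows on the same line. A tighter variant might look like the sketch below. The pattern and the name `extract_address` are illustrative assumptions and do not cover every valid address.

```python
import re

# Local part, "@", then two or more dot-separated domain labels.
ADDRESS_RE = re.compile(r"[0-9A-Za-z_.+-]+@[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)+")

def extract_address(text):
    """Return the first address-like substring, or "unknown" if none is found."""
    match = ADDRESS_RE.search(str(text))
    return match.group(0) if match else "unknown"

print(extract_address('"Tom" <tom@example.com> (work)'))
print(extract_address("no address here"))
```

Because the match stops at the first character that cannot belong to a domain label, trailing text such as a display name or comment is left out.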
2. Extracting the send time is more involved. The goal is to get the day of the week and the hour at which the email was sent
import re

def getCleaningDate(date):
    if not isinstance(date, str):
        date = str(date)
    length = len(date)
    if length < 16:
        # Too short to contain a usable date
        week = "unknown"
        hour = "unknown"
        timeQuantum = "unknown"
    elif length == 16:
        rex = r"(\d{2}):\d{2}"
        it = re.findall(rex, date)
        if len(it) >= 1:
            hour = it[0]
        else:
            hour = "unknown"
        week = "Fri"
        timeQuantum = "0"
    elif length == 19:
        # Hard-coded values for date strings of this length
        week = "Sep"
        hour = "01"
        timeQuantum = "3"
    elif length == 21:
        # Hard-coded values for date strings of this length
        week = "Wed"
        hour = "17"
        timeQuantum = "1"
    else:
        rex = r"([A-Za-z]+\d?[A-Za-z]*).*?(\d{2}):\d{2}:\d{2}.*"
        it = re.findall(rex, date)
        if len(it) == 1 and len(it[0]) == 2:
            week = it[0][0][-3:]
            hour = it[0][1]
            intHour = int(hour)
            # elif keeps an earlier, narrower bucket from being overwritten
            if intHour < 8:
                timeQuantum = "3"
            elif intHour < 13:
                timeQuantum = "0"
            elif intHour < 19:
                timeQuantum = "1"
            else:
                timeQuantum = "2"
        else:
            week = "unknown"
            hour = "unknown"
            timeQuantum = "unknown"
    # Normalize the three values; lower() is an assumption, since the method
    # called on these three lines is missing from the source
    week = week.lower()
    hour = hour.lower()
    timeQuantum = timeQuantum.lower()
    return (week, hour, timeQuantum)
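When the Date header is well formed, the weekday and hour can also be read with the standard library rather than by string length. The sketch below assumes RFC 2822 style dates and reuses the time-slot codes that the function above appears to intend (before 8: "3", 8 to 12: "0", 13 to 18: "1", otherwise "2"). The names `parse_date_features` and `time_quantum` are hypothetical.

```python
from email.utils import parsedate_to_datetime

WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def time_quantum(hour):
    """Bucket an hour (0-23) into the slot codes "3", "0", "1", "2"."""
    if hour < 8:
        return "3"
    elif hour < 13:
        return "0"
    elif hour < 19:
        return "1"
    return "2"

def parse_date_features(date_text):
    """Return (week, hour, timeQuantum), or "unknown" values if parsing fails."""
    try:
        dt = parsedate_to_datetime(str(date_text))
    except (TypeError, ValueError):
        # Unparseable input raises TypeError or ValueError depending on the Python version.
        return ("unknown", "unknown", "unknown")
    return (WEEKDAYS[dt.weekday()], f"{dt.hour:02d}", time_quantum(dt.hour))

print(parse_date_features("Tue, 30 Aug 2005 10:08:15 +0800"))
print(parse_date_features("not a date"))
```

The weekday here is computed from the calendar date rather than copied from the header text, so a mismatched day name in the header does not propagate into the feature.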
3. Email body: use jieba to segment the text into words
import jieba
df["cutContent"] = list(map(lambda text: " ".join(jieba.cut(text)), df["content"]))
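Once the features are processed, how do we know whether a feature actually helps separate spam from normal mail? The following statistics functions help with that judgment.
df["fromAddress"].value_counts()  # count the rows for each distinct value of fromAddress (the values could also be mapped to numbers and judged by their variance)
View the counts after grouping by several fields:
print(df[["species","population"]].groupby(["species","population"])["population"].count())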
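Below are some additional notes from working with pandas.
Common questions about using the pandas library:
1. What type is the object that pandas reads in, and what type is each feature column?
df = pd.read_csv(path, sep=',', header=None, names=["from","to","date","content","label"])
df -> <class 'pandas.core.frame.DataFrame'>
df.fromAddress -> <class 'pandas.core.series.Series'>
df["fromAddress"] -> <class 'pandas.core.series.Series'>
df["species"].value_counts() -> <class 'pandas.core.series.Series'>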
2. How to create a DataFrame
df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],
                   'population': [1864, 22000, 80000]},
                  index=['panda', 'polar', 'koala'])
3. How to convert a Series to a DataFrame
df["fromAddress"].value_counts().to_frame()
4. How to process the values of a Series element by element
df["to_address"] = list(map(lambda s: extract_email_server_address(s), df["to"]))
5. Adding a new column based on the values of existing columns
df["has_date"] = df.apply(lambda c: 0 if c['A'] == "4" else 1, axis=1)
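6. A few tips for inspecting data
Select rows: df[0:2] returns the first and second rows
Select a column: df.fromAddress or df["fromAddress"]
Check the size of a DataFrame: df.shape -> (3, 2); the first value is the number of rows, the second the number of columns
First 5 values of a Series: df["species"].value_counts().head(5)
Distinct values of a Series: df["fromAddress"].unique()
Show the entries whose count is below 10:
a = df["fromAddress"].value_counts().to_frame()
print(a[a.iloc[:, 0] < 10])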