Overview
Today we begin a journey into natural language processing (NLP). NLP lets computers process, understand, and work with human language, acting as a bridge between machine language and human language.
Data Description
The dataset is a set of judicial records about domestic violence, split into 4 categories: the caller was beaten by their husband, by their wife, by their son, or by their daughter. We will apply what we learned in the previous posts to solve this NLP classification problem.
Word Frequency Counting
CountVectorizer
CountVectorizer is a text feature-extraction method. It counts how often each word appears in the training text, producing a term-frequency matrix.
Format:
vec = CountVectorizer(analyzer="word", max_features=4000)
Parameters:
analyzer
: analyze the corpus at the "word" or "char" level
max_features
: maximum vocabulary size (keep only the most frequent terms)
Methods:
fit()
: learn the vocabulary from the corpus
transform()
: return the term-frequency matrix
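As a quick sketch of the fit/transform workflow (the two-document corpus below is invented for illustration; in the real script the documents are the whitespace-joined jieba tokens produced during preprocessing):

```python
# Minimal CountVectorizer sketch: fit() learns the vocabulary,
# transform() returns the term-frequency matrix.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "报警 老公 民警 到场",
    "报警 老婆 民警 民警 到场",
]

vec = CountVectorizer(analyzer="word", max_features=4000)
vec.fit(corpus)                  # learn the vocabulary (5 distinct terms here)
matrix = vec.transform(corpus)   # sparse term-frequency matrix, shape (2, 5)
print(matrix.toarray())          # each row counts the terms of one document
```

Note that CountVectorizer's default tokenizer only keeps tokens of two or more characters, which matches the length filter applied during preprocessing below.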
Naive Bayes
MultinomialNB
MultinomialNB implements multinomial naive Bayes, a very common classification method.
Formula:
P(B|A) = P(B) * P(A|B) / P(A)
Example:
Suppose the probability of a traffic jam in Beijing in winter is 80%: P(B) = 0.8
Suppose the probability of snow in Beijing in winter is 10%: P(A) = 0.1
On a day with a traffic jam, the probability of snow is 10%: P(A|B) = 0.1
Then P(B|A) = P(B) * P(A|B) / P(A) = 0.8 * 0.1 / 0.1 = 0.8
In other words, if it snows on a winter day in Beijing, there is an 80% chance of a traffic jam that day.
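Checked in code, the example above is just this arithmetic:

```python
# The Beijing traffic-jam example as arithmetic (numbers from the text).
p_jam = 0.8             # P(B): probability of a traffic jam in winter
p_snow = 0.1            # P(A): probability of snow in winter
p_snow_given_jam = 0.1  # P(A|B): probability of snow on a day with a jam

# Bayes' theorem: P(B|A) = P(B) * P(A|B) / P(A)
p_jam_given_snow = p_jam * p_snow_given_jam / p_snow
print(round(p_jam_given_snow, 2))  # -> 0.8
```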
Implementation
Preprocessing
import jieba
import pandas as pd


def load_data():
    """Load the corpora and the stopword list.

    :return: husband, wife, son, daughter corpora plus stopwords
    """
    # Load stopwords
    stopwords = pd.read_csv("data/stopwords.txt", index_col=False, quoting=3,
                            sep="\t", names=["stopword"], encoding="utf-8")
    stopwords = stopwords["stopword"].values
    print(stopwords, len(stopwords))

    # Load the corpora (one CSV per category)
    laogong_df = pd.read_csv("data/beilaogongda.csv", encoding="utf-8", sep=",")
    laopo_df = pd.read_csv("data/beilaopoda.csv", encoding="utf-8", sep=",")
    erzi_df = pd.read_csv("data/beierzida.csv", encoding="utf-8", sep=",")
    nver_df = pd.read_csv("data/beinverda.csv", encoding="utf-8", sep=",")

    # Drop NaN rows
    laogong_df.dropna(inplace=True)
    laopo_df.dropna(inplace=True)
    erzi_df.dropna(inplace=True)
    nver_df.dropna(inplace=True)

    # Convert to plain lists
    laogong = laogong_df.segment.values.tolist()
    laopo = laopo_df.segment.values.tolist()
    erzi = erzi_df.segment.values.tolist()
    nver = nver_df.segment.values.tolist()

    # Debug output
    print(laogong[:5])
    print(laopo[:5])
    print(erzi[:5])
    print(nver[:5])
    return laogong, laopo, erzi, nver, stopwords


def pre_process_data(content_lines, category, stop_words):
    """Preprocess one corpus.

    :param content_lines: corpus lines
    :param category: class label
    :param stop_words: stopword list
    :return: list of (tokenized sentence, label) tuples
    """
    sentences = []
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            segs = [v for v in segs if not str(v).isdigit()]          # drop pure digits
            segs = list(filter(lambda x: x.strip(), segs))            # drop whitespace tokens
            segs = list(filter(lambda x: len(x) > 1, segs))           # drop single characters
            segs = list(filter(lambda x: x not in stop_words, segs))  # drop stopwords
            result = (" ".join(segs), category)                       # join tokens with spaces
            sentences.append(result)
        except Exception:
            # Print the offending line and keep going
            print(line)
            continue
    return sentences


def pre_process():
    """Main preprocessing entry point.

    :return: the preprocessed corpus (tokenized + labeled)
    """
    laogong, laopo, erzi, nver, stop_words = load_data()

    # Preprocess each corpus with its class label
    laogong = pre_process_data(laogong, 0, stop_words)
    laopo = pre_process_data(laopo, 1, stop_words)
    erzi = pre_process_data(erzi, 2, stop_words)
    nver = pre_process_data(nver, 3, stop_words)

    # Debug output
    print(laogong[:2])
    print(laopo[:2])
    print(erzi[:2])
    print(nver[:2])

    # Concatenate the four corpora
    return laogong + laopo + erzi + nver


if __name__ == "__main__":
    pre_process()
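The token-filtering chain inside pre_process_data can be seen in isolation on a hand-made token list (in the real script the tokens come from jieba.lcut):

```python
# The four filters from pre_process_data, run on invented tokens.
stop_words = {"请", "到场"}
segs = ["报警", "人", "被", "老公", "打", "120", " ", "请", "到场"]

segs = [v for v in segs if not str(v).isdigit()]          # drop pure digits -> "120" gone
segs = list(filter(lambda x: x.strip(), segs))            # drop whitespace tokens
segs = list(filter(lambda x: len(x) > 1, segs))           # drop single characters
segs = list(filter(lambda x: x not in stop_words, segs))  # drop stopwords

print(" ".join(segs))  # -> 报警 老公
```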
Main Program
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from pre_processing import pre_process


def main(sentences):
    """Train and evaluate the classifier."""
    # Instantiate the vectorizer
    vec = CountVectorizer(analyzer="word", max_features=4000)

    # Split the corpus into texts and labels
    x, y = zip(*sentences)

    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

    # Fit the bag-of-words model on the training texts
    vec.fit(X_train)
    print(vec.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0

    # Multinomial naive Bayes
    classifier = MultinomialNB()
    classifier.fit(vec.transform(X_train), y_train)

    # Predict
    y_predict = classifier.predict(vec.transform(X_test))
    # print(y_predict)
    # print(y_test)

    # Accuracy on the test set
    score = classifier.score(vec.transform(X_test), y_test)
    print(score)


if __name__ == "__main__":
    data = pre_process()
    main(data)
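The same fit/transform/predict flow can also be written with scikit-learn's Pipeline, which keeps the vectorizer and the classifier in sync so you never transform with a stale vocabulary. The four one-line training texts below are invented stand-ins for the real corpus:

```python
# Pipeline sketch of the vectorizer + naive Bayes flow (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["报警 老公 民警 到场", "报警 老婆 民警 到场",
         "报警 儿子 民警 到场", "报警 女儿 民警 到场"]
labels = [0, 1, 2, 3]

model = Pipeline([
    ("vec", CountVectorizer(analyzer="word", max_features=4000)),
    ("nb", MultinomialNB()),
])
model.fit(texts, labels)                     # vectorize + train in one call
print(model.predict(["报警 老公 民警 到场"]))  # -> [0], the "husband" class
```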
Output:

['!' '"' '#' ... '450' '22549' '22544'] 2627

First five raw lines of each corpus:

['报警人被老公打,请民警到场处理。', '看到上址女子被老公打 持刀 需要救护 (已通知120,如民警到场不需要,请致电120或110)请民警带好必要的防护设备,并且注意自身安全。', '报警人被老公打,醉酒持刀,(请民警携带必要个人防护装备到场处理,并注意自身安全。)', '报警人被老公打,对方人在,无需救护,请民警到场处理。', '报警人称 被老公打 1人伤 无需120 请民警到场处理。']
['报警人称被妻子打,未持械,人伤,无需120,妻子在场,请民警注意自身安全,请民警到场处理。', '家暴,称被其老婆打了,无持械,1人伤无需救护,请民警到场处理。', '报警人被老婆打,持械,无人伤,请民警到场处理,并注意自身安全。', '家庭纠纷报警人被老婆打 无需救护,请民警到场处理。', '闹离婚引发被老婆打,无持械,人无伤,请民警到场处理。']
['报警人被儿子打,无人伤,请民警到场处理。', '报警人称被儿子打 请民警到场处理。(内线:22649)', '报警人被儿子打 无人伤,无持械, 请民警携带必要防护装备并注意自身安全。', '报警人被儿子打,请民警到场处理', '报警人称被儿子打,人轻伤(一人伤),无持械。请民警携带必要的防护设备,并注意自身安全 请民警到场处理。']
['报警人称 被女儿打,1人伤 无需120,请民警到场处理。', '报警人被女儿打,因家庭纠纷,对方离开,请民警携带必要的防护设备,并注意自身安全。', '报警人被女儿打,无持械,请民警到场处理。', '报警人称被女儿打,无持械,人无事,请民警到场处理。请携带必要的防护装备,并请注意自身安全。', '报警人称其老婆被女儿打,无持械,人未伤,请民警到场处理。']

First two (tokens, label) pairs after preprocessing:

[('报警 老公 民警 到场', 0), ('上址 老公 持刀 救护 通知 民警 到场 致电 民警 防护 设备', 0)]
[('报警 人称 妻子 持械 人伤 无需 妻子 在场 民警 民警 到场', 1), ('家暴 老婆 持械 人伤 无需 救护 民警 到场', 1)]
[('报警 儿子 无人 民警 到场', 2), ('报警 人称 儿子 民警 到场', 2)]
[('报警 人称 女儿 人伤 无需 民警 到场', 3), ('报警 女儿 家庭 纠纷 离开 民警 携带 防护 设备', 3)]

Bag-of-words vocabulary (truncated):

['aa67c3', 'q5', '一人', '一人伤', '一名', '一拳', '一楼', '一辆', '丈夫', '上址', ... '验伤', '骨折', '黑色', '鼻子', '龙州', '龙舟']

C:\Users\Windows\Anaconda3\lib\site-packages\sklearn\feature_extraction\image.py:167: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
Loading model cost 0.922 seconds.
Prefix dict has been built successfully.
1.0

Process finished with exit code 0
The accuracy is essentially 100%, which is less surprising than it looks: these police reports are highly formulaic, so the four classes are almost perfectly separable by word frequencies. Mom no longer needs to worry about me being a victim of domestic violence!
That concludes this walkthrough of basic NLP operations in Python machine learning, applied to classifying domestic-violence reports. For more material on NLP, see the other related articles on 服务器之家.
Original article: https://blog.csdn.net/weixin_46274168/article/details/120230247?spm=1001.2014.3001.5501