Basic NLP Operations in Python Machine Learning: Classifying Domestic Violence Reports

Date: 2022-09-14 00:06:08

 

Overview

Starting today we begin a journey into natural language processing (NLP). NLP lets computers process, understand, and use human language, acting as a bridge between machine language and human language.


 

The Data

The dataset is a collection of judicial records of domestic-violence police reports, divided into 4 categories: the caller was beaten by their husband, by their wife, by their son, or by their daughter. Today we will apply what we learned in the previous posts to solve an NLP classification problem.


 

Counting Word Frequencies

CountVectorizer is a text feature-extraction method. It counts how often each token appears in the training text, producing a word-frequency (term-count) matrix.

Usage:

vec = CountVectorizer(
        analyzer="word",
        max_features=4000
    )

Parameters:

analyzer: analyze the corpus at the "word" or "char" level

max_features: the maximum vocabulary size (only the most frequent terms are kept)

Methods:

fit(): fit the vectorizer, learning the vocabulary from the corpus
transform(): return the word-frequency matrix
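To make the fit / transform workflow concrete, here is a minimal sketch on a two-document toy corpus (the corpus strings are invented for illustration; the real pipeline feeds in the jieba-segmented, space-joined reports shown later):

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each document is already space-separated, just like the jieba output below
corpus = [
    "报警 老公 民警 到场",
    "报警 老婆 持械 民警 到场",
]

vec = CountVectorizer(analyzer="word", max_features=4000)
vec.fit(corpus)                 # learn the vocabulary
matrix = vec.transform(corpus)  # sparse word-frequency matrix, one row per document
print(vec.get_feature_names())  # vocabulary terms (newer sklearn: get_feature_names_out())
print(matrix.toarray())         # dense view of the counts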

 

Naive Bayes

MultinomialNB is multinomial Naive Bayes, a very common classification method.

Formula:

 P(B|A) = P(B)*P(A|B)/P(A)

Example:

Suppose the probability of a traffic jam on a winter day in Beijing is 80%: P(B) = 0.8.
Suppose the probability of snow on a winter day in Beijing is 10%: P(A) = 0.1.
Suppose that, given a traffic jam, the probability that it is snowing is 10%: P(A|B) = 0.1.
Then P(B|A) = P(B) * P(A|B) / P(A) = 0.8 * 0.1 / 0.1 = 0.8.
In other words, if it snows on a winter day in Beijing, there is an 80% chance of a traffic jam that day.
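Plugging those numbers into the formula, as a quick sanity check:

# Bayes' theorem with the traffic-jam / snow numbers above
p_jam = 0.8              # P(B): traffic jam on a winter day in Beijing
p_snow = 0.1             # P(A): snow
p_snow_given_jam = 0.1   # P(A|B): snow, given that there is a traffic jam

p_jam_given_snow = p_jam * p_snow_given_jam / p_snow  # P(B|A)
print(round(p_jam_given_snow, 2))  # 0.8 -> an 80% chance of a jam on a snowy day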

 

Implementation


 

Preprocessing

import random
import jieba
import pandas as pd
def load_data():
    """
    Load the data and do the basic conversion.
    :return: husband, wife, son, daughter corpora (domestic-violence reports) plus stop words
    """
    # Load the stop words
    stopwords = pd.read_csv("data/stopwords.txt", index_col=False, quoting=3, sep="\t", names=["stopword"],
                            encoding="utf-8")
    stopwords = stopwords["stopword"].values
    print(stopwords, len(stopwords))
    # Load the corpora
    laogong_df = pd.read_csv("data/beilaogongda.csv", encoding="utf-8", sep=",")
    laopo_df = pd.read_csv("data/beilaopoda.csv", encoding="utf-8", sep=",")
    erzi_df = pd.read_csv("data/beierzida.csv", encoding="utf-8", sep=",")
    nver_df = pd.read_csv("data/beinverda.csv", encoding="utf-8", sep=",")
    # Drop NaN rows
    laogong_df.dropna(inplace=True)
    laopo_df.dropna(inplace=True)
    erzi_df.dropna(inplace=True)
    nver_df.dropna(inplace=True)
    # Convert to plain Python lists
    laogong = laogong_df.segment.values.tolist()
    laopo = laopo_df.segment.values.tolist()
    erzi = erzi_df.segment.values.tolist()
    nver = nver_df.segment.values.tolist()
    # Debug output
    print(laogong[:5])
    print(laopo[:5])
    print(erzi[:5])
    print(nver[:5])
    return laogong, laopo, erzi, nver, stopwords
def pre_process_data(content_lines, category, stop_words):
    """
    Preprocess the data.
    :param content_lines: corpus lines
    :param category: class label
    :param stop_words: stop words
    :return: preprocessed (text, label) pairs
    """
    # Result container
    sentences = []
    # Iterate over the lines
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            segs = [v for v in segs if not str(v).isdigit()]  # drop pure digits
            segs = list(filter(lambda x: x.strip(), segs))  # drop whitespace-only tokens
            segs = list(filter(lambda x: len(x) > 1, segs))  # drop single-character tokens
            segs = list(filter(lambda x: x not in stop_words, segs))  # drop stop words
            result = (" ".join(segs), category)  # join with spaces
            sentences.append(result)
        except Exception:
            # Print the offending line
            print(line)
            continue
    return sentences
def pre_process():
    """
    Main preprocessing routine.
    :return: the preprocessed corpus (segmented + labeled)
    """
    # Read the data
    laogong, laopo, erzi, nver, stop_words = load_data()
    # Preprocess each class
    laogong = pre_process_data(laogong, 0, stop_words)
    laopo = pre_process_data(laopo, 1, stop_words)
    erzi = pre_process_data(erzi, 2, stop_words)
    nver = pre_process_data(nver, 3, stop_words)
    # Debug output
    print(laogong[:2])
    print(laopo[:2])
    print(erzi[:2])
    print(nver[:2])
    # Concatenate
    result = laogong + laopo + erzi + nver
    return result
if __name__ == "__main__":
    pre_process()
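Before the filters above run, jieba.lcut is what splits a raw report string into a token list; a minimal sketch on one sentence from the corpus (the exact segmentation can vary slightly with the jieba dictionary version):

import jieba

sample = "报警人被老公打,请民警到场处理。"
print(jieba.lcut(sample))
# e.g. ['报警', '人', '被', '老公', '打', ',', '请', '民警', '到场', '处理', '。']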

 

Main Program

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from pre_peocessing import pre_process
def main(sentences):
    """Main routine."""
    # Instantiate the vectorizer
    vec = CountVectorizer(
        analyzer="word",
        max_features=4000
    )
    # Separate the texts and the labels
    x, y = zip(*sentences)
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
    # Build the bag-of-words model
    vec.fit(X_train)
    print(vec.get_feature_names())
    # Instantiate the Naive Bayes classifier
    classifier = MultinomialNB()
    classifier.fit(vec.transform(X_train), y_train)
    # Predict
    y_predict = classifier.predict(vec.transform(X_test))
    # print(y_predict)
    # print(y_test)
    # Compute the accuracy
    score = classifier.score(vec.transform(X_test), y_test)
    print(score)
if __name__ == "__main__":
    data = pre_process()
    main(data)

Output:

["!" """ "#" ... "450" "22549" "22544"] 2627
["报警人被老公打,请民警到场处理。", "看到上址女子被老公打  持刀 需要救护 (已通知120,如民警到场不需要,请致电120或110)请民警带好必要的防护设备,并且注意自身安全。", "报警人被老公打,醉酒持刀,(请民警携带必要个人防护装备到场处理,并注意自身安全。)", "报警人被老公打,对方人在,无需救护,请民警到场处理。", "报警人称 被老公打 1人伤 无需120 请民警到场处理。"]
["报警人称被妻子打,未持械,人伤,无需120,妻子在场,请民警注意自身安全,请民警到场处理。", "家暴,称被其老婆打了,无持械,1人伤无需救护,请民警到场处理。", "报警人被老婆打,持械,无人伤,请民警到场处理,并注意自身安全。", "家庭纠纷报警人被老婆打 无需救护,请民警到场处理。", "闹离婚引发被老婆打,无持械,人无伤,请民警到场处理。"]
["报警人被儿子打,无人伤,请民警到场处理。", "报警人称被儿子打 请民警到场处理。(内线:22649)", "报警人被儿子打 无人伤,无持械, 请民警携带必要防护装备并注意自身安全。", "报警人被儿子打,请民警到场处理", "报警人称被儿子打,人轻伤(一人伤),无持械。请民警携带必要的防护设备,并注意自身安全 请民警到场处理。"]
["报警人称 被女儿打,1人伤 无需120,请民警到场处理。", "报警人被女儿打,因家庭纠纷,对方离开,请民警携带必要的防护设备,并注意自身安全。", "报警人被女儿打,无持械,请民警到场处理。", "报警人称被女儿打,无持械,人无事,请民警到场处理。请携带必要的防护装备,并请注意自身安全。", "报警人称其老婆被女儿打,无持械,人未伤,请民警到场处理。"]
[("报警 老公 民警 到场", 0), ("上址 老公 持刀 救护 通知 民警 到场 致电 民警 防护 设备", 0)]
[("报警 人称 妻子 持械 人伤 无需 妻子 在场 民警 民警 到场", 1), ("家暴 老婆 持械 人伤 无需 救护 民警 到场", 1)]
[("报警 儿子 无人 民警 到场", 2), ("报警 人称 儿子 民警 到场", 2)]
[("报警 人称 女儿 人伤 无需 民警 到场", 3), ("报警 女儿 家庭 纠纷 离开 民警 携带 防护 设备", 3)]
["aa67c3", "q5", "一人", "一人伤", "一名", "一拳", "一楼", "一辆", "丈夫", "上址", "不上", "不住", "不倒翁", "不明", "不清", "不用", "不行", "不让", "不详", "不通", "不需", "东西", "中断", "中有", "中称", "丰路", "乒乓", "九亭", "九泾路", "争吵", "亚美尼亚人", "人代报", "人伤", "人借", "人头", "人手", "人无事", "人无伤", "人未伤", "人称", "人系", "代为", "代报", "休假", "伤及", "住户", "保安", "保温瓶", "做好", "催促", "催问", "儿子", "儿称", "充电器", "公交车站", "公分", "公路", "关在", "关机", "其称", "其近", "具体地址", "冲突", "几天", "凳子", "出血", "出轨", "分处", "*", "分已", "分所处", "分钟", "刚刚", "到场", "前妻", "剪刀", "割伤", "加拿大", "区划", "医治", "医院", "十一", "卧室", "卫生局", "去过", "又称", "反打", "反锁", "发生", "受伤", "变更", "口角", "口齿不清", "后往", "告知", "咬伤", "咱不需", "啤酒瓶", "喉咙", "喊救命", "喜泰路", "喝酒", "嘴唇", "回到", "回去", "回家", "回来", "在场", "在家", "地上", "地址", "坐在", "处置", "处警", "夏梦霭", "外伤", "外国人", "外面", "多岁", "大桥", "大理石", "大碍", "夫妻", "头上", "头伤", "头晕", "头痛", "头部", "奥迪", "女儿", "妇女", "妈妈", "妹妹", "妻子", "威胁", "婚外情", "婴儿", "媳妇", "孙女", "孤老", "定位", "家中", "家庭", "家庭成员", "家庭暴力", "家暴", "家门", "对峙", "对象", "将门", "小区", "小姑", "小孩", "尾号", "居委", "居委会", "居民", "岳母", "工作", "工作人员", "工具", "工号", "已处", "市场", "并称", "座机", "开门", "异地", "弄口", "引发", "弟弟", "当事人", "得知", "必备", "怀孕", "急救车", "情况", "情况不明", "情绪", "情节", "成功", "手上", "手持", "手指", "手机", "手机号", "手痛", "手部", "手里", "打人", "打伤", "打倒", "打其", "打打", "打架", "打死", "打电话", "打破", "打耳光", "打请", "扫帚", "抓伤", "报称", "报警", "担子", "拖鞋", "拦不住", "拿出", "拿到", "拿尺", "拿长", "拿鞋", "持刀", "持械", "持续", "持饭", "掐着", "接电话", "控制", "措施", "携带", "放下", "放到", "救命", "救护", "救护车", "无人", "无碍", "无需", "早上", "昨天", "暂时", "有伤", "有刀", "木棍", "杀人", "村里", "来电", "杯子", "松江", "桌上", "梅陇", "棍子", "棒头", "椅子", "楼上", "榔头", "此警", "武器", "武器装备", "残疾人", "母亲", "母子", "毛巾", "民警", "水壶", "水果刀", "求助", "沈杨", "没事", "沪牌", "注意安全", "活动室", "派出所", "流血", "浦东", "激动", "烟缸", "烧纸", "照片", "爬不起来", "父亲", "父女", "牙齿", "物业公司", "物品", "玩具", "现人", "现刀", "现场", "现称", "现要", "现跑", "玻璃", "瓶子", "用具", "用脚", "电告", "电线", "电视机", "电话", "疑似", "疾病", "白色", "皮带", "盒子", "相关", "看不见", "眼睛", "矛盾", "砍刀", "砸坏", "确认", "离去", "离婚", "离开", "离异", "称其", "称属", "称有", "称现", "称要", "稍后", "窗口", "竹棍", "等候", "等同", "筷子", "精神", "精神病", "纠纷", "经济纠纷", "翟路", "翟路纪", "老人", "老伯伯", "老伴", "老公", "老北", "老大爷", "老太", "老太太", "老头", "老婆", "老年", "联系电话", "肋骨", "肯德基", "脖子", "脸盆", "自动", "自残", "自称", "自行", "致电", "英语翻译", "菜刀", "虐待", "螺丝刀", "衣服", "衣架", "补充", "装备", "装机", "西门", "解释", "警卫室", "警察", "设备", "询问", "该户", "赌博", "走后", "赶出来", "起因", "路人报", "路边", "路近", "身上", "转接", "轻伤", "轻微伤", "轿车", "辛耕路", "过场", "过警", "过顾", "还称", "进屋", "追其", "逃出来", "逃逸", "通知", "通话", "邻居", "酒瓶", "醉酒", "钥桥", "铁棍", "铁质", "锁事", "锅铲", "锤子", "门卫", "门卫室", "门口", "门外", "门岗", "闵行", "防护", "阿姨", "陈路", "隔壁", "鞋子", "韩国", "项链", "验伤", "骨折", "黑色", "鼻子", "龙州", "龙舟"]
C:\Users\Windows\Anaconda3\lib\site-packages\sklearn\feature_extraction\image.py:167: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.int):
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
Loading model cost 0.922 seconds.
Prefix dict has been built successfully.
1.0
Process finished with exit code 0

The accuracy is essentially 100%. Mom no longer needs to worry about me being a victim of domestic violence!
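Once vec and classifier are fitted, classifying a brand-new report is just another transform + predict. A minimal sketch, assuming main() above is modified to end with `return vec, classifier` (that return value and the sample text are my additions, not part of the original script):

from pre_peocessing import pre_process

# Hypothetical usage: assumes main() was changed to `return vec, classifier`
data = pre_process()
vec, classifier = main(data)

# A new report, segmented with jieba and space-joined exactly like the training data
new_report = "报警 人称 女儿 持械 民警 到场"
print(classifier.predict(vec.transform([new_report]))[0])  # 0=老公, 1=老婆, 2=儿子, 3=女儿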

That concludes this walkthrough of basic NLP operations in Python machine learning for classifying domestic violence reports. For more material on NLP, see the other related articles on 服务器之家!

Original article: https://blog.csdn.net/weixin_46274168/article/details/120230247?spm=1001.2014.3001.5501