Getting Started with wordcloud: Word Clouds in Python 3

Date: 2022-11-28 06:21:27

Python version: Python 3.x
Environment: macOS
IDE: PyCharm

I. Introduction to wordcloud

wordcloud is a Python library for generating word clouds. It is easy to use and offers many configuration options.
GitHub: https://github.com/amueller/word_cloud
Documentation: https://amueller.github.io/word_cloud/

II. Using wordcloud

1 The parameters of WordCloud

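def __init__(self,
    font_path=None,  # font file path, default None; optional for English clouds, needed for Chinese text
    width=400, height=200, margin=2,  # canvas width (default 400), height (default 200), margin around words (default 2)
    ranks_only=None,  # deprecated; kept for backward compatibility, has no effect
    prefer_horizontal=.9,  # ratio of horizontal to vertical placement attempts, default 0.9; if < 1, a word that does not fit is rotated
    mask=None,  # nd-array, default None; if set, width and height are ignored and the mask shape is used; white entries (#FF or #FFFFFF) are masked out, the rest is drawable
    scale=1,  # scaling between computation and drawing, default 1
    color_func=None,  # callable returning a color per word, default None; overrides colormap
    max_words=200,  # maximum number of words in the cloud, default 200
    min_font_size=4,  # smallest font size used, default 4
    stopwords=None,  # set of strings, default None; if None, the built-in STOPWORDS set is used
    random_state=None,  # seed for the random number generator, default None; also passed to color_func
    background_color='black',  # background color, default 'black'
    max_font_size=None,  # largest font size, default None; if None, the image height is used
    font_step=1,  # step size for the font, default 1; values > 1 may speed things up but give a worse fit
    mode="RGB",  # default "RGB"; with mode="RGBA" and background_color=None the background is transparent
    relative_scaling=.5,  # importance of relative word frequency for font size, default 0.5
    regexp=None,  # regular expression used by process_text to split the text, default None; if None, r"\w[\w']+" is used
    collocations=True,  # whether to include collocations (bigrams) of two words, default True
    colormap=None,  # string or matplotlib colormap, default 'viridis'; ignored if color_func is given
    normalize_plurals=True  # whether to strip the trailing 's' from English plurals, default True
)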

The parameters work as follows:

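  • font_path: path to the font file, default None. It can be left unset for English word clouds; to display Chinese, point it to a font that contains Chinese glyphs.
  • width, height, margin: canvas width (default 400), canvas height (default 200), and margin around words (default 2).
  • ranks_only: deprecated; kept only for backward compatibility and has no effect.
  • prefer_horizontal: ratio of attempts to place a word horizontally rather than vertically, default 0.9. When prefer_horizontal < 1, a word that does not fit in the current space is rotated and tried again.
  • mask: background image as an nd-array, default None. When mask is not None, width and height are ignored and the shape of the mask is used instead. White entries (#FF or #FFFFFF) are masked out; every other part of the mask is an area where words may be drawn.
  • scale: scaling between computation and drawing, default 1. For large images, raising scale is much faster than enlarging the canvas, but the fit of the words may be coarser.
  • color_func: callable that returns a color for each word, default None. When given, it overrides colormap.
  • max_words: maximum number of words in the cloud, default 200.
  • min_font_size: smallest font size used, default 4.
  • stopwords: set of strings to exclude, default None. When None, the built-in STOPWORDS set of the wordcloud library is used.
  • random_state: seed for the random number generator, default None. The resulting generator is also passed to color_func.
  • background_color: background color of the cloud, default 'black'.
  • max_font_size: largest font size used, default None. When None, the height of the image is used.
  • font_step: step size for the font, default 1. A value greater than 1 may speed up computation but can give a worse fit.
  • mode: default "RGB". A transparent background is generated when mode is "RGBA" and background_color is None.
  • relative_scaling: importance of relative word frequency for font size, default 0.5. With 0, only the rank of a word matters; with 1, a word that is twice as frequent is drawn twice as large.
  • regexp: regular expression, default None. process_text uses it to split the input text into tokens; when None, r"\w[\w']+" is used.
  • collocations: whether to include collocations (bigrams) of two words, default True.
  • colormap: a string or a matplotlib colormap, default 'viridis'. Ignored when color_func is specified.
  • normalize_plurals: whether to strip the trailing 's' from English plurals, default True. When a word occurs both with and without the trailing 's', the counts are merged into the form without it.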

These notes reflect my own understanding, and I have not fully grasped every parameter. Corrections are welcome.
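
The color_func entry above is easier to picture with a concrete callable. The following is a minimal sketch, not taken from the original post: grey_color_func is a hypothetical name, and its argument list follows the keyword arguments that wordcloud passes to a color_func (word, font_size, position, orientation, random_state, font_path). It returns a color string that PIL can parse.

```python
import random


def grey_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    """Return a random shade of grey as a PIL color string.

    Hypothetical helper for illustration. random_state is expected to be
    a random.Random instance when provided; the module-level generator
    is used as a fallback.
    """
    rng = random_state if random_state is not None else random
    lightness = rng.randint(30, 70)  # percent, inclusive on both ends
    return "hsl(0, 0%, {}%)".format(lightness)


# Calling it directly, outside of wordcloud, just to see the shape of the result
print(grey_color_func("example", 20, (0, 0), None, random_state=random.Random(42)))
```

Such a function could then be passed as WordCloud(color_func=grey_color_func) or applied afterwards with wc.recolor(color_func=grey_color_func). Because color_func overrides colormap, there is no need to set both.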

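The most commonly used parameters are:

font_path            # font file
background_color     # background color
max_words            # maximum number of words shown in the cloud
mask                 # background image
max_font_size        # largest font size
random_state         # seed for the random number generator
width,height,margin  # width, height, margin around words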

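There are two main methods for generating a word cloud:

generate(str) # splits the text into words, counts word frequencies, then builds the cloud; it handles Chinese poorly because Chinese text has no spaces between words
generate_from_frequencies(dict) # builds the cloud from a word-to-frequency dict; recommended for Chinese text that you have segmented yourself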
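
To see why generate struggles with Chinese, it helps to look at the default pattern r"\w[\w']+" on its own. The sketch below reproduces only that regular-expression step with the standard library (the real generate also removes stopwords, merges plurals, and so on), then shows how a frequency dict might be built from pre-segmented tokens. The helper names tokenize, word_frequencies, and cloud_from_tokens are hypothetical, and the segmented token list is written by hand; in practice it would come from a Chinese segmentation tool, which this post does not cover.

```python
import re
from collections import Counter

# Default pattern quoted in the parameter list (used when regexp is None)
DEFAULT_REGEXP = r"\w[\w']+"


def tokenize(text, pattern=DEFAULT_REGEXP):
    """Split text the way the default regexp would (regex step only)."""
    return re.findall(pattern, text)


def word_frequencies(tokens):
    """Count tokens into a plain word -> count dict."""
    return dict(Counter(tokens))


def cloud_from_tokens(tokens, font_path=None):
    """Build a cloud from already segmented tokens (needs wordcloud installed)."""
    from wordcloud import WordCloud  # imported lazily so the helpers above work without it
    wc = WordCloud(font_path=font_path, background_color="white")
    return wc.generate_from_frequencies(word_frequencies(tokens))


# English is split on word boundaries as expected
print(tokenize("Trump's first visit to China"))

# Chinese has no spaces, so the whole sentence comes back as one token
print(tokenize("今天天气很好"))

# Hand-segmented tokens, standing in for the output of a segmentation tool
segmented = ["今天", "天气", "很", "好", "今天"]
print(word_frequencies(segmented))
```

For Chinese text, cloud_from_tokens(segmented, font_path=...) would also need a font_path that points to a font with Chinese glyphs, as noted for the font_path parameter.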

2 A simple example

We use the following short passage as an example:

U.S. President Donald Trump arrived in Beijing Wednesday afternoon, beginning his three-day state visit to China.
It is Trump's first visit to the country since he assumed the presidency in January. He is the first head of state to visit China since the landmark 19th National Congress of the * Party of China.
During his stay in Beijing, Trump will hold talks with Chinese President Xi * and meet with other Chinese leaders.
Xi and Trump will hold strategic communications on significant issues of common concern to build new consensus, enhance mutual understanding and friendship, and promote bilateral relations in all spheres, according to Vice Foreign Minister Zheng Zeguang.
Apart from formal activities commensurate with a state visit, "informal interactions" will be arranged for the presidents of the two countries, Zheng said.
This is the third meeting between Xi and Trump following their first meeting at Mar-a-Lago, Florida in April and the second in Hamburg, Germany on the sidelines of the G20 summit in July.
This year marks the 45th anniversary of former U.S. President Richard Nixon's "ice-breaking" visit to China, which began the normalization of relations between the two countries.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

text = """
U.S. President Donald Trump arrived in Beijing Wednesday afternoon, beginning his three-day state visit to China.
It is Trump's first visit to the country since he assumed the presidency in January. He is the first head of state to visit China since the landmark 19th National Congress of the * Party of China.
During his stay in Beijing, Trump will hold talks with Chinese President Xi * and meet with other Chinese leaders.
Xi and Trump will hold strategic communications on significant issues of common concern to build new consensus, enhance mutual understanding and friendship, and promote bilateral relations in all spheres, according to Vice Foreign Minister Zheng Zeguang.
Apart from formal activities commensurate with a state visit, "informal interactions" will be arranged for the presidents of the two countries, Zheng said.
This is the third meeting between Xi and Trump following their first meeting at Mar-a-Lago, Florida in April and the second in Hamburg, Germany on the sidelines of the G20 summit in July.
This year marks the 45th anniversary of former U.S. President Richard Nixon's "ice-breaking" visit to China, which began the normalization of relations between the two countries.
"""


mask_color_path = "bg_2.png"  # path to the background image
font_path = '/Library/Fonts/华文黑体.ttf'  # font that supports Chinese; change this to a font path on your machine (can be omitted for English-only text)
imgname1 = "en_WordCloud_DefautColors.png"  # output file 1 (shaped by the background image only)
imgname2 = "en_WordCloud_ColorsByImg.png"  # output file 2 (colors laid out according to the background image)
width = 1000
height = 860
margin = 2
# Load the background image as a NumPy array
mask_coloring = np.array(Image.open(mask_color_path))
wc = WordCloud(font_path=font_path,
               background_color="white",  # background color
               max_words=200,  # maximum number of words shown in the cloud
               mask=mask_coloring,  # background image used as the mask
               max_font_size=200,  # largest font size
               # random_state=42,
               width=width, height=height, margin=margin,  # width and height are ignored when a mask is given
               )
wc.generate(text)
plt.figure()
# Display the word cloud
plt.imshow(wc)
plt.axis("off")
plt.show()

# Save the image
wc.to_file(imgname1)

# Take the colors from the background image
img_color = ImageColorGenerator(mask_coloring)
wc.recolor(color_func=img_color)
plt.figure()
# Display the recolored word cloud
plt.imshow(wc)
plt.axis("off")
plt.show()

# Save the image
wc.to_file(imgname2)

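The background image:
[image: background image, bg_2.png]

The result with the default font colors:
[image: word cloud saved as en_WordCloud_DefautColors.png]

The result colored according to the background image:
[image: word cloud saved as en_WordCloud_ColorsByImg.png]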
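
When no suitable background picture is available, a mask can also be built directly as an array, following the rule from the parameter list: entries equal to 255 (white) are masked out and everything else is free for words. This is only a sketch and assumes NumPy is installed; circle_mask and its size and radius arguments are hypothetical names chosen for illustration.

```python
import numpy as np


def circle_mask(size=300, radius=130):
    """Return a (size, size) array that is 0 inside a centered circle
    (drawable) and 255 outside it (masked out)."""
    yy, xx = np.ogrid[:size, :size]  # row and column index grids
    center = size // 2
    outside = (xx - center) ** 2 + (yy - center) ** 2 > radius ** 2
    return 255 * outside.astype(np.uint8)


mask = circle_mask()
print(mask.shape)      # shape of the canvas the cloud would use
print(mask[150, 150])  # center of the circle: drawable
print(mask[0, 0])      # corner of the canvas: masked out
```

The array could then be passed as WordCloud(mask=circle_mask(), background_color="white"), in which case width and height are taken from the mask shape.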

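III. Summary

This is a very simple introduction to wordcloud, written mainly as a record of what I have learned. Writing the post deepened my understanding of the material and cleared up a few points I had overlooked before. Let's keep learning together ^_^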