Python: an example of extracting words from text and counting word frequencies

These text operations come up often, so I am summarizing them here. More will be added over time.

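Operations:

strip_html(cls, text): remove HTML tags from the text

separate_words(cls, text, min_lenth=3): split the text into lowercase words, keeping only words longer than min_lenth characters

get_words_frequency(cls, words_list): count how often each word occurs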
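As a rough sketch of what these three operations produce, the snippet below mimics each step with standard-library calls (`re.sub`, `re.split`, `collections.Counter`). These stand-ins and the sample string are assumptions made for illustration; they are not the DocProcess implementation itself.

```python
import re
from collections import Counter

# Made-up sample input for illustration only.
sample = "<p>Python makes the text simple. Python text tools!</p>"

# Step 1: replace anything that looks like a tag with a space
# (a regex stand-in for strip_html; not robust for real-world HTML).
plain = re.sub(r"<[^>]*>", " ", sample)

# Step 2: split on non-word characters and keep lowercased words
# longer than 3 characters (mirrors the default min_lenth=3 filter).
words = [w.lower() for w in re.split(r"\W+", plain) if len(w) > 3]

# Step 3: count occurrences of each word.
frequency = Counter(words)

print(words)
print(frequency["python"], frequency["text"])
```

Note that "the" is dropped in step 2 because it is not longer than three characters.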

Source code:

import re


class DocProcess(object):
    """Helpers for stripping HTML, splitting text into words, and counting them."""

    @classmethod
    def strip_html(cls, text):
        """
        Delete HTML tags in text.
        text is a string.
        """
        new_text = " "
        is_html = False  # True while scanning between "<" and ">"
        for character in text:
            if character == "<":
                is_html = True
            elif character == ">":
                is_html = False
                new_text += " "  # a tag becomes a space so adjacent words stay apart
            elif not is_html:
                new_text += character
        return new_text

    @classmethod
    def separate_words(cls, text, min_lenth=3):
        """
        Separate text into a list of lowercase words.
        Only words longer than min_lenth characters are kept.
        """
        splitter = re.compile(r"\W+")
        return [s.lower() for s in splitter.split(text) if len(s) > min_lenth]

    @classmethod
    def get_words_frequency(cls, words_list):
        """
        Get the frequency of words in words_list.
        Returns a dict mapping each word to its count.
        """
        num_words = {}
        for word in words_list:
            num_words[word] = num_words.get(word, 0) + 1
        return num_words
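A frequency dict is usually read from the most common word downward. The helper below is a minimal sketch of that follow-up step; `top_words` and the sample `counts` dict are hypothetical names introduced here, not part of the original class.

```python
def top_words(num_words, n=5):
    """Return the n most frequent (word, count) pairs, highest count first."""
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(num_words.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical dict shaped like the return value of get_words_frequency.
counts = {"python": 3, "text": 2, "word": 2, "count": 1}
print(top_words(counts, n=3))
```

The same function can be applied directly to the dict returned by get_words_frequency.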


Original article: https://blog.csdn.net/autoliuweijie/article/details/50687419
