I would like to preprocess a corpus of documents in Python the same way I can in R. For example, given an initial corpus, corpus, I would like to end up with a preprocessed corpus that matches the one produced by the following R code:
library(tm)
library(SnowballC)
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("myword", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
Is there a simple or straightforward — preferably pre-built — method of doing this in Python? Is there a way to ensure exactly the same results?
For example, I would like to preprocess
@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!
into
ear pod amaz best sound inear headphon ive ever
2 Answers
#1
3
It seems tricky to get exactly the same results from nltk and tm on the preprocessing steps, so I think the best approach is to use rpy2 to run the preprocessing in R and pull the results into Python:
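import rpy2.robjects as ro

# ro.r() evaluates the R code and returns the value of its last expression
# (here, the processed corpus); x[0] pulls the text out of each document.
preproc = [x[0] for x in ro.r('''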
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)''')]
Then you can load it into scikit-learn. The only thing you need to do to make the CountVectorizer output match the DocumentTermMatrix is to remove terms shorter than 3 characters:
from sklearn.feature_extraction.text import CountVectorizer

def mytokenizer(x):
    # tm's DocumentTermMatrix drops terms shorter than 3 characters by default
    return [y for y in x.split() if len(y) > 2]
# Full document-term matrix
cv = CountVectorizer(tokenizer=mytokenizer)
X = cv.fit_transform(preproc)
X
# <1181x3289 sparse matrix of type '<type 'numpy.int64'>'
# with 8980 stored elements in Compressed Sparse Column format>
# Sparse terms removed
cv2 = CountVectorizer(tokenizer=mytokenizer, min_df=0.005)
X2 = cv2.fit_transform(preproc)
X2
# <1181x309 sparse matrix of type '<type 'numpy.int64'>'
# with 4669 stored elements in Compressed Sparse Column format>
Let's verify that this matches R:
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
dtm = DocumentTermMatrix(corpus)
dtm
# A document-term matrix (1181 documents, 3289 terms)
#
# Non-/sparse entries: 8980/3875329
# Sparsity : 100%
# Maximal term length: 115
# Weighting : term frequency (tf)
sparse = removeSparseTerms(dtm, 0.995)
sparse
# A document-term matrix (1181 documents, 309 terms)
#
# Non-/sparse entries: 4669/360260
# Sparsity : 99%
# Maximal term length: 20
# Weighting : term frequency (tf)
As you can see, the number of terms and the number of stored elements now match exactly between the two approaches.
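To make the two filters concrete, here is a small plain-Python sketch of what the length-3 tokenizer and a fractional min_df do together. It is only an illustration: the documents are made up (not taken from tweets.csv), frequent_terms is a hypothetical helper, and it assumes a fractional min_df keeps terms whose document count is at least min_df times the number of documents. How removeSparseTerms treats a term sitting exactly on the threshold may differ, so that edge case is worth checking on your own data.

```python
from collections import Counter

def mytokenizer(text):
    # Same rule as the tokenizer above: split on whitespace, keep terms of 3+ characters
    return [t for t in text.split() if len(t) > 2]

def frequent_terms(docs, min_df):
    """Hypothetical helper: terms that occur in at least min_df (a fraction) of docs."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that each term is counted once per document
        doc_freq.update(set(mytokenizer(doc)))
    threshold = min_df * len(docs)
    return sorted(t for t, n in doc_freq.items() if n >= threshold)

# Made-up, already-preprocessed documents for illustration only
docs = ["ear pod amaz", "best sound ever", "ear pod ok", "sound of ear pod"]
print(frequent_terms(docs, 0.5))  # ['ear', 'pod', 'sound']
```

Here "ok" and "of" are dropped by the length rule, and "amaz", "best" and "ever" are dropped because each appears in only one of the four documents.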
#2
1
CountVectorizer and TfidfVectorizer can be customized as described in the docs. In particular, you'll want to write a custom tokenizer, which is a function that takes a document and returns a list of terms. Using NLTK:
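import re

import nltk.stem
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

# Build the stemmer and the stop word set once rather than once per term
stemmer = nltk.stem.PorterStemmer()
stop_words = set(stopwords.words('english'))

def smart_tokenizer(doc):
    doc = doc.lower()
    terms = re.findall(r'\w+', doc, re.UNICODE)
    return [stemmer.stem(term)
            for term in terms
            if term not in stop_words]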
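Demo:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> doc = "@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!"
>>> v = CountVectorizer(tokenizer=smart_tokenizer)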
>>> v.fit_transform([doc]).toarray()
array([[1, 1, 1, 2, 1, 1, 1, 1, 1]])
>>> from pprint import pprint
>>> pprint(v.vocabulary_)
{u'amaz': 0,
u'appl': 1,
u'best': 2,
u'ear': 3,
u'ever': 4,
u'headphon': 5,
u'pod': 6,
u'sound': 7,
u've': 8}
(The example I linked to actually uses a class to cache the lemmatizer, but a function works too.)
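Note that the vocabulary in the demo contains 've' and counts 'ear' twice, while the target output in the question has 'ive' and 'inear'. A likely reason is the order of operations: the regex splits on punctuation, whereas removing punctuation first glues the pieces together. The sketch below compares the two on the question's sentence using only the standard library. It is an approximation of the tm behaviour, not a guaranteed match, and it leaves out stop word removal and stemming.

```python
import re
import string

text = "@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!"

# Approach 1: regex tokenization, as in smart_tokenizer above.
# Punctuation acts as a separator, so "I've" becomes "i" and "ve".
regex_terms = re.findall(r'\w+', text.lower())

# Approach 2: delete punctuation first, then split on whitespace.
# "I've" becomes "ive" and "in-ear" becomes "inear", as in the question's target.
no_punct = text.lower().translate(str.maketrans('', '', string.punctuation))
strip_terms = no_punct.split()

print(regex_terms)
print(strip_terms)
```

If matching the R output matters more than clean tokens, the second approach is the one to feed into the stop word filter and stemmer.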