I would like to preprocess a corpus of documents using Python in the same way that I can in R. For example, given an initial corpus, corpus, I would like to end up with a preprocessed corpus that corresponds to the one produced using the following R code:
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("myword", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
Is there a simple or straightforward — preferably pre-built — method of doing this in Python? Is there a way to ensure exactly the same results?
For example, I would like to preprocess
@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!
into
ear pod amaz best sound inear headphon ive ever
2 Answers
It seems tricky to get things exactly the same between nltk and tm on the preprocessing steps, so I think the best approach is to use rpy2 to run the preprocessing in R and pull the results into Python:
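import rpy2.robjects as ro

# Run the tm pipeline inside R and take the text of each document.
preproc = [x[0] for x in ro.r('''
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)         # Corpus, tm_map, removeWords, stemDocument
library(SnowballC)  # stemmer backend used by stemDocument
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)''')]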
Then, you can load it into scikit-learn. The only thing you'll need to do to get things to match between the CountVectorizer and the DocumentTermMatrix is to remove terms of length less than 3:
from sklearn.feature_extraction.text import CountVectorizer
def mytokenizer(x):
    # Drop terms shorter than 3 characters
    return [y for y in x.split() if len(y) > 2]
# Full document-term matrix
cv = CountVectorizer(tokenizer=mytokenizer)
X = cv.fit_transform(preproc)
# <1181x3289 sparse matrix of type '<type 'numpy.int64'>'
# with 8980 stored elements in Compressed Sparse Column format>
# Sparse terms removed
cv2 = CountVectorizer(tokenizer=mytokenizer, min_df=0.005)
X2 = cv2.fit_transform(preproc)
# <1181x309 sparse matrix of type '<type 'numpy.int64'>'
# with 4669 stored elements in Compressed Sparse Column format>
Let's verify this matches with R:
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
dtm = DocumentTermMatrix(corpus)
# A document-term matrix (1181 documents, 3289 terms)
# Non-/sparse entries: 8980/3875329
# Sparsity : 100%
# Maximal term length: 115
# Weighting : term frequency (tf)
sparse = removeSparseTerms(dtm, 0.995)
# A document-term matrix (1181 documents, 309 terms)
# Non-/sparse entries: 4669/360260
# Sparsity : 99%
# Maximal term length: 20
# Weighting : term frequency (tf)
As you can see, the number of stored elements and terms exactly match between the two approaches now.
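On why min_df=0.005 pairs with removeSparseTerms(dtm, 0.995): both filter on document frequency, one as a minimum fraction of documents and the other as a maximum sparsity. Here is a minimal sketch of the two rules in plain Python. It assumes tm keeps a term when its document count is strictly greater than n_docs * (1 - sparse) and scikit-learn keeps it when the count is at least min_df * n_docs; that is my reading of the two libraries, not something taken from the outputs above. The function names and the toy counts are made up for illustration.

```python
def kept_by_remove_sparse_terms(doc_freq, n_docs, sparse):
    """Terms kept under the assumed tm rule: df > n_docs * (1 - sparse)."""
    threshold = n_docs * (1 - sparse)
    return {term for term, df in doc_freq.items() if df > threshold}


def kept_by_min_df(doc_freq, n_docs, min_df):
    """Terms kept under the assumed scikit-learn rule: df >= min_df * n_docs."""
    threshold = min_df * n_docs
    return {term for term, df in doc_freq.items() if df >= threshold}


# Hypothetical document frequencies for a corpus of 1181 documents.
doc_freq = {"rare": 5, "common": 6, "frequent": 40}

tm_terms = kept_by_remove_sparse_terms(doc_freq, n_docs=1181, sparse=0.995)
sk_terms = kept_by_min_df(doc_freq, n_docs=1181, min_df=0.005)
```

With 1181 documents the cut-off is about 5.9, so both rules keep terms that appear in 6 or more documents. If the cut-off landed exactly on a whole number, the strict and inclusive comparisons could disagree by one term, so it is worth comparing the matrix shapes as done above.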
CountVectorizer and TfidfVectorizer can be customized as described in the docs. In particular, you'll want to write a custom tokenizer, which is a function that takes a document and returns a list of terms. Using NLTK:
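import re
import nltk.corpus  # stopwords is a corpus object, not an importable module
import nltk.stem

def smart_tokenizer(doc):
    doc = doc.lower()
    # Split on runs of word characters; punctuation acts as a separator
    doc = re.findall(r'\w+', doc, re.UNICODE)
    return [nltk.stem.PorterStemmer().stem(term)
            for term in doc
            if term not in nltk.corpus.stopwords.words('english')]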
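>>> from sklearn.feature_extraction.text import CountVectorizer
>>> doc = "@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!"
>>> v = CountVectorizer(tokenizer=smart_tokenizer)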
>>> v.fit_transform([doc]).toarray()
array([[1, 1, 1, 2, 1, 1, 1, 1, 1]])
>>> from pprint import pprint
>>> pprint(v.vocabulary_)
{u'amaz': 0,
u'appl': 1,
u'best': 2,
u'ear': 3,
u'ever': 4,
u'headphon': 5,
u'pod': 6,
u'sound': 7,
u've': 8}
(The example I linked to actually uses a class to cache the lemmatizer, but a function works too.)
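For what the class version could look like, here is a rough sketch rather than the linked example itself. It builds the stemmer and stop-word set once instead of on every term. It also has an option to delete punctuation without inserting spaces, which is what I understand tm's removePunctuation to do and why the question expects "inear" and "ive" where the \w+ split above gives "ear" twice and "ve". The names CachedTokenizer and make_nltk_tokenizer are my own, and the NLTK wiring assumes the stopwords corpus has been downloaded.

```python
import re
import string


class CachedTokenizer:
    """Callable tokenizer that holds its stemmer and stop-word set."""

    def __init__(self, stem, stop_words, strip_punctuation=True):
        self.stem = stem                          # e.g. PorterStemmer().stem
        self.stop_words = frozenset(stop_words)   # set lookup, built once
        self.strip_punctuation = strip_punctuation
        self._punct_table = str.maketrans("", "", string.punctuation)

    def __call__(self, doc):
        doc = doc.lower()
        if self.strip_punctuation:
            # Delete ASCII punctuation outright, so "in-ear" becomes "inear"
            terms = doc.translate(self._punct_table).split()
        else:
            # Treat punctuation as a separator, so "in-ear" becomes "in", "ear"
            terms = re.findall(r"\w+", doc)
        return [self.stem(t) for t in terms if t not in self.stop_words]


def make_nltk_tokenizer(extra_stop_words=()):
    """Wire the class to NLTK; needs nltk and its 'stopwords' corpus."""
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stop = set(stopwords.words("english")) | set(extra_stop_words)
    return CachedTokenizer(PorterStemmer().stem, stop)
```

An instance can be passed as CountVectorizer(tokenizer=make_nltk_tokenizer(["apple"])). Stop words are checked after punctuation is removed, mirroring the order of the tm_map calls in the question, but I would not expect the stems to match R in every case.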