We are working on a data mining project and have used the removeSparseTerms function in the tm package in R for reducing the features of our document term matrix.
However, we are looking to port the code to Python. Is there a function in sklearn, nltk or some other package that provides the same functionality?
Thanks!
1 solution
#1
3
If your data is plain text, you can use CountVectorizer in order to get this job done.
For example:
from sklearn.feature_extraction.text import CountVectorizer

# min_df=2 keeps only the terms that occur in at least two documents
vectorizer = CountVectorizer(min_df=2)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
# prints {'this': 4, 'is': 2, 'the': 3, 'first': 1, 'document': 0} (key order may vary)
X = vectorizer.transform(corpus)
Now X is the document-term matrix. (If you are into information retrieval, you may also want to consider Tf-idf term weighting.)
It can help you get the document-term matrix easily with a few lines.
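As a minimal sketch of the Tf-idf variant mentioned above: scikit-learn's TfidfVectorizer accepts the same min_df parameter as CountVectorizer, so the same pruning applies. The names tfidf_vectorizer and X_tfidf are only illustrative, and the corpus is the one from the example above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Same pruning rule as before: drop terms found in fewer than two documents
tfidf_vectorizer = TfidfVectorizer(min_df=2)
X_tfidf = tfidf_vectorizer.fit_transform(corpus)  # sparse Tf-idf weighted matrix

terms = sorted(tfidf_vectorizer.vocabulary_)
print(terms)          # the terms that survived the min_df filter
print(X_tfidf.shape)  # (number of documents, number of kept terms)
```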
Regarding the sparsity - you can control these parameters:
- min_df - the minimum document frequency allowed for a term in the document-term matrix. An integer is an absolute number of documents; a float between 0.0 and 1.0 is a proportion of the documents.
- max_features - the maximum number of features allowed in the document-term matrix; the terms with the highest frequency across the corpus are kept.
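To connect min_df to the R call, here is a rough sketch. It assumes that removeSparseTerms(dtm, sparse) keeps a term only when it occurs in strictly more than n_docs * (1 - sparse) documents, which is my reading of the tm documentation and worth checking against your tm version. The helper min_df_from_sparse is hypothetical, not part of sklearn.

```python
import math
from sklearn.feature_extraction.text import CountVectorizer

def min_df_from_sparse(n_docs, sparse):
    """Hypothetical helper: smallest integer document count that is
    strictly greater than n_docs * (1 - sparse)."""
    return math.floor(n_docs * (1.0 - sparse)) + 1

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Assumed equivalent of removeSparseTerms(dtm, 0.5) on this four-document corpus
min_df = min_df_from_sparse(len(corpus), sparse=0.5)
vectorizer = CountVectorizer(min_df=min_df)
X = vectorizer.fit_transform(corpus)

kept_terms = sorted(vectorizer.vocabulary_)
print(min_df, kept_terms)
```

Passing min_df=1 - sparse as a float gives nearly the same result, but sklearn keeps a term whose document frequency equals the threshold exactly, so the two can differ at the boundary. Converting to an integer count avoids that.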
Alternatively, if you already have the document-term matrix or Tf-idf matrix and you know what you consider sparse, define MIN_VAL_ALLOWED and then do the following:
import numpy as np
from scipy.sparse import csr_matrix

MIN_VAL_ALLOWED = 2
X = csr_matrix([[7, 8, 0],
                [2, 1, 1],
                [5, 5, 0]])
# z is a boolean mask of the non-sparse terms (column sum strictly above the threshold)
z = np.squeeze(np.asarray(X.sum(axis=0) > MIN_VAL_ALLOWED))
print(X[:, z].toarray())
# prints X without the third term (as it is sparse):
# [[7 8]
#  [2 1]
#  [5 5]]
(Use X = X[:,z] so X remains a csr_matrix.)
If it is the minimum document frequency you wish to set a threshold on, binarize the matrix first, and then use it the same way:
import numpy as np
from scipy.sparse import csr_matrix

MIN_DF_ALLOWED = 2
X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0,   0]])
# Create a copy of the data and binarize it: 1 where a term occurs in a document
B = csr_matrix(X, copy=True)
B[B > 0] = 1
# Column sums of B are document frequencies (the original summed X here by mistake)
z = np.squeeze(np.asarray(B.sum(axis=0) > MIN_DF_ALLOWED))
print(X[:, z].toarray())
# prints
# [[7.  1.3]
#  [2.  1.2]
#  [5.  1.5]]
In this example, the third and fourth terms (columns) are gone, since they appear in no more than two documents (rows): the comparison is strict, so a term must occur in more than MIN_DF_ALLOWED documents to be kept. Use MIN_DF_ALLOWED to set the threshold.
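The binarize-and-threshold steps above can be wrapped into one function that takes a sparsity level, in the style of the R call. This is only a sketch: remove_sparse_terms is a hypothetical name, and the rule it applies (keep a term when it occurs in strictly more than n_docs * (1 - sparse) documents) is an assumption about how tm behaves.

```python
import numpy as np
from scipy.sparse import csr_matrix

def remove_sparse_terms(X, sparse):
    """Hypothetical helper: drop the columns (terms) of a document-term
    matrix whose share of empty entries is at least `sparse`."""
    X = csr_matrix(X, copy=True)
    X.eliminate_zeros()             # make sure stored entries are true non-zeros
    n_docs = X.shape[0]
    doc_freq = X.getnnz(axis=0)     # number of documents containing each term
    keep = doc_freq > n_docs * (1.0 - sparse)
    return X[:, keep], keep

X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0,   0]])

# With sparse=0.5 a term must occur in more than 1.5 of the 3 documents
filtered, keep = remove_sparse_terms(X, sparse=0.5)
print(keep)
print(filtered.toarray())
```

Returning the boolean mask along with the filtered matrix lets you apply the same selection to a list of feature names.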