I have data in the form of sets and I want to convert it into a 2D numpy array. The data looks like this:
term = contains the words
document_number = contains the document numbers
tf-idf = contains the tf-idf value of each word with respect to each document, in order
I want the result to be a 2D numpy array like this:
        doc1  doc2  doc3  ...
term1      1     5     6
term2      0     4     1
term3      6     8    10
...
How should I implement it?
1 solution
Your description of the structure of tf-idf is not clear, so I have to make some assumptions about your data structure.
term_len = len(term)
doc_len = len(document_number)
So, assuming that tf-idf is a flat list (not a list of lists) that holds the values of the first term for all the documents, then the values of the second term, and so on:
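import numpy
term_freq = numpy.zeros((term_len, doc_len), dtype=float)  # tf-idf values are usually not integers
for i, value in enumerate(tf_idf):  # tf_idf is the flat list described above
    # row = term index, column = document index; each term owns doc_len consecutive entries
    term_freq[i // doc_len, i % doc_len] = value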
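If the opposite is true (the list holds all terms of the first document, then all terms of the second, and so on), swap the two operations: use i % term_len for the row and i // term_len for the column.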
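The same layout can probably be obtained without an explicit loop by reshaping the flat list. The following is only a sketch under the same assumptions as above; the sample values are made up to match the table in the question, and the names tf_idf and tf_idf_by_doc are placeholders for your own data. Note that Python sets are unordered, so the data has to be in lists (or another ordered sequence) for either approach to be meaningful.

```python
import numpy as np

# Hypothetical sample data matching the table in the question.
term = ["term1", "term2", "term3"]
document_number = [1, 2, 3]

# Assumed ordering: all documents for term1, then all documents for term2, ...
tf_idf = [1, 5, 6, 0, 4, 1, 6, 8, 10]

# One row per term, one column per document.
term_freq = np.asarray(tf_idf, dtype=float).reshape(len(term), len(document_number))

# Opposite ordering: all terms for doc1, then all terms for doc2, ...
tf_idf_by_doc = [1, 0, 6, 5, 4, 8, 6, 1, 10]

# Reshape with one row per document, then transpose to get terms as rows.
term_freq_alt = (
    np.asarray(tf_idf_by_doc, dtype=float)
    .reshape(len(document_number), len(term))
    .T
)

print(term_freq)
```

reshape raises a ValueError if the list length is not exactly len(term) * len(document_number), which is a useful check that the assumed structure actually matches your data.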