How to structure data in a 2D numpy array

Time: 2021-08-19 21:28:25

I have data in the form of sets and I want to convert it into a 2D numpy array. The data looks like this:

term = contains the words
document_number = contains the document numbers
tf-idf = contains the tf-idf of each word with respect to each document, in order

I want it in a 2D numpy array like this:

            doc1    doc2   doc3....
term1        1        5      6
term2        0        4      1
term3        6        8      10
.
.

How should I implement it?

1 Answer

#1

Your description of the structure of tf-idf is not clear, so I have to make some assumptions about your data structure.

term_len = len(term)
doc_len = len(document_number)

Assume that tf-idf is a flat list (not a list of lists) that holds the values of the first term for all the documents, then the values of the second term for all the documents, and so on.

import numpy

# tf_idf is the flat list of scores; row = term index, column = document index
term_freq = numpy.zeros((term_len, doc_len), dtype=float)
for (i, freq) in enumerate(tf_idf):
    term_freq[i // doc_len, i % doc_len] = freq
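The same layout can probably be built without an explicit loop by reshaping the flat list. This is only a sketch under the assumption stated above (a flat, term-major list of exactly term_len * doc_len values); the sample values and the name tf_idf are made up for illustration.

```python
import numpy as np

# Hypothetical sample data: 3 terms, 2 documents.
term = ["term1", "term2", "term3"]
document_number = [1, 2]
# Flat, term-major scores: all documents for term1, then term2, ...
tf_idf = [1.0, 5.0, 0.0, 4.0, 6.0, 8.0]

term_len = len(term)
doc_len = len(document_number)

# reshape fills row by row (C order), so row = term, column = document.
term_freq = np.asarray(tf_idf, dtype=float).reshape(term_len, doc_len)
print(term_freq)
```

reshape raises a ValueError if the list length does not equal term_len * doc_len, which also serves as a quick check on the assumption.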

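If the opposite is true (the list holds all terms of the first document, then all terms of the second document, and so on), just swap the modulo and division operations: the row becomes i % term_len and the column becomes i // term_len.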
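For the document-major case, a loop-free sketch could reshape to (doc_len, term_len) first and then transpose. The helper name to_term_doc_matrix and the sample values are hypothetical, and the layout is still an assumption about the asker's data.

```python
import numpy as np

def to_term_doc_matrix(flat_scores, term_len, doc_len):
    """Hypothetical helper: flat document-major scores -> (term_len, doc_len) array."""
    scores = np.asarray(flat_scores, dtype=float)
    if scores.size != term_len * doc_len:
        raise ValueError("expected term_len * doc_len scores")
    # Each chunk of term_len values belongs to one document, so reshape
    # to (doc_len, term_len) first, then transpose to put terms on rows.
    return scores.reshape(doc_len, term_len).T

# doc1 scores for term1..term3, then doc2 scores for term1..term3
flat = [1.0, 0.0, 6.0, 5.0, 4.0, 8.0]
matrix = to_term_doc_matrix(flat, term_len=3, doc_len=2)
print(matrix)
```

The result has terms on rows and documents on columns, matching the table in the question.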
