I have 3 arrays: array "words" of 5,000,000 pairs of the form ["id": "word"], array "ids" of 13,000 unique ids, and array "dict" of 500,000 unique words (a dictionary). This is my code:
matrix = sparse.lil_matrix((len(ids), len(dict)))
for i in words:
    matrix[ids.index(i['id']), dict.index(i['word'])] += 1.0
But it runs too slowly (I still don't have a matrix after 15 hours of running). Any ideas on how to optimize my code?
1 solution
#1
First of all, don't name your array dict: it is confusing, and it also hides the built-in type dict.
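The problem here is that you're doing everything in quadratic time: every call to list.index scans the list from the start, and you make two such calls for each of the 5,000,000 pairs. So first convert your arrays dict and ids to dictionaries in which each word or id points to its index.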
from scipy import sparse

matrix = sparse.lil_matrix((len(ids), len(dict)))
# Build each lookup table once so every later lookup is O(1) on average.
dict_from_dict = {word: ind for ind, word in enumerate(dict)}
dict_from_id = {id_: ind for ind, id_ in enumerate(ids)}
for i in words:
    matrix[dict_from_id[i['id']], dict_from_dict[i['word']]] += 1.0
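If the per-element updates on the lil_matrix are still a bottleneck after switching to dictionary lookups, one possible variation (not part of the answer above) is to collect all row and column indices first and build the matrix in one step in COO format. Converting COO to CSR sums entries that share the same (row, column), which gives the same counts as the += loop. The tiny data set and the names vocab, row_of and col_of below are hypothetical stand-ins chosen for illustration; vocab plays the role of the array called dict in the question.

```python
from scipy import sparse

# Tiny stand-in data (hypothetical); the real arrays are far larger.
ids = [10, 20, 30]
vocab = ["apple", "pear", "plum"]  # the question's "dict" array, renamed
words = [
    {"id": 10, "word": "apple"},
    {"id": 10, "word": "apple"},
    {"id": 20, "word": "plum"},
    {"id": 30, "word": "pear"},
]

# One lookup table per axis, built once.
row_of = {id_: ind for ind, id_ in enumerate(ids)}
col_of = {word: ind for ind, word in enumerate(vocab)}

# Translate every pair into (row, col) coordinates with a value of 1.0.
rows = [row_of[w["id"]] for w in words]
cols = [col_of[w["word"]] for w in words]
data = [1.0] * len(words)

# Duplicate (row, col) entries are summed during the COO -> CSR conversion.
matrix = sparse.coo_matrix(
    (data, (rows, cols)), shape=(len(ids), len(vocab))
).tocsr()
```

This trades some extra memory for the three index lists against avoiding millions of individual sparse-matrix assignments; whether it is actually faster for a given data set is worth measuring rather than assuming.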