I am a big fan of the word2vec algorithm. I obtained the binary vectors file published by the Google research team, and I would like to run some analysis on it (analysis I have previously done only on much smaller datasets).
I am not able to import the file GoogleNews-vectors-negative300.bin.gz into R.
I extracted it and, using rword2vec (found on GitHub), converted it from a bin file to a txt file. The package includes a search function, but it is really slow.
That is why I am now trying to import the file into R and convert it to a data frame, if possible, with this structure:
name | vec1 | ... | vec300
I tried the built-in readBin (could not obtain the names), readLines on the txt file (did not finish), and read_lines from the readr package (produced a vector of only 12 MB).
Could you please point me in the right direction?
1 Solution
#1
I finally found a way.
With the rword2vec package, you can either convert the file using the bin_to_txt function or work with the binary file directly through the functions the package provides. For more information, see the package vignette.
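If you go the bin_to_txt route, the resulting text file can be loaded into the name | vec1 | ... | vec300 layout with base R alone. The sketch below is only an illustration: read_w2v_txt is a hypothetical helper (not part of rword2vec), and it assumes the usual word2vec text layout, where the first line holds the vocabulary size and the vector dimension and every following line is a word followed by its values, separated by whitespace. Check the first few lines of your own file before relying on it.

```r
# Hypothetical helper (not part of rword2vec): load a word2vec text file
# into a data frame with columns name, vec1, ..., vecN.
# Assumed layout: line 1 is "<vocab_size> <dim>", every other line is
# "<word> <v1> ... <v_dim>" separated by whitespace.
read_w2v_txt <- function(path) {
  # The header line holds two integers: vocabulary size and dimension.
  header <- scan(path, what = integer(), nlines = 1, quiet = TRUE)
  n_words <- header[1]
  n_dim <- header[2]

  read.table(
    path,
    skip = 1,                 # skip the header line
    nrows = n_words,          # lets R allocate the right size up front
    quote = "",               # words may contain ' or "
    comment.char = "",        # words may contain #
    na.strings = "",          # keep the literal word "NA" as text
    colClasses = c("character", rep("numeric", n_dim)),
    col.names = c("name", paste0("vec", seq_len(n_dim))),
    stringsAsFactors = FALSE
  )
}

# Tiny synthetic file so the sketch can be run without the real data.
demo_txt <- tempfile(fileext = ".txt")
writeLines(
  c("3 2", "king 0.1 0.2 ", "queen 0.3 0.4 ", "NA 0.5 0.6 "),
  demo_txt
)
vectors_txt <- read_w2v_txt(demo_txt)
str(vectors_txt)
```

Keep in mind that the GoogleNews file holds 3,000,000 words with 300 dimensions each, so the full data frame needs roughly 3,000,000 × 300 × 8 bytes, about 7.2 GB of RAM, when stored as doubles. Passing a smaller nrows is an easy way to try it on a subset first.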
library(rword2vec)
dist <- distance(file_name = "GoogleNews-vectors-negative300.bin", search_word = "king", num = 10)
dist
           word              dist
1         kings 0.713804960250854
2         queen 0.651095926761627
3       monarch 0.641319692134857
4  crown_prince 0.620422065258026
5        prince 0.615999639034271
6        sultan 0.586482524871826
7         ruler 0.579756796360016
8       princes 0.564655303955078
9  Prince_Paras 0.543294668197632
10       throne 0.542210519313812
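For completeness, the extracted .bin file can also be read directly with readBin, which addresses the "could not obtain names" problem from the question. This is a rough sketch and not how rword2vec does it: read_w2v_bin and its max_words argument are hypothetical names, and the code assumes the layout written by the original word2vec C tool, namely a text header "<vocab_size> <dim>" ending in a newline, then for each word its bytes, a single space, and dim 4-byte little-endian floats, with an optional newline between entries.

```r
# Hypothetical helper (not part of rword2vec): read a word2vec binary file.
# Assumed layout: "<vocab_size> <dim>\n", then per word:
# "<word> " followed by <dim> 4-byte little-endian floats.
read_w2v_bin <- function(path, max_words = Inf) {
  con <- file(path, open = "rb")
  on.exit(close(con))

  # Read bytes up to (not including) stop_byte, dropping stray newlines.
  read_token <- function(stop_byte) {
    bytes <- raw(0)
    repeat {
      b <- readBin(con, what = "raw", n = 1)
      if (length(b) == 0 || b == stop_byte) break
      if (b != as.raw(0x0a)) bytes <- c(bytes, b)
    }
    token <- rawToChar(bytes)
    Encoding(token) <- "UTF-8"
    token
  }

  header <- as.integer(strsplit(read_token(as.raw(0x0a)), " ")[[1]])
  n_words <- min(header[1], max_words)  # optionally load only the first rows
  n_dim <- header[2]

  words <- character(n_words)
  mat <- matrix(0, nrow = n_words, ncol = n_dim)
  for (i in seq_len(n_words)) {
    words[i] <- read_token(as.raw(0x20))  # the word ends at the first space
    mat[i, ] <- readBin(con, what = "numeric", n = n_dim,
                        size = 4, endian = "little")
  }

  df <- data.frame(name = words, mat, stringsAsFactors = FALSE)
  names(df)[-1] <- paste0("vec", seq_len(n_dim))
  df
}

# Tiny synthetic binary file so the sketch can be run without the real data.
demo_bin <- tempfile(fileext = ".bin")
out <- file(demo_bin, open = "wb")
writeBin(charToRaw("2 3\n"), out)
writeBin(charToRaw("king "), out)
writeBin(c(0.5, -1, 2), out, size = 4, endian = "little")
writeBin(charToRaw("\nqueen "), out)
writeBin(c(0.25, 4, -8), out, size = 4, endian = "little")
close(out)

vectors_bin <- read_w2v_bin(demo_bin)
print(vectors_bin)
```

Reading the words byte by byte in an R loop will be slow on the full file, so something like read_w2v_bin(path, max_words = 100000) is a more realistic first step. Files written by the word2vec tool are typically ordered by word frequency, so the first rows tend to be the most common words.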