I need to run a script on a lot of files, and I'm trying to build a library of data so I won't have to redo the computations. Right now I'm using json.dump to output the results of each file as a .txt containing a dictionary, like this:
{"ARG": [98.1704330444336, 41.769107818603516, 73.10748291015625, 45.386558532714844, 66.13928985595703, 170.6997833251953, 181.3068084716797, 163.4752960205078, 105.4854507446289], "LEU": [28.727693557739258, 37.46043014526367, 13.47089672088623, 53.70556640625, 4.947306156158447, 0.17834201455116272], "ASP": [], "THR": [82.61577606201172, 66.58378601074219], "ILE": [114.99510192871094, 0.0, 41.7198600769043], "CYS": [], "LYS": [132.67730712890625, 34.025794982910156, 116.17617797851562, 95.01632690429688], "PHE": [2.027207136154175, 14.673666000366211, 33.46115493774414], "VAL": [], "SER": [87.324462890625, 100.39542388916016, 20.75590705871582, 49.42512893676758], "ASN": [115.7877197265625, 68.15550994873047, 79.04554748535156, 62.12760543823242], "MET": [], "TRP": [5.433267593383789], "GLN": [103.35163879394531, 12.17470932006836, 83.19425201416016, 81.73150634765625, 31.622051239013672], "PRO": [116.5839614868164], "TYR": [143.76821899414062], "GLU": [32.767948150634766, 112.40697479248047, 151.73361206054688, 53.77445602416992, 137.96853637695312, 137.53512573242188], "ALA": [81.7466812133789, 59.530941009521484, 30.13962173461914, 88.2237319946289], "GLY": [68.45809936523438], "HIS": []}
I can reload each dictionary with json.load. I'm trying to figure out the best way to handle my data, given that I will be loading all of these .txt files and joining them into one huge dictionary. The keys are the same in every dictionary, and I plan to append all of the list values into one big list per key. I will then do some mathematical operations (addition, division), draw histograms, run clustering, etc.
I want to know how you would do this, and whether the approach I described above will be inefficient or computationally expensive, given that the data will be huge.
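For reference, this is a minimal sketch of the merging step I described; the results/*.txt glob pattern is just an assumption about where the dumped files live:

```python
import json
from collections import defaultdict
from glob import glob

# Hypothetical layout: every per-file result was written with json.dump
# into results/*.txt, each containing a {key: [values...]} dictionary.
merged = defaultdict(list)
for path in glob("results/*.txt"):
    with open(path) as fh:
        data = json.load(fh)
    for key, values in data.items():
        merged[key].extend(values)  # one big list per key across all files

# merged["ARG"] is now a flat list of every ARG value from every file.
```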
1 Answer
#1
As always, it depends. If you are sure that there will be a lot of data, you can consider using the pandas library for Python (http://pandas.pydata.org/).
It is a very powerful data analysis library, and it lets you do additions, divisions, histograms, etc. directly on its data types. I found it very helpful and easy to use when solving issues similar (I believe) to yours.
If you go with this solution, you can use pandas' DataFrame objects (instead of Python's dict) to store the data and do all of the mentioned operations on that object. See the sketch below.
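As a rough sketch, assuming `merged` is the combined {key: [values...]} dict from your question: because the lists have different lengths per key, one convenient layout is a "long" DataFrame with one row per value.

```python
import pandas as pd

# `merged` is assumed to be the combined {key: [values...]} dict built earlier.
df = pd.DataFrame(
    [(key, value) for key, values in merged.items() for value in values],
    columns=["residue", "value"],
)

means = df.groupby("residue")["value"].mean()   # per-key sums/averages
scaled = df["value"] / df["value"].max()        # element-wise division

# Histogram of all values (needs matplotlib installed); per-key histograms
# can be drawn with df.hist(column="value", by="residue").
df["value"].plot.hist(bins=50)
```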
Pandas data types also have a nice interface for writing to/reading from files (e.g. DataFrame.to_json(...)).
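For the file side, a small round-trip sketch (the file name is just an example, and `df` is the long-format DataFrame from the sketch above):

```python
import pandas as pd

# Write the combined DataFrame out once, then reload it later
# instead of re-running the computations.
df.to_json("combined_results.json", orient="records")
restored = pd.read_json("combined_results.json", orient="records")
```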