
Time: 2022-02-22 23:13:51

I need to run a script on a lot of files. I'm trying to build a library of data so that I won't have to redo the computations. Right now I'm using json.dump to write the results for each file to a .txt file containing a dictionary, as follows:

{"ARG": [98.1704330444336, 41.769107818603516, 73.10748291015625, 45.386558532714844, 66.13928985595703, 170.6997833251953, 181.3068084716797, 163.4752960205078, 105.4854507446289], "LEU": [28.727693557739258, 37.46043014526367, 13.47089672088623, 53.70556640625, 4.947306156158447, 0.17834201455116272], "ASP": [], "THR": [82.61577606201172, 66.58378601074219], "ILE": [114.99510192871094, 0.0, 41.7198600769043], "CYS": [], "LYS": [132.67730712890625, 34.025794982910156, 116.17617797851562, 95.01632690429688], "PHE": [2.027207136154175, 14.673666000366211, 33.46115493774414], "VAL": [], "SER": [87.324462890625, 100.39542388916016, 20.75590705871582, 49.42512893676758], "ASN": [115.7877197265625, 68.15550994873047, 79.04554748535156, 62.12760543823242], "MET": [], "TRP": [5.433267593383789], "GLN": [103.35163879394531, 12.17470932006836, 83.19425201416016, 81.73150634765625, 31.622051239013672], "PRO": [116.5839614868164], "TYR": [143.76821899414062], "GLU": [32.767948150634766, 112.40697479248047, 151.73361206054688, 53.77445602416992, 137.96853637695312, 137.53512573242188], "ALA": [81.7466812133789, 59.530941009521484, 30.13962173461914, 88.2237319946289], "GLY": [68.45809936523438], "HIS": []}

I can reload the dictionary with json.load. I'd like to know the best way to handle my data, given that I will be joining all these .txt files into one huge dictionary. The keys are the same in every dictionary, so for each key I will append all of the list values into one big list. I will then do some mathematical operations (addition, division), draw histograms, run clustering, etc.
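For reference, the merge described above could be sketched roughly as follows. This is only a minimal illustration: the function name merge_result_files, the directory argument, and the "*.txt" pattern are assumptions made for the example and are not part of my actual script.

```python
import json
from collections import defaultdict
from pathlib import Path


def merge_result_files(directory, pattern="*.txt"):
    """Concatenate the per-key lists from every JSON file in `directory`.

    `directory` and `pattern` are hypothetical; adjust them to wherever
    the per-file results were dumped.
    """
    merged = defaultdict(list)
    # Sort the paths so the order of the merged values is reproducible.
    for path in sorted(Path(directory).glob(pattern)):
        with path.open() as handle:
            per_file = json.load(handle)
        for key, values in per_file.items():
            # extend() appends the individual numbers, not the list itself.
            merged[key].extend(values)
    return dict(merged)
```

Each file is read once and each value is appended once, so the merge itself is linear in the total number of values; whether the result fits comfortably in memory depends on how many files there are.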

I want to know how you would do it, and whether what I described above is going to be inefficient or computationally expensive, given that the data will be huge.


1 Solution



As always, it depends. If you are sure that there will be a lot of data, you can consider using the pandas library for Python (http://pandas.pydata.org/).


It is a very powerful data analysis library, and it lets you do additions, divisions, histograms, etc. directly on its data types. I found it very helpful and easy to use when solving issues similar (I believe) to yours.


If you go with this solution, you can use pandas' DataFrame objects (instead of Python's dict) to store the data and do all the mentioned operations on them.
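As a rough sketch of what that might look like: the lists in the question have different lengths per key, so one option (an assumption on my part, not the only possible layout) is a "long" table with one row per value. The helper to_long_frame, the column names "key" and "value", and the sample numbers are made up for illustration.

```python
import pandas as pd


def to_long_frame(merged):
    """Turn {key: [values...]} into a two-column DataFrame, one row per value."""
    rows = [(key, value) for key, values in merged.items() for value in values]
    return pd.DataFrame(rows, columns=["key", "value"])


# Made-up sample in the same shape as the dictionary in the question.
merged = {"THR": [80.0, 60.0], "ILE": [120.0, 0.0, 30.0], "CYS": []}
frame = to_long_frame(merged)

# Per-key statistics; keys with empty lists (CYS here) produce no rows.
summary = frame.groupby("key")["value"].agg(["count", "mean"])

# Element-wise arithmetic works directly on a column.
halved = frame["value"] / 2

# Histograms per key are available too, but need matplotlib installed:
# frame.hist(column="value", by="key")
```

With this layout, the addition, division, and histogram steps become column operations rather than Python loops over lists.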

Pandas data types also have a nice interface for writing to and reading from files (e.g. DataFrame.to_json(...)).
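A small round trip, as a sketch: the frame contents and the "records" orientation are just choices made for this example. Here the JSON is kept in a string; passing a file path to to_json and read_json works the same way.

```python
import io

import pandas as pd

# Made-up sample data for the example.
frame = pd.DataFrame(
    {"key": ["THR", "THR", "ILE"], "value": [82.5, 66.25, 115.0]}
)

# Serialize to a JSON string: a list of {"key": ..., "value": ...} records.
text = frame.to_json(orient="records")

# Read it back; wrapping the string in StringIO lets read_json treat it
# like a file.
restored = pd.read_json(io.StringIO(text), orient="records")
```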



