Statistical analysis on a large data set to be published on the web

Time: 2021-01-07 16:16:24

I have a non-computer related data logger that collects data from the field. The data is stored as text files, and I manually lump the files together and organize them. The current format is one csv file per year per logger. Each file is around 4,000,000 lines x 7 loggers x 5 years = a lot of data. Some of the data is organized into bins (item_type, item_class, item_dimension_class), and other data is more unique, such as item_weight, item_color, date_collected, and so on ...

Currently, I do statistical analysis on the data using a python/numpy/matplotlib program I wrote. It works fine, but the problem is, I'm the only one who can use it, since it and the data live on my computer.

I'd like to publish the data on the web using a postgres db; however, I need to find or implement a statistical tool that'll take a large postgres table, and return statistical results within an adequate time frame. I'm not familiar with python for the web; however, I'm proficient with PHP on the web side, and python on the offline side.

Users should be able to create their own histograms and data analyses. For example, one user might search for all blue items shipped between week x and week y, while another might look at the weight distribution of all items by hour across the whole year.
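To make those two example requests concrete, here is a minimal numpy sketch of what each one computes. The sample values, the date bounds standing in for "week x" and "week y", the weight bin edges, and the function names are all made up for illustration; only the field names come from the description above.

```python
import numpy as np

# Made-up sample data; one entry per logged item.
colors = np.array(["blue", "red", "blue", "blue"])            # item_color
weights = np.array([1.2, 0.7, 2.5, 1.9])                      # item_weight
dates = np.array(
    ["2020-01-06T08:15", "2020-01-07T08:40", "2020-01-15T13:05", "2020-03-02T13:30"],
    dtype="datetime64[m]",
)                                                             # date_collected


def weights_by_color_and_range(color, start, end):
    """Weights of items of one color collected in the half-open range [start, end)."""
    mask = (colors == color) & (dates >= start) & (dates < end)
    return weights[mask]


def weight_histogram_by_hour(weight_edges):
    """2D histogram: rows are hours of the day (0..23), columns are weight bins."""
    # Hour of day = time elapsed since midnight, truncated to whole hours.
    since_midnight = dates - dates.astype("datetime64[D]")
    hours = since_midnight.astype("timedelta64[h]").astype(int)
    hist, _, _ = np.histogram2d(hours, weights, bins=[np.arange(25), weight_edges])
    return hist


# "Week x" to "week y" expressed as two hypothetical dates.
blue = weights_by_color_and_range(
    "blue", np.datetime64("2020-01-06"), np.datetime64("2020-01-20")
)
by_hour = weight_histogram_by_hour([0.0, 1.0, 2.0, 3.0])
```

Each request boils down to a filter followed by a binned count, which is the shape of query any web-facing tool would need to answer quickly.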

I was thinking of creating and indexing my own statistical tools, or automating the process somehow to emulate most queries. This seemed inefficient.

I'm looking forward to hearing your ideas.

Thanks

1 solution

#1


I think you can make full use of your current combination (python/numpy/matplotlib) if the number of users is not too large. I do similar work, and my data size is a little more than 10 GB. The data is stored in a few sqlite files, and I use numpy to analyze the data, PIL/matplotlib to generate chart files (png, gif), cherrypy as the web server, and mako as the template language.
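As a rough illustration of that pipeline (not my actual code), the sketch below reads weights from sqlite, bins them with numpy, and renders a PNG in memory with matplotlib. The `items` table, its columns, the sample rows, and the function name are assumptions made for the example.

```python
import io
import sqlite3

import matplotlib

matplotlib.use("Agg")  # render off-screen; a web server has no display
import matplotlib.pyplot as plt
import numpy as np


def weight_histogram_png(conn, color, bins=10):
    """Return PNG bytes of a weight histogram for items of one color."""
    rows = conn.execute(
        "SELECT item_weight FROM items WHERE item_color = ?", (color,)
    ).fetchall()
    weights = np.array([r[0] for r in rows], dtype=float)
    counts, edges = np.histogram(weights, bins=bins)

    fig, ax = plt.subplots()
    # Draw one bar per bin, starting at the bin's left edge.
    ax.bar(edges[:-1], counts, width=np.diff(edges), align="edge")
    ax.set_xlabel("item_weight")
    ax.set_ylabel("count")

    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)  # free the figure; a long-running server would otherwise leak memory
    return buf.getvalue()


# Demo with an in-memory database and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_color TEXT, item_weight REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("blue", 1.2), ("blue", 2.5), ("red", 0.7)],
)
png = weight_histogram_png(conn, "blue")
```

In cherrypy, an exposed handler method could set the `Content-Type` response header to `image/png` and return these bytes directly, or the bytes could be written to a file that a mako template links to.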

If you need more of a server/client database, you can migrate to postgresql, but you can still fully use your current programs if you go with a python web framework such as cherrypy.
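One way to keep such a migration cheap is to write the analysis code against the generic Python DB-API and let the database do the aggregation, so only a small summary crosses the wire. The sketch below uses sqlite3 for the demo; the `items` table, its rows, and the function name are hypothetical, and passing a postgresql driver connection (psycopg2, for instance) instead is an assumption that has not been tested here.

```python
import sqlite3


def weight_summary_by_type(conn):
    """Return (item_type, count, average weight) rows, aggregated in the database.

    Works with any DB-API 2.0 connection; the SQL has no driver-specific
    placeholders, so it does not depend on sqlite3's "?" parameter style.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT item_type, COUNT(*), AVG(item_weight) "
        "FROM items GROUP BY item_type ORDER BY item_type"
    )
    return cur.fetchall()


# Demo with an in-memory sqlite database and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_type TEXT, item_weight REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("A", 1.0), ("A", 3.0), ("B", 2.0)],
)
summary = weight_summary_by_type(conn)
```

Queries that take user input would still need the driver's own placeholder syntax, which is the main thing to adjust when switching databases.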
