I'm working on a web app that displays analytics data from a MySQL database table. I expect to collect data from about 10,000 users at most, and the table will hold millions of records per user.
I'm considering giving each user their own table, but more importantly I want to figure out how to optimize data retrieval.
I get data from the database table using a series of SELECT COUNT queries for a particular day. An example is below:
SELECT * FROM
(SELECT COUNT(id) AS data_point_1 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '1') AS col_1
CROSS JOIN
(SELECT COUNT(id) AS data_point_2 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '0') AS col_2
CROSS JOIN ...
When I want to retrieve data for the last 30 days, the query will be 30 times as long as the one above, and likewise for 60 days and so on. The user will be able to select the number of days (e.g. 30, 60, 90) or a custom range.
I need the data for a time series chart. Just to be clear, data for each day could range from thousands of records to millions.
My question is:
- Is this the most performant way of retrieving this data, or is there a better way to get all the time series data I need in one SQL query? How is this going to work when a user needs data from the last 2 years, i.e. a MySQL query that is potentially over a thousand lines long?
- Should I consider caching the retrieved data (using memcache, for example) for extended periods of time, e.g. an hour or more, to reduce server load? Since this is analytics data, it really should be real-time, but I'm afraid of overloading the server with queries for the same data even when there are no changes.
Any assistance would be appreciated.
1 Solution
#1
First, you should not put each user in a separate table. You have other options that are not nearly as intrusive on your application.
You should consider partitioning the data. Based on what you say, I would partition the table by time (by day, week, or month) and put an index on the user column. Your query should probably look more like:
select date(datetime), count(*)
from t
where userid = 1 and datetime between DATE1 and DATE2
group by date(datetime)
You can then pivot this, either in an outer query or in an application.
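As a rough illustration of the grouped query plus an application-side pivot, here is a minimal Python sketch. It uses the standard sqlite3 module as a stand-in for MySQL so that it runs on its own; the table and column names are borrowed from the question, the sample rows and the helper `daily_counts` are made up for the example, and the SQL may need small dialect changes on a real MySQL server.

```python
import sqlite3
from collections import defaultdict

# Assumption: sqlite3 stands in for MySQL; names follow the question's example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER,"
    " status_id INTEGER,"
    " datetime_added TEXT)"
)
# Made-up sample rows: (customer_id, status_id, datetime_added).
rows = [
    (1, 1, "2013-01-20 08:00:00"),
    (1, 1, "2013-01-20 09:30:00"),
    (1, 0, "2013-01-20 10:00:00"),
    (1, 1, "2013-01-21 11:00:00"),
    (2, 1, "2013-01-20 12:00:00"),  # different customer, should be ignored
]
conn.executemany(
    "INSERT INTO my_table (customer_id, status_id, datetime_added)"
    " VALUES (?, ?, ?)",
    rows,
)


def daily_counts(conn, customer_id, start, end):
    """Return {day: {status_id: count}} for start <= datetime_added < end."""
    cur = conn.execute(
        "SELECT date(datetime_added) AS day, status_id, COUNT(*) AS n"
        " FROM my_table"
        " WHERE customer_id = ?"
        "   AND datetime_added >= ? AND datetime_added < ?"
        " GROUP BY date(datetime_added), status_id"
        " ORDER BY day, status_id",
        (customer_id, start, end),
    )
    # Pivot in the application: one entry per day, one key per status.
    pivot = defaultdict(dict)
    for day, status_id, n in cur:
        pivot[day][status_id] = n
    return dict(pivot)


series = daily_counts(conn, 1, "2013-01-20", "2013-01-22")
print(series)
```

The query stays the same length whether the range covers 30 days or 2 years; only the two date parameters change. The half-open range (`>=` start, `<` end) is a choice made in this sketch so that the last day is not cut off at midnight.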
I would also suggest that you summarize the data on a daily basis, so your analyses can run on the summarized tables. This will make things go much faster.
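One possible shape for such a daily summary is sketched below, again in Python with sqlite3 standing in for MySQL. The `daily_summary` table, its columns, and the `summarize_day` helper are hypothetical names invented for this example. In practice the helper would be run by a scheduled job, which is an assumption rather than something stated above.

```python
import sqlite3

# Assumption: sqlite3 stands in for MySQL; all names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_table (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    status_id INTEGER,
    datetime_added TEXT
);
-- Hypothetical summary table: one row per customer, day and status.
CREATE TABLE daily_summary (
    customer_id INTEGER,
    day TEXT,
    status_id INTEGER,
    n INTEGER,
    PRIMARY KEY (customer_id, day, status_id)
);
""")
conn.executemany(
    "INSERT INTO my_table (customer_id, status_id, datetime_added)"
    " VALUES (?, ?, ?)",
    [
        (1, 1, "2013-01-20 08:00:00"),
        (1, 1, "2013-01-20 09:30:00"),
        (1, 0, "2013-01-20 10:00:00"),
        (1, 1, "2013-01-21 11:00:00"),
    ],
)


def summarize_day(conn, day):
    """Recompute the summary rows for one calendar day (YYYY-MM-DD)."""
    with conn:  # commit on success, roll back on error
        # Delete first so that re-running for the same day is safe.
        conn.execute("DELETE FROM daily_summary WHERE day = ?", (day,))
        conn.execute(
            "INSERT INTO daily_summary (customer_id, day, status_id, n)"
            " SELECT customer_id, date(datetime_added), status_id, COUNT(*)"
            " FROM my_table"
            " WHERE datetime_added >= ?"
            "   AND datetime_added < date(?, '+1 day')"
            " GROUP BY customer_id, date(datetime_added), status_id",
            (day, day),
        )


for d in ("2013-01-20", "2013-01-21"):
    summarize_day(conn, d)

# The chart query now reads a few small rows per day instead of raw events.
chart = conn.execute(
    "SELECT day, status_id, n FROM daily_summary"
    " WHERE customer_id = ? AND day BETWEEN ? AND ?"
    " ORDER BY day, status_id",
    (1, "2013-01-20", "2013-01-21"),
).fetchall()
print(chart)
```

With this layout a 2-year chart reads at most a few rows per day per customer, and the current day can still be counted live from the raw table if real-time figures are needed.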