Handling large data sets from MySQL to PHP to JSON

Date: 2021-04-08 16:54:49

Situation: I need to grab a large amount of data from the database (~150k+ rows). Then, using PHP, I split that data by day and count it (1st: ~10k, 2nd: ~15k, etc.), and also increment another value from the daily figures. After doing that, I need to format all that information into a JSON array, return it to the client, and display a graph of these statistics.

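For concreteness, here is a minimal sketch of that pipeline as described, assuming PDO; the DSN, the events table, and its created_at column are made-up placeholders:

<?php
// Minimal sketch of the pipeline described above: fetch rows, count
// them per day in PHP, and emit the result as JSON. The DSN, table
// (events), and timestamp column (created_at) are hypothetical.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->query('SELECT created_at FROM events');

$daily = [];
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    $day = substr($row['created_at'], 0, 10); // 'YYYY-MM-DD'
    $daily[$day] = ($daily[$day] ?? 0) + 1;   // per-day count
}

header('Content-Type: application/json');
echo json_encode($daily);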

Now, I'm pretty sure this can all be handled by PHP, but it would probably create a lot of load on the server and use a lot of bandwidth, especially if clients keep refreshing the page to view updated stats. There are also about 5k+ active users daily, so there will be a lot of data being fetched.


What would be the best way to handle this?


Note: The server has 4 GB of DDR3 RAM.


2 solutions

#1 (2 votes)

You'd want to implement some kind of caching mechanism, so each user only sees data at a granularity of (say) 1 minute. That way, even if a user is hammering refresh, they'd only trigger the DB query/data collation once a minute, and would otherwise get the previous results.

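A minimal sketch of that idea, using a shared file as the cache, is below; the $compute callable stands in for whatever DB query and collation you actually run:

<?php
// Minimal sketch of a 60-second cache around the expensive work.
// $compute stands in for the real 150k-row query + collation.
function getStatsJson(callable $compute): string
{
    $cacheFile = sys_get_temp_dir() . '/stats_cache.json';
    $ttl = 60; // one-minute granularity

    // Serve the cached copy while it is still fresh.
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile);
    }

    // Stale or missing: recompute once, then cache it for everyone.
    $json = json_encode($compute());
    file_put_contents($cacheFile, $json, LOCK_EX);
    return $json;
}

// Usage: the closure would run the real query; dummy data here.
echo getStatsJson(fn () => ['2021-04-01' => 10000, '2021-04-02' => 15000]);

Because the cache file is shared, this also covers the cross-user case discussed next: the first request each minute pays for the query, and everyone else just reads the file.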

If the data is relatively similar between users, that'll reduce the total database load even further. Assuming each user hits refresh every 10 seconds, and each data set is shared with 10% of the other users, then a per-query cache with 1-minute granularity takes you from


150,000 rows * 6 times per minute * 5000 users = 4.5 billion rows fetched

to

150,000 rows * 1 time per minute * 500 users = 75 million rows fetched.

(i.e. 1/60th of the rows fetched).


#2 (0 votes)

Short answer: don't perform the calculations every time; save the results of the calculation in a database table, and return those results.

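A minimal sketch of that approach, assuming a cron job and hypothetical events/daily_stats tables, might look like:

<?php
// Minimal sketch: precompute the daily counts into a summary table.
// Run this from a cron job (say once a minute). The events table,
// its created_at column, and daily_stats are hypothetical names.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$pdo->exec('CREATE TABLE IF NOT EXISTS daily_stats (
    day   DATE PRIMARY KEY,
    total INT NOT NULL
)');

// Let MySQL do the grouping instead of dragging 150k rows into PHP.
$pdo->exec('REPLACE INTO daily_stats (day, total)
            SELECT DATE(created_at), COUNT(*)
            FROM events
            GROUP BY DATE(created_at)');

The page handler then only has to SELECT from daily_stats and json_encode the result, which stays cheap no matter how often users refresh.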

Longer answer: the above, but understand that it can be tricky, depending on just how up-to-date you expect your data to be. Consider how much new data it takes to invalidate your result set, and design around that.

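One way to make that concrete: record a high-water mark when you compute, and only recompute once enough new rows have arrived. A rough sketch, where the id column and the 1,000-row threshold are assumptions:

<?php
// Minimal sketch of threshold-based invalidation: only rebuild the
// summary once enough new rows exist. The events table, its
// auto-increment id, and the 1,000-row threshold are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$marker     = sys_get_temp_dir() . '/stats_last_id';
$lastSeenId = is_file($marker) ? (int) file_get_contents($marker) : 0;
$maxId      = (int) $pdo->query('SELECT COALESCE(MAX(id), 0) FROM events')
                    ->fetchColumn();

if ($maxId - $lastSeenId >= 1000) { // tolerate up to ~1,000 stale rows
    // ...rebuild daily_stats here, as in the sketch above...
    file_put_contents($marker, (string) $maxId);
}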
