I'm planning the structure of a MySQL database and could use some advice from more seasoned professionals. The site the DB belongs to gathers 90 days of weather data for EACH registered user, and has to support millions of users.
I already have a table for the users, with their login and contact information, but assume that I need a second table for all the weather data...
What I intend to do is basically store the average temperature, humidity, wind direction and so forth, per day, for every user. Each day the DB is updated with the new day's data while keeping the earlier entries (limited to 89 days of old data plus the current day's data), for all users.
Now, does it make most sense to have one huge "data" table that has 90 rows for EVERY user (with millions of users)? Or is there a more clever way to do this that is better for performance reasons or similar?
The 90 days of data will be accessed (read and displayed, etc.) every time a user logs in and views their own profile or browses someone else's profile. But it will only be updated once per day (overwriting the oldest entry, maintaining the limit of 90 rows per user).
5 solutions
#1
2
Edit: saw just now that each user has different weather data. Keeping the "shared data" in the answer, but you're interested in the second case.
Users share weather data
Based, say, on their nearest weather station ID.
I'd store a (userId, stationId, isActive, isPreferred) table to know what data the user is interested in, and then I'd run a query against stationWeatherData to fetch the 90 rows of weather data for that station.
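A minimal sketch of that shared-data layout, using Python's built-in sqlite3 as a stand-in for MySQL so it can be run as-is. The table name userStation, the weather columns and the sample rows are assumptions made up for illustration; only stationWeatherData and the four mapping columns come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE userStation (              -- hypothetical name for the mapping table
    userId      INTEGER NOT NULL,
    stationId   INTEGER NOT NULL,
    isActive    INTEGER NOT NULL DEFAULT 1,
    isPreferred INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (userId, stationId)
);
CREATE TABLE stationWeatherData (       -- one row per station per day (assumed columns)
    stationId       INTEGER NOT NULL,
    sample_date     TEXT    NOT NULL,   -- ISO date, e.g. 2024-01-03
    temperature_avg REAL,
    humidity        REAL,
    PRIMARY KEY (stationId, sample_date)
);
""")

# Illustrative data: user 1 follows stations 10 (preferred) and 11.
conn.execute("INSERT INTO userStation VALUES (1, 10, 1, 1)")
conn.execute("INSERT INTO userStation VALUES (1, 11, 1, 0)")
for station in (10, 11):
    for day in range(1, 4):
        conn.execute(
            "INSERT INTO stationWeatherData VALUES (?, ?, ?, ?)",
            (station, f"2024-01-{day:02d}", 10.0 + day, 50.0),
        )


def preferred_station_weather(conn, user_id, days=90):
    """Return up to `days` rows, newest first, for the user's preferred station."""
    return conn.execute(
        """
        SELECT w.sample_date, w.temperature_avg, w.humidity
        FROM userStation AS us
        JOIN stationWeatherData AS w ON w.stationId = us.stationId
        WHERE us.userId = ? AND us.isActive = 1 AND us.isPreferred = 1
        ORDER BY w.sample_date DESC
        LIMIT ?
        """,
        (user_id, days),
    ).fetchall()


rows = preferred_station_weather(conn, 1)
```

The weather table then grows with the number of stations rather than the number of users, which is the point of this variant.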
Each user has his own weather data
There shouldn't be particular problems in handling 900 million rows (for example, 10 million users at 90 rows each). If you really had to, you could "shard" across different tables based on userId; e.g., table weather174 would hold the data of all users for which (userId % 1000) gives 174, and you'd find yourself with 1000 tables, possibly on different servers, each one thousandth of the size.
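The modulo rule for picking a shard could look roughly like this. It is only a sketch: the function names and the column list are hypothetical, and the %s placeholder assumes a MySQL driver that uses that parameter style.

```python
SHARD_COUNT = 1000  # number of shard tables, as in the example above


def shard_table(user_id: int, shard_count: int = SHARD_COUNT) -> str:
    """Map a user id to its shard table, e.g. 1234174 -> 'weather174'."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return f"weather{user_id % shard_count}"


def select_for_user(user_id: int) -> str:
    """Build the per-shard query text; the user id itself is still bound as a parameter."""
    # Table names cannot be bound as parameters, so the name is interpolated.
    # That is safe here because it is derived purely from an integer.
    table = shard_table(user_id)
    return (
        f"SELECT sample_date, temperature_avg, humidity FROM {table} "
        "WHERE user_id = %s ORDER BY sample_date DESC LIMIT 90"
    )
```

Every read and write for a given user has to go through the same mapping, so changing SHARD_COUNT later means moving data between tables.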
So you start with one big table and prepare for sharding (or for moving to cloud storage and a NoSQL or distributed datastore, e.g. MongoDB, VoltDB). Or partition based on userId as soon as the number of users reaches, say, one million.
Or you might not even use a database at all. A DB makes sense if you need to search or correlate/join data; here you are just accessing a user's "weather station".
If you know you're never going to query "How many users have 60% humidity?", but always only "What data are there for user 1234567?", then you might save the data in a rolling buffer in binary, JSON or HTML format (on cloud storage, S3, or again MongoDB, now with only one document per user). Much would then depend on how the data to be updated arrives, i.e., in one big batch from a concentrator or with each user uploading their own.
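As a rough illustration of the rolling-buffer idea, here is one way to keep a per-user JSON document capped at 90 entries. The document layout (a plain JSON array of daily readings) and the field names are assumptions, and where the document is stored (S3, MongoDB, a file) is left out.

```python
import json
from collections import deque

MAX_DAYS = 90  # size of the rolling window


def append_day(doc_json: str, reading: dict, max_days: int = MAX_DAYS) -> str:
    """Append one day's reading to a user's JSON document, dropping the oldest if full."""
    # deque(maxlen=...) discards items from the left once the limit is reached.
    days = deque(json.loads(doc_json), maxlen=max_days)
    days.append(reading)
    return json.dumps(list(days))


# Simulate 95 daily updates for one user, starting from an empty document.
doc = "[]"
for i in range(95):
    doc = append_day(doc, {"day": i, "temperature_avg": 20.0, "humidity": 55})

history = json.loads(doc)
```

Reading a profile is then a single document fetch, at the cost of not being able to query across users.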
#2
1
For my answer (below), I assumed the data is specific to the user, such as from their personal backyard weather station. If it is data shared with other users, then my answer is sub-optimal.
That seems reasonable, but why stop at 90 days? Keep daily information for each user for as long as they are valid users. The query you describe is then always something like:
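-- Latest 90 daily summaries for one user, newest first.
SELECT temperature_avg, humidity, wind_direction, wind_speed
FROM weather_summary
WHERE user_id = ?  -- bind the id of the user whose profile is being viewed
ORDER BY sample_date DESC
LIMIT 90;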
As long as there is an index on sample_date and user_id (ideally a single composite index on (user_id, sample_date), so that one user's rows can be read already sorted by date), this will be extremely efficient.
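To make the indexing point concrete, here is a small runnable sketch using Python's sqlite3 in place of MySQL. The column types and the generated sample data are assumptions; the composite primary key on (user_id, sample_date) is the part that matters, and in MySQL it would be declared the same way.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE weather_summary (
        user_id         INTEGER NOT NULL,
        sample_date     TEXT    NOT NULL,   -- ISO date; a DATE column in MySQL
        temperature_avg REAL,
        humidity        REAL,
        wind_direction  REAL,
        wind_speed      REAL,
        PRIMARY KEY (user_id, sample_date)  -- composite key doubles as the index
    )
    """
)

# Illustrative data: 100 days of readings for two users.
start = date(2024, 1, 1)
for user_id in (1, 2):
    for i in range(100):
        day = (start + timedelta(days=i)).isoformat()
        conn.execute(
            "INSERT INTO weather_summary VALUES (?, ?, ?, ?, ?, ?)",
            (user_id, day, 15.0, 60.0, 180.0, 3.5),
        )

# The profile query: newest 90 days for one user.
rows = conn.execute(
    """
    SELECT sample_date, temperature_avg, humidity, wind_direction, wind_speed
    FROM weather_summary
    WHERE user_id = ?
    ORDER BY sample_date DESC
    LIMIT 90
    """,
    (1,),
).fetchall()
```

With the key ordered as (user_id, sample_date), the database can jump to the user's newest row and read backwards, rather than sorting all of that user's rows.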
Having a separate table for each user has never worked out very well in my experience.
#3
1
If you are storing the location of each user, it would be simpler to store the weather data based on location and map it to the user on demand.
UserId --> LocationId --> Weather details.
Assuming that on average there will be multiple users at each location, this should cut down your database size quite a bit and should also scale better.
#4
1
I'd recommend a single table for the weather data, partitioned by the date (see MySQL documentation on range partitioning).
This way, you can easily get rid of old data (simply drop the oldest partition), and queries for ranges of days (say, average temperature for the last 7 days) will be very efficient.
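One possible shape for the daily partition rotation, sketched as a Python helper that only builds the MySQL statements and does not run them. It assumes a table weather_summary created with PARTITION BY RANGE (TO_DAYS(sample_date)), one partition per day named pYYYYMMDD, and no MAXVALUE catch-all partition; the table name, naming scheme and function names are all made up for illustration.

```python
from datetime import date, timedelta


def partition_name(day: date) -> str:
    """Hypothetical naming scheme: one partition per day, e.g. p20240409."""
    return f"p{day:%Y%m%d}"


def daily_rotation_sql(today: date, keep_days: int = 90,
                       table: str = "weather_summary") -> list:
    """Statements to add today's partition and drop the one that just expired."""
    tomorrow = today + timedelta(days=1)
    # Keeping today plus the 89 days before it means the day 90 days back expires.
    expired = today - timedelta(days=keep_days)
    return [
        # Rows with sample_date earlier than tomorrow land in today's partition.
        f"ALTER TABLE {table} ADD PARTITION (PARTITION {partition_name(today)} "
        f"VALUES LESS THAN (TO_DAYS('{tomorrow.isoformat()}')))",
        # Dropping a partition discards its rows without a row-by-row DELETE.
        f"ALTER TABLE {table} DROP PARTITION {partition_name(expired)}",
    ]


statements = daily_rotation_sql(date(2024, 4, 9))
```

Note that MySQL requires the partitioning column to be part of every unique key on the table, so a primary key of (user_id, sample_date) would fit this scheme.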
#5
0
- Create indexes on the table columns you query by (id, full-text indexing).
- As an idea, you can create views on this table that contain data filtered by location, day, week, month, quarter, alphabetical range or other criteria; your code then decides which view to use to fetch the search results.
- Or, if your table sees many insert/update operations, you can create more than one table and have your server-side code choose, based on some criteria, which table to insert into or update.