I need to implement a custom-developed web analytics service for a large number of websites. The key entities here are:
- Website
- Visitor
Each unique visitor will have a single row in the database with information like landing page, time of day, OS, browser, referrer, IP, etc.
I will need to do aggregated queries on this database such as 'COUNT all visitors who have Windows as OS and came from Bing.com'
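For illustration, that kind of query might look like the following (a sketch only; the visitors table and the os/referrer column names are assumptions, since the actual schema isn't given):

    -- count visitors on Windows who arrived via Bing
    SELECT COUNT(*)
    FROM visitors
    WHERE os = 'Windows'
      AND referrer LIKE '%bing.com%';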
I have hundreds of websites to track, and the number of visitors for those websites ranges from a few hundred a day to a few million a day. In total, I expect this database to grow by about a million rows per day.
My questions are:
1) Is MySQL a good database for this purpose?
2) What could be a good architecture? I am thinking of creating a new table for each website. Or perhaps starting with a single table and then spawning a new table (daily) if the number of rows in an existing table exceeds 1 million (is my assumption correct?). My only worry is that if a table grows too big, SQL queries can get dramatically slow. So, what is the maximum number of rows I should store per table? Moreover, is there a limit on the number of tables that MySQL can handle?
3) Is it advisable to do aggregate queries over millions of rows? I'm ready to wait for a couple of seconds to get results for such queries. Is it a good practice or is there any other way to do aggregate queries?
In a nutshell, I am trying to design a large-scale data-warehouse kind of setup which will be write-heavy. If you know about any published case studies or reports, that'll be great!
4 Answers
#1
4
If you're talking larger volumes of data, then look at MySQL partitioning. For these tables, a partition by date/time would certainly help performance. There's a decent article about partitioning here.
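As a sketch of what that could look like (table and column names here are assumptions; note that in MySQL the partitioning column must be part of every unique key, hence the composite primary key):

    CREATE TABLE visits (
        id BIGINT NOT NULL AUTO_INCREMENT,
        website_id INT NOT NULL,
        visit_date DATE NOT NULL,
        os VARCHAR(32),
        referrer VARCHAR(255),
        PRIMARY KEY (id, visit_date)
    )
    -- one partition per day; old partitions can be dropped cheaply
    PARTITION BY RANGE (TO_DAYS(visit_date)) (
        PARTITION p20120101 VALUES LESS THAN (TO_DAYS('2012-01-02')),
        PARTITION p20120102 VALUES LESS THAN (TO_DAYS('2012-01-03')),
        PARTITION pmax VALUES LESS THAN MAXVALUE
    );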
Look at creating two separate databases: one for all raw data for the writes, with minimal indexing; a second for reporting, using the aggregated values; with either a batch process to update the reporting database from the raw data database, or replication to do that for you.
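The batch variant could be little more than a scheduled INSERT ... SELECT across the two databases, e.g. (a sketch; the rawdata/reporting database names and columns are placeholders):

    -- roll up yesterday's raw rows into the reporting database
    INSERT INTO reporting.daily_visits
        (website_id, visit_date, os, referrer, visit_count)
    SELECT website_id, DATE(visit_time), os, referrer, COUNT(*)
    FROM rawdata.visits
    WHERE visit_time >= CURDATE() - INTERVAL 1 DAY
      AND visit_time < CURDATE()
    GROUP BY website_id, DATE(visit_time), os, referrer;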
EDIT
If you want to be really clever with your aggregation reports, create a set of aggregation tables ("today", "week to date", "month to date", "by year"). Aggregate from raw data to "today" either daily or in "real time"; aggregate from "by day" to "week to date" on a nightly basis; from "week to date" to "month to date" on a weekly basis, etc. When executing queries, join (UNION) the appropriate tables for the date ranges you're interested in.
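For instance, a "month so far" report could be stitched together like this (a sketch; agg_month_to_date and agg_today are placeholder names following the scheme above):

    SELECT os, SUM(visit_count) AS visits
    FROM (
        SELECT os, visit_count FROM agg_month_to_date WHERE website_id = 42
        UNION ALL  -- UNION ALL: the tables cover disjoint date ranges
        SELECT os, visit_count FROM agg_today WHERE website_id = 42
    ) AS combined
    GROUP BY os;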
EDIT #2
Rather than one table per client, we work with one database schema per client. Depending on the size of the client, we might have several schemas in a single database instance, or a dedicated database instance per client. We use separate schemas for raw data collection, and for aggregation/reporting for each client. We run multiple database servers, restricting each server to a single database instance. For resilience, databases are replicated across multiple servers and load balanced for improved performance.
#2
3
Some suggestions, in a database-agnostic fashion.
The simplest rationale is to distinguish between read-intensive and write-intensive tables. It is probably a good idea to create two parallel schemas: a daily/weekly schema and a history schema. The partitioning can be done appropriately. One can think of a batch job to update the history schema with data from the daily/weekly schema. In the history schema, again, you can create separate data tables per website (based on the data volume).
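Such a batch job might boil down to something like this (a sketch; the daily/history schema names, the per-website table, and the columns are all assumptions):

    -- copy the finished day into the per-website history table ...
    INSERT INTO history.visits_site42
    SELECT * FROM daily.visits
    WHERE website_id = 42 AND visit_time < CURDATE();

    -- ... then purge it from the write-intensive schema
    DELETE FROM daily.visits
    WHERE website_id = 42 AND visit_time < CURDATE();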
If all you are interested in is the aggregate stats alone (which may not be true), it is a good idea to have summary tables (monthly, daily) in which summaries such as total unique visitors, repeat visitors, etc. are stored; these summary tables are to be updated at the end of each day. This enables on-the-fly computation of stats without waiting for the history database to be updated.
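An end-of-day job to maintain such a table could look roughly like this (a sketch; summary_daily, visitor_id, and is_repeat are assumed names):

    INSERT INTO summary_daily
        (website_id, stat_date, unique_visitors, repeat_visitors)
    SELECT website_id,
           CURDATE() - INTERVAL 1 DAY,
           COUNT(DISTINCT visitor_id),
           SUM(is_repeat)          -- is_repeat assumed to be a 0/1 flag
    FROM visits
    WHERE visit_time >= CURDATE() - INTERVAL 1 DAY
      AND visit_time < CURDATE()
    GROUP BY website_id;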
#3
2
You should definitely consider splitting the data by site across databases or schemas - this not only makes it much easier to back up, drop, etc. an individual site/client, but also eliminates much of the hassle of making sure no customer can see any other customer's data by accident, poor coding, etc. It also means it is easier to make choices about partitioning, over and above database table-level partitioning by time or client, etc.
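In MySQL, where a schema and a database are effectively the same thing, that isolation could be set up along these lines (a sketch; all names are placeholders):

    CREATE DATABASE site42_raw;
    CREATE DATABASE site42_reporting;

    -- this client's reporting user can only ever see its own schema
    CREATE USER 'client42'@'%' IDENTIFIED BY 'changeme';
    GRANT SELECT ON site42_reporting.* TO 'client42'@'%';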
Also, you said that the data volume is 1 million rows per day. That's not particularly heavy and doesn't require huge grunt power to log/store, nor indeed to report (though if you were generating 500 reports at midnight you might logjam). However, you also said that some sites had 1m visitors daily, so perhaps your figure is too conservative?
Lastly, you didn't say whether you want real-time reporting a la Chartbeat/Opentracker etc. or a cyclical refresh like Google Analytics - this will have a major bearing on what your storage model is from day one.
M
#4
0
You really should test your way forward with simulated environments as close as possible to the live environment, with "real fake" data (correct format & length). Benchmark queries and variants of table structures. Since you seem to know MySQL, start there. It shouldn't take you that long to set up a few scripts bombarding your database with queries. Studying the results of your database with your kind of data will help you realise where the bottlenecks will occur.
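For example, "real fake" rows can be generated directly in SQL; each run of the second statement doubles the table, so a handful of runs yields millions of rows (a sketch; the visits table and its columns are assumptions):

    -- seed row
    INSERT INTO visits (website_id, visit_time, os, referrer)
    VALUES (1, NOW(), 'Windows', 'bing.com');

    -- repeat this to double the row count each time
    INSERT INTO visits (website_id, visit_time, os, referrer)
    SELECT FLOOR(1 + RAND() * 500),
           NOW() - INTERVAL FLOOR(RAND() * 86400) SECOND,
           ELT(FLOOR(1 + RAND() * 3), 'Windows', 'Mac OS X', 'Linux'),
           ELT(FLOOR(1 + RAND() * 3), 'bing.com', 'google.com', 'direct')
    FROM visits;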
Not a solution but hopefully some help on the way, good luck :)