Real-time data warehouse for web access logs

Time: 2022-01-18 15:33:14

We're thinking about setting up a data warehouse system and loading it with the web access logs that our web servers generate. The idea is to load the data in real time.

We want to present the user with a line graph of the data and let them drill down using the dimensions.

The question is how to balance and design the system so that:

(1) the data can be fetched and presented to the user in real-time (<2 seconds),

(2) data can be aggregated on a per-hour and per-day basis, and

(3) as large an amount of data as possible can still be stored in the warehouse.

Our current data rate is roughly ~10 accesses per second, which gives us ~800k rows per day. My simple tests with MySQL and a simple star schema show that my queries start to take longer than 2 seconds once we have more than 8 million rows.

Is it possible to get real-time query performance from a "simple" data warehouse like this, and still have it store a lot of data (it would be nice to never have to throw away any data)?

Are there ways to roll the data up into lower-resolution (aggregate) tables?

I have a feeling this isn't really a new problem (I've googled quite a lot, though). Could someone maybe point me to data warehouse solutions for this? One that comes to mind is Splunk.

Maybe I'm grasping for too much.

UPDATE

My schema looks like this (a rough SQL sketch follows after the list):

  • dimensions:

    • client (ip-address)
    • server
    • url
  • facts:

    • timestamp (in seconds)
    • bytes transmitted
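
To make this concrete, here is a minimal sketch of how the star schema might look as MySQL DDL. The table and column names (dim_client, dim_server, dim_url, fact_access) are only illustrative:

    -- Dimension tables
    CREATE TABLE dim_client (
      client_id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      ip_address VARCHAR(45) NOT NULL,
      UNIQUE KEY uq_ip (ip_address)
    );

    CREATE TABLE dim_server (
      server_id SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      hostname  VARCHAR(255) NOT NULL
    );

    CREATE TABLE dim_url (
      url_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      url    VARCHAR(2048) NOT NULL
    );

    -- Fact table: one row per access
    CREATE TABLE fact_access (
      ts        INT UNSIGNED NOT NULL,      -- unix timestamp, second resolution
      client_id INT UNSIGNED NOT NULL,
      server_id SMALLINT UNSIGNED NOT NULL,
      url_id    INT UNSIGNED NOT NULL,
      bytes     INT UNSIGNED NOT NULL,      -- bytes transmitted
      KEY idx_ts (ts),
      KEY idx_client (client_id),
      KEY idx_url (url_id)
    );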

4 Answers

#1


1  

Doesn't sound like it would be a problem. MySQL is very fast.

For storing logging data, use MyISAM tables -- they're much faster and well suited for web server logs. (I think InnoDB is the default for new installations these days - foreign keys and all the other features of InnoDB aren't necessary for the log tables). You might also consider using merge tables - you can keep individual tables to a manageable size while still being able to access them all as one big table.

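For example, a rough sketch of the merge-table setup, assuming one MyISAM table per month (all names here are made up for illustration):

    -- One MyISAM table per month...
    CREATE TABLE access_log_2022_01 (
      ts        INT UNSIGNED NOT NULL,
      client_id INT UNSIGNED NOT NULL,
      server_id SMALLINT UNSIGNED NOT NULL,
      url_id    INT UNSIGNED NOT NULL,
      bytes     INT UNSIGNED NOT NULL,
      KEY idx_ts (ts)
    ) ENGINE=MyISAM;

    CREATE TABLE access_log_2022_02 LIKE access_log_2022_01;

    -- ...plus a MERGE table that exposes them as one big logical table.
    -- Column and index definitions must match the underlying tables.
    CREATE TABLE access_log (
      ts        INT UNSIGNED NOT NULL,
      client_id INT UNSIGNED NOT NULL,
      server_id SMALLINT UNSIGNED NOT NULL,
      url_id    INT UNSIGNED NOT NULL,
      bytes     INT UNSIGNED NOT NULL,
      KEY idx_ts (ts)
    ) ENGINE=MERGE UNION=(access_log_2022_01, access_log_2022_02) INSERT_METHOD=LAST;

INSERT_METHOD=LAST means new rows go into the last table in the UNION list, so when a new month starts you only have to swap the UNION definition.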

If you're still not able to keep up, then get yourself more memory, faster disks, a RAID, or a faster system, in that order.

Also: Never throwing away data is probably a bad idea. If each line is about 200 bytes long, you're talking about a minimum of 50 GB per year, just for the raw logging data. Multiply by at least two if you have indexes. Multiply again by (at least) two for backups.

You can keep it all if you want, but in my opinion you should consider storing the raw data for a few weeks and the aggregated data for a few years. For anything older, just store the reports. (That is, unless you are required by law to keep it around. Even then, it probably won't be for more than 3-4 years.)

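As a sketch of that rotation, assuming an hourly aggregate table and the merge-table layout above (again, all names are illustrative):

    -- Hourly aggregate table, keyed by the start of the hour.
    CREATE TABLE agg_hourly (
      hour_start INT UNSIGNED NOT NULL,    -- unix timestamp of the hour
      client_id  INT UNSIGNED NOT NULL,
      server_id  SMALLINT UNSIGNED NOT NULL,
      url_id     INT UNSIGNED NOT NULL,
      hits       INT UNSIGNED NOT NULL,
      bytes      BIGINT UNSIGNED NOT NULL,
      KEY idx_hour (hour_start)
    ) ENGINE=MyISAM;

    -- Periodic job: roll one completed hour of raw rows into the aggregate.
    -- @hour_start is the unix timestamp of that hour, e.g.
    --   SET @hour_start = UNIX_TIMESTAMP('2022-01-18 14:00:00');
    INSERT INTO agg_hourly (hour_start, client_id, server_id, url_id, hits, bytes)
    SELECT @hour_start, client_id, server_id, url_id, COUNT(*), SUM(bytes)
    FROM access_log
    WHERE ts >= @hour_start AND ts < @hour_start + 3600
    GROUP BY client_id, server_id, url_id;

    -- Purging old raw data is then just removing the oldest monthly table
    -- from the MERGE definition and dropping it.
    ALTER TABLE access_log UNION=(access_log_2022_02);
    DROP TABLE access_log_2022_01;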

#2


2  

Seth's answer above is very reasonable, and I feel confident that if you invest in the appropriate knowledge and hardware, it has a high chance of success.

Mozilla does a lot of web service analytics. We keep track of details on an hourly basis and we use a commercial DB product, Vertica. It would work very well for this approach but since it is a proprietary commercial product, it has a different set of associated costs.

Another technology that you might want to investigate is MongoDB. It is a document-store database with a few features that make it potentially a great fit for this use case, namely capped collections (search for "mongodb capped collections" for more info).

And the fast increment operation for things like keeping track of page views, hits, etc.: http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics

#3


1  

Also, look into partitioning, especially if your queries mostly access the latest data; you could, for example, set up weekly partitions of ~5.5M rows.

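A sketch of what that might look like, assuming the fact table is called fact_access and has an integer unix-timestamp column ts:

    -- Weekly RANGE partitions on the unix timestamp (boundaries are illustrative).
    ALTER TABLE fact_access
    PARTITION BY RANGE (ts) (
      PARTITION p2022w03 VALUES LESS THAN (UNIX_TIMESTAMP('2022-01-24')),
      PARTITION p2022w04 VALUES LESS THAN (UNIX_TIMESTAMP('2022-01-31')),
      PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

New partitions would be added (and old ones dropped) by a periodic job, e.g. with ALTER TABLE ... REORGANIZE PARTITION on the catch-all pmax partition.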

If aggregating per-day and per hour, consider having date and time dimensions -- you did not list them so I assume you do not use them. The idea is not to have any functions in a query, like HOUR(myTimestamp) or DATE(myTimestamp). The date dimension should be partitioned the same way as fact tables.

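For example, with a dim_date dimension keyed by an integer date_key (say yyyymmdd) and a matching date_key column on the fact table (both are my illustrative additions), a per-day aggregation filters on plain column values:

    -- No HOUR()/DATE() wrappers around fact-table columns.
    SELECT d.calendar_date, SUM(f.bytes) AS total_bytes
    FROM fact_access f
    JOIN dim_date d ON d.date_key = f.date_key
    WHERE f.date_key BETWEEN 20220101 AND 20220118
    GROUP BY d.calendar_date
    ORDER BY d.calendar_date;

If the fact table is partitioned on date_key rather than the raw timestamp, the same predicate is also what lets the optimizer prune partitions.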

With this in place, the query optimizer can use partition pruning, so the total size of tables does not influence the query response as before.

#4


0  

This has gotten to be a fairly common data warehousing application. I've run one for years that supported 20-100 million rows a day with 0.1-second response times (from the database), and a bit over a second from the web server. This isn't even on a huge server.

Your data volumes aren't too large, so I wouldn't think you'd need very expensive hardware. But I'd still go multi-core, 64-bit with a lot of memory.

But you will want to mostly hit aggregate data rather than detail data, especially for time-series graphing over days, months, etc. Aggregate data can either be created periodically in your database through an asynchronous process, or, in cases like this, it typically works best if the ETL process that transforms your data also creates the aggregate data. Note that the aggregate is typically just a group-by of your fact table.

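For instance, assuming an hourly aggregate table agg_hourly(hour_start, client_id, server_id, url_id, hits, bytes) maintained by the ETL process (a name I'm making up for illustration), the line graph for the last 24 hours never has to touch the detail rows:

    -- Per-hour traffic for the last 24 hours, served from the aggregate only.
    SELECT hour_start, SUM(bytes) AS total_bytes
    FROM agg_hourly
    WHERE hour_start >= UNIX_TIMESTAMP() - 24 * 3600
    GROUP BY hour_start
    ORDER BY hour_start;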

As others have said - partitioning is a good idea when accessing detail data. But this is less critical for the aggregate data. Also, reliance on pre-created dimensional values is much better than on functions or stored procs. Both of these are typical data warehousing strategies.

Regarding the database - if it were me I'd try Postgresql rather than MySQL. The reason is primarily optimizer maturity: postgresql can better handle the kinds of queries you're likely to run. MySQL is more likely to get confused on five-way joins, go bottom up when you run a subselect, etc. And if this application is worth a lot, then I'd consider a commercial database like db2, oracle, sql server. Then you'd get additional features like query parallelism, automatic query rewrite against aggregate tables, additional optimizer sophistication, etc.
