Database for storing a large log table

Time: 2021-06-29 17:00:55

We have an API server that serves around 500,000 requests a day. We want to keep all these requests in a database so we can analyze the data. We log things like:

  • Who made the request
  • How long it took
  • Date and time
  • HTTP response code
  • Which API resource was requested (URL)
  • Whether the response was cached (bool)
  • +++

We want to keep these logs for 3 months, which will result in about 45,000,000 records in the database. Records older than 3 months are deleted.
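
The retention rule above can be sketched as a periodic purge. This is only an illustration, not the poster's actual setup: the table name `request_log`, its columns, the use of SQLite, and the reading of "3 months" as 90 days are all assumptions.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # assumption: "3 months" approximated as 90 days
EXPECTED_ROWS = 500_000 * RETENTION_DAYS  # 500,000 requests/day over the window


def purge_old_logs(conn: sqlite3.Connection, now: datetime) -> int:
    """Delete log rows older than the retention window; return rows removed."""
    cutoff = (now - timedelta(days=RETENTION_DAYS)).isoformat(timespec="seconds")
    cur = conn.execute("DELETE FROM request_log WHERE requested_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


# Hypothetical two-column table; timestamps are UTC ISO-8601 strings,
# which sort correctly as plain text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE request_log (user_id TEXT, requested_at TEXT)")
conn.execute("CREATE INDEX idx_log_time ON request_log (requested_at)")

now = datetime(2021, 6, 29, 17, 0, tzinfo=timezone.utc)
rows = [
    ("alice", (now - timedelta(days=1)).isoformat(timespec="seconds")),
    ("bob", (now - timedelta(days=120)).isoformat(timespec="seconds")),
]
conn.executemany("INSERT INTO request_log VALUES (?, ?)", rows)

removed = purge_old_logs(conn, now)
remaining = conn.execute("SELECT COUNT(*) FROM request_log").fetchone()[0]
print(f"removed={removed} remaining={remaining} expected_rows={EXPECTED_ROWS}")
```

The index on the timestamp column is what keeps the purge from scanning the whole table each time it runs.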

Storing these 45 million records in a SQL database is possible, but performing any analysis on the data is then really slow. We would like to do extensive analysis, such as: How many requests did a specific user make today compared to the same day last week? What percentage of requests failed today compared to any other day? We also want a trend diagram showing whether the number of requests is going up or down, and the top 10 resources requested at a given time. You get the idea: we want to be able to do all kinds of analysis like this.
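
To make the questions above concrete, here is a minimal sketch of three of them as queries. The schema, the sample rows, the use of SQLite, and the rule that a status code of 400 or higher counts as "failed" are all assumptions for illustration.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE request_log ("
    "user_id TEXT, requested_at TEXT, status_code INTEGER, resource TEXT)"
)
conn.execute("CREATE INDEX idx_user_time ON request_log (user_id, requested_at)")
conn.execute("CREATE INDEX idx_time ON request_log (requested_at)")

# Hypothetical sample data.
sample = [
    ("alice", "2021-06-22 09:00:00", 200, "/orders"),
    ("alice", "2021-06-29 09:00:00", 200, "/orders"),
    ("alice", "2021-06-29 09:05:00", 500, "/orders"),
    ("bob", "2021-06-29 10:00:00", 200, "/users"),
]
conn.executemany("INSERT INTO request_log VALUES (?, ?, ?, ?)", sample)


def day_bounds(day: date):
    """Half-open [start, end) range so the timestamp index can be used."""
    return day.isoformat(), (day + timedelta(days=1)).isoformat()


def requests_by_user(conn, user_id: str, day: date) -> int:
    sql = ("SELECT COUNT(*) FROM request_log "
           "WHERE user_id = ? AND requested_at >= ? AND requested_at < ?")
    return conn.execute(sql, (user_id, *day_bounds(day))).fetchone()[0]


def failure_percent(conn, day: date):
    # Assumption: status_code >= 400 means the request failed.
    sql = ("SELECT 100.0 * SUM(status_code >= 400) / COUNT(*) FROM request_log "
           "WHERE requested_at >= ? AND requested_at < ?")
    return conn.execute(sql, day_bounds(day)).fetchone()[0]


def top_resources(conn, day: date, limit: int = 10):
    sql = ("SELECT resource, COUNT(*) AS hits FROM request_log "
           "WHERE requested_at >= ? AND requested_at < ? "
           "GROUP BY resource ORDER BY hits DESC, resource LIMIT ?")
    return conn.execute(sql, (*day_bounds(day), limit)).fetchall()


today = date(2021, 6, 29)
last_week = today - timedelta(days=7)
print(requests_by_user(conn, "alice", today), requests_by_user(conn, "alice", last_week))
print(failure_percent(conn, today))
print(top_resources(conn, today))
```

Filtering on a timestamp range rather than wrapping the column in a date function lets the database use the indexes, which matters far more at 45 million rows than in this toy example.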

Can you give any advice on where to store these logs so we can do this kind of analysis in real time (or near real time)? Is there a NoSQL database that would be good for this? Azure? I see there is something called Azure SQL Data Warehouse; could that be used for this? I have looked at Microsoft Power BI, which will probably be great for analyzing the data, but where do I store the data?

I would really appreciate it if someone has some suggestions for me.

2 answers

#1


2  

Power BI is potentially a good solution for you. It actually spins up a SQL Server Analysis Services instance in memory, which is effectively an "OLAP data warehouse". Infrastructure requirements are minimal as you design in a free PBI Desktop tool and publish to Microsoft's cloud for PBI Web users.

There are limits to the data that can be published (see the link below). Note that PBI uses the very effective VertiPaq compression, so datasets are typically a lot smaller than your raw data. I often see 10k-50k rows per MB, so 45 million rows should be achievable with a single Pro license. Ruthlessly filter your column list in PBI Desktop to optimise this.
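
As a rough back-of-the-envelope check of that estimate, the rows-per-MB range quoted above can be turned into a dataset size range. The function name is hypothetical, and the compression ratios are the anecdotal figures from this answer, not measured values.

```python
ROWS = 45_000_000  # about 3 months of logs at 500,000 requests a day


def estimated_size_mb(rows: int, rows_per_mb: int) -> float:
    """Estimated compressed dataset size for a given compression ratio."""
    return rows / rows_per_mb


best_case_mb = estimated_size_mb(ROWS, 50_000)   # best compression quoted
worst_case_mb = estimated_size_mb(ROWS, 10_000)  # worst compression quoted
print(f"estimated dataset size: {best_case_mb:.0f} MB to {worst_case_mb:.0f} MB")
```

The range is wide (900 MB to 4,500 MB), so it is worth comparing both ends against the storage limits on the linked page; trimming columns pushes the result toward the lower end.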

https://powerbi.microsoft.com/en-us/documentation/powerbi-admin-manage-your-data-storage-in-power-bi/

With a PBI Pro license you can refresh hourly, up to 8 times a day:

https://powerbi.microsoft.com/en-us/documentation/powerbi-refresh-data/

Building SQL databases and OLAP/SSAS solutions has been a good career for me over the last 20 years. That is still the "Rolls Royce" solution if you have the time and money. But after 20 years I am still learning as it is a technically challenging area. If you don't already have those skills, I suggest Power BI would be a more productive path.

#2


1  

You will absolutely want to store your logs in a SQL OLTP database. The very nature of a log table is transactional: you will be constantly updating it and will benefit from the speed of commits.

The reporting speed issue you mention can be resolved by building an OLAP data warehouse on top of the log database. Your data model seems quite simple, so it wouldn't take much development work to implement.

The only way to get real-time reporting is to build your reports directly on top of the OLTP database. If you can live with a delay, most places opt to rebuild their cubes overnight, which gives near-instant reports on data that is up to 24 hours old.
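
A full OLAP cube is beyond a short example, but the overnight-rebuild idea can be sketched with a plain pre-aggregated summary table as a simplified stand-in. Everything here is an assumption for illustration: the table and column names, SQLite as the engine, and the rule that a status code of 400 or higher counts as a failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE request_log (
    user_id TEXT, requested_at TEXT, status_code INTEGER, resource TEXT);
CREATE TABLE daily_summary (
    day TEXT, resource TEXT, requests INTEGER, failures INTEGER,
    PRIMARY KEY (day, resource));
""")

# Hypothetical sample data in the transactional log table.
sample = [
    ("alice", "2021-06-28 09:00:00", 200, "/orders"),
    ("bob", "2021-06-28 10:00:00", 404, "/users"),
    ("alice", "2021-06-29 09:00:00", 200, "/orders"),
    ("alice", "2021-06-29 09:05:00", 500, "/orders"),
    ("bob", "2021-06-29 11:00:00", 200, "/orders"),
]
conn.executemany("INSERT INTO request_log VALUES (?, ?, ?, ?)", sample)


def rebuild_summary(conn: sqlite3.Connection) -> None:
    """Nightly job: recompute the whole summary in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM daily_summary")
        conn.execute("""
            INSERT INTO daily_summary (day, resource, requests, failures)
            SELECT date(requested_at), resource, COUNT(*), SUM(status_code >= 400)
            FROM request_log
            GROUP BY date(requested_at), resource
        """)


def daily_trend(conn: sqlite3.Connection):
    """Requests per day, read from the small summary table only."""
    sql = "SELECT day, SUM(requests) FROM daily_summary GROUP BY day ORDER BY day"
    return conn.execute(sql).fetchall()


rebuild_summary(conn)
print(daily_trend(conn))
```

Reports then read only `daily_summary`, which holds one row per day and resource instead of one per request; the trade-off is that the numbers are only as fresh as the last rebuild.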

Apologies for the conceptual response, but short of designing your infrastructure for you, I think that's as far as one can go in the Q&A format.
