Migrating from a relational database to big data

Date: 2022-06-01 15:50:19

Currently I have an application hosted on the Google Cloud Platform that offers web analytics and provides session activity (clicks, downloads etc) and ties that web activity with web registrations.

At the moment we store all of our click and session profile data in MySQL and use SQL queries to generate both aggregate and per-user reports, however, as the amount of data has grown, we are seeing a real slow-down in query responses which is in turn slowing down page-load times.

In investigating ways we can solve this problem, we have looked into tools available on Google Cloud Platform like Dataproc and Dataflow as well as NoSQL solutions, however, I am having a hard time understanding how we could apply our current solution to any of these solutions.

Currently, a rough idea of our data schema is as follows:

User table
- id
- name
- email

Profile table (web browser/device)
- id
- user id
- user agent string

Session table
- id
- profile id
- session string

Action table
- id
- session id
- action type
- action details
- timestamp

Based on my research, my understanding is that the best solution would be to store action data in a NoSQL database such as Bigtable, which would feed data into something like Dataproc or Dataflow to generate the reports. However, the fact that our current schema is a highly relational structure seems to rule out moving to a NoSQL solution, as all my research indicates that you shouldn't move relational data to a NoSQL solution.

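For context on what "storing action data in Bigtable" tends to look like in practice: Bigtable stores rows sorted lexicographically by row key, so time-series designs usually encode the entity plus a reversed timestamp in the key. The sketch below is an illustration of that pattern, not code from the question; the key layout and field names are assumptions.

```python
import sys

def action_row_key(user_id: int, timestamp_ms: int) -> str:
    """Bigtable-style row key: entity id plus reversed timestamp, so that
    each user's rows cluster together and the newest action sorts first."""
    reversed_ts = sys.maxsize - timestamp_ms
    return f"user{user_id:010d}#{reversed_ts:019d}"

# Newer action produces a lexicographically smaller key for the same user,
# so a prefix scan on "user0000000042#" returns most-recent-first.
k_new = action_row_key(42, 1_654_000_000_000)
k_old = action_row_key(42, 1_653_000_000_000)
assert k_new < k_old
```

With a key like this, per-user reports become cheap prefix scans rather than full-table filters, which is the main thing a wide-column store buys you over row-at-a-time SQL lookups.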
My question is: is my understanding of how to apply these tools correct? Or are there better solutions? Is it even necessary to consider moving away from MySQL? And if not, what kinds of solutions are available that would allow us to pre-process/generate reporting data in the background?

3 Answers

#1 (score: 5)

Assuming that the sessions and actions tables are insert-only (rows are never updated), the best approach would be to split the database into two parts: keep the MySQL DB for the user and profile tables, and use BigQuery for the actions and sessions.

This way you get the following:

  • minimize the amount of change you have to make on either side (data ingestion and extraction)
  • significantly reduce the cost of data storage
  • query times will improve significantly
  • before you know it, you will be in big data territory, and BigQuery is exactly the solution for it

BigQuery is the best way here. But if you do have spare resources and time available, you can look into storing the data in a NoSQL DB and then running a pipeline job on it with Dataflow to extract the analytics data, which you would again need to store in a database for querying purposes.
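One practical consequence of splitting the data this way is that per-user reports are joined in application code: user rows come from MySQL, aggregates come from BigQuery. The sketch below illustrates that join; the two fetch functions are stand-ins for real MySQL and BigQuery client calls, and all names are assumptions, not from the question.

```python
def fetch_users_from_mysql():
    # Stand-in for e.g. a mysql-connector query: SELECT id, name FROM user
    return [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

def fetch_action_counts_from_bigquery():
    # Stand-in for e.g.:
    #   SELECT user_id, COUNT(*) AS actions FROM actions GROUP BY user_id
    return {1: 57, 2: 3}

def per_user_report():
    """Join the small relational side with the big aggregate side in app code."""
    counts = fetch_action_counts_from_bigquery()
    return [
        {**u, "actions": counts.get(u["id"], 0)}
        for u in fetch_users_from_mysql()
    ]

print(per_user_report())
```

The point of the pattern is that the expensive GROUP BY over billions of action rows runs in BigQuery, while MySQL only ever serves small, indexed lookups.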

#2 (score: 3)

A couple of questions / potential solutions:

  1. Profile! If it's the same queries thrashing the database, then optimising those queries or caching some of the results for your most frequent pages can help offload processing. Ditto for database settings, RAM, etc.
  2. How big is your database? If it's less than 64 GB, scaling up to a larger server where the database fits into RAM could be a quick win.
  3. How is your data being used? If it's purely historical, you could potentially reduce your clicks down into a lookup table, e.g. actions per session per week or per user per week. If the data is collated per 5 minutes / hour, downloading the raw data and processing it like this locally can work too.
  4. You can denormalise, e.g. combine user agent|session|action type|details|timestamp into one row, but you potentially increase your storage requirements and lookup time.
  5. Alternatively, more normalisation can help too. Breaking the user agent string out into its own table will reduce that table's data requirements and might speed things up.
  6. It seems like your data could be split up / sharded by user, so that could be another option.
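The lookup-table idea in point 3 above can be sketched concretely: roll raw actions up to (user, ISO week, action type) counts and query the small rollup instead of the raw clicks. Field names follow the schema in the question; the rest is an illustrative assumption.

```python
from collections import Counter
from datetime import datetime

def weekly_rollup(actions):
    """Aggregate raw action rows into per-user, per-ISO-week, per-type counts."""
    counts = Counter()
    for a in actions:
        year, week, _ = a["timestamp"].isocalendar()
        counts[(a["user_id"], f"{year}-W{week:02d}", a["action_type"])] += 1
    return counts

raw = [
    {"user_id": 1, "action_type": "click", "timestamp": datetime(2022, 5, 30)},
    {"user_id": 1, "action_type": "click", "timestamp": datetime(2022, 6, 1)},
    {"user_id": 1, "action_type": "download", "timestamp": datetime(2022, 6, 1)},
]
print(weekly_rollup(raw))
```

A rollup like this can be recomputed nightly (or incrementally) and stored back in MySQL, so report queries touch a few hundred summary rows instead of millions of raw actions.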

In general, the fastest way to work these questions out is to try them against your specific workloads, e.g. see how many of your typical requests (or random dashboards) you can serve on a development machine with a reasonable amount of RAM (or spin up a server / create a separate test database).

Also, if you're mostly used to relational databases, there will be some overhead in switching (particularly for bleeding-edge solutions), so you need to be fairly sure that the benefits outweigh the costs before you switch, or switch a little at a time so that you can switch back if it doesn't work out. Again, testing helps.

#3 (score: 0)

If practical, do not store the massive amount of data at all!

Instead, summarize (aggregate) chunks of data as they arrive, then store the summaries.

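The summarize-on-arrival approach can be sketched as follows: accumulate per-(user, action type, hour) counts in memory as events come in, and periodically flush the totals to a summary table instead of inserting every raw row. All names here are illustrative assumptions.

```python
from collections import Counter

class HourlySummarizer:
    """Accumulate action counts per (user, action_type, hour bucket)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, user_id, action_type, ts_epoch_s):
        hour_bucket = ts_epoch_s - (ts_epoch_s % 3600)  # floor to the hour
        self.counts[(user_id, action_type, hour_bucket)] += 1

    def flush(self):
        # In the real system this would be an
        # INSERT ... ON DUPLICATE KEY UPDATE against a summary table;
        # here we just return the totals and reset.
        summary, self.counts = dict(self.counts), Counter()
        return summary

s = HourlySummarizer()
s.record(1, "click", 7200)
s.record(1, "click", 7260)   # same hour bucket
s.record(1, "click", 10800)  # next hour
print(s.flush())
```

Because each flush writes one row per (user, type, hour) rather than one row per click, the write volume and the table the reports scan both shrink by roughly the per-hour click rate.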
Advantages:

  • Perhaps one-tenth as much disk space needed.
  • Reports are perhaps 10 times as fast.
  • Can be done in the existing RDBMS.

Disadvantages:

  • You cannot retrofit a different summarization. (OK, you could keep the raw data and start over; this may be better anyway.)
  • More code complexity.

Discussion of Summary Tables.
