BigQuery替代中型数据

This is a follow-up to the question Why doesn't BigQuery perform as well on small data sets.

这是为什么BigQuery在小数据集上表现不佳的问题的后续问题。

Let's suppose I have a data-set that is ~1M rows. In the current database that we're using (mysql) aggregation queries would run quite slow, perhaps taking ~10s or so on complex aggregations. On BigQuery, the initialization time required might make this query take ~3 seconds, better than in mysql, but the wrong tool for the job, if we need to return queries in 1s or under.

假设我有一个大约1M行的数据集。在我们使用的当前数据库（mysql）中，聚合查询运行速度很慢，可能需要大约10秒左右的复杂聚合。在BigQuery上，所需的初始化时间可能会使这个查询花费大约3秒，比在mysql中更好，但是如果我们需要在1s或更低版本中返回查询，那么该工作的工具是错误的。

My question then is, what would be a good alternative to using BigQuery on doing aggregated queries on moderate-sized data-sets, such as 1-10M rows? An example query might be:

那么我的问题是，使用BigQuery对中等大小的数据集（例如1-10M行）进行聚合查询会有什么好的选择？示例查询可能是：

SELECT studio, territory, count(*)
FROM mytable
GROUP BY studio, territory
ORDER BY count(*) DESC

Possible solutions I've thought of are ElasticSearch (https://github.com/NLPchina/elasticsearch-sql) and Redshift (postgres is too slow). What would be a good option here that can be queried via SQL?

我想到的可能解决方案是ElasticSearch（https://github.com/NLPchina/elasticsearch-sql）和Redshift（postgres太慢）。什么是可以通过SQL查询的好选择？

Note: I'm not looking for why or how BQ should be used, I'm looking for an alternative for data sets under 10M rows where the query can be returned in under ~1s.

注意：我不是在寻找为什么或如何使用BQ，我正在寻找10M行以下数据集的替代方案，其中查询可以在~1s内返回。

11 个解决方案

#1

Here are a few alternatives to consider for data of this size:

以下是此尺寸数据的一些替代方案：

Single Redshift small SSD node
- No setup. Easily returns answers on this much data in under 1s.
- 没有设置。在1s以下轻松返回这么多数据的答案。
单个Redshift小型SSD节点无设置。在1s以下轻松返回这么多数据的答案。
Greenplum on a small T2 instance
- Postgres-like. Similar perf to Redshift. Not paying for storage you won't need. Start with their single node "sandbox" AMI.
- Postgres的样。与Redshift相似的性能。不支付存储费用，您不需要。从他们的单节点“沙箱”AMI开始。
Greenplum在一个小的T2实例上像Postgres一样。与Redshift相似的性能。不支付存储费用，您不需要。从他们的单节点“沙箱”AMI开始。
MariaDB Columnstore
- MySQL-like. Used to be called InfiniDB. Very good performance. Supported by MariaDB (the company).
- 类似MySQL的。曾经被称为InfiniDB。非常好的表现。由MariaDB（公司）提供支持。
MariaDB Columnstore类似于MySQL。曾经被称为InfiniDB。非常好的表现。由MariaDB（公司）提供支持。
Apache Drill
- Drill has a very similar philosophy to BiqQuery but can be used to anywhere (it's just a jar). Queries will be fast on this size data.
- Drill与BiqQuery有着非常相似的理念，但可以用于任何地方（它只是一个罐子）。此大小数据的查询速度很快。
Apache Drill Drill与BiqQuery有着非常相似的理念，但可以用于任何地方（它只是一个罐子）。此大小数据的查询速度很快。

If low admin / quick start is critical go with Redshift. If money / flexibility is critical start with Drill. If you prefer MySQL start with MariaDB Columnstore.

如果低管理/快速启动很关键，请使用Redshift。如果金钱/灵活性至关重要，请从Drill开始。如果您更喜欢MySQL，请使用MariaDB Columnstore。

#2

If you need answers in less than a second, you need to think about indexing.

如果您需要在不到一秒的时间内获得答案，则需要考虑索引。

Typical story:

典型故事：

MySQL (or any other database proposed here) is fast, until...
MySQL（或此处提出的任何其他数据库）很快，直到......
One day some of your aggregation queries start running slow. Minutes, hours, days, etc.
有一天，你的一些聚合查询开始运行缓慢。分钟，小时，天等
Typical solution for step 2 is indexing and pre-aggregating. If you want answers in less than a second for certain type of questions, you'll need to invest time and optimization cycles to answer just that type of questions.
步骤2的典型解决方案是索引和预聚合。如果您想在某种类型的问题中在不到一秒的时间内得到答案，那么您需要投入时间和优化周期来回答这类问题。
BigQuery's beauty is that you can skip step 3. Bring those minutes/hours/days to seconds, with minimal investment - any query, at any time.
BigQuery的美妙之处在于您可以跳过第3步。将这些分钟/小时/天带到秒，只需最少的投资 - 随时查询。

BigQuery is awesome because it gives you 4. But you are asking for 3, MySQL is fine for that, Elasticsearch is fine too, any indexed database will bring you results in less than a second - as long as you invest time on optimizing your system for certain type of question. Then to get answers for any arbitrary question without investing any optimization time, use BigQuery.

BigQuery很棒，因为它给你4.但是你要求3，MySQL就可以了，Elasticsearch也很好，任何索引数据库都会在不到一秒的时间内为你带来结果 - 只要你花时间优化你的系统对于某些类型的问题。然后在不投入任何优化时间的情况下获得任意问题的答案，请使用BigQuery。

BigQuery: Will answer arbitrary questions in seconds, no preparation needed.

BigQuery：将在几秒钟内回答任意问题，无需准备。

MySQL and alternatives: Will answer certain type of questions in less than a second, but it will take development time to get there.

MySQL和替代方案：将在不到一秒的时间内回答某些类型的问题，但需要花费一些时间才能实现。

#3

I know SQL Server, so my answer is biased.

我知道SQL Server，所以我的答案是有偏见的。

10M rows should easily fit in memory, so any kind of aggregation should be fast, especially if you have a covering index. If it doesn't, the server configuration may need adjustment. Also, SQL Server has so-called in-memory tables, which may be a good fit here.

10M行应该很容易适合内存，因此任何类型的聚合都应该很快，特别是如果你有覆盖索引。如果没有，则可能需要调整服务器配置。此外，SQL Server具有所谓的内存表，这可能非常适合这里。
SQL Server has a feature called indexed view. Your aggregating query is a classic use case of an indexed view. Indexed view is essentially a copy of the data stored on disk and maintained by the server automatically as the underlying data in the table changes. It slows INSERTS, DELETES and UPDATES, but makes SELECT fast, because summary is always pre-calculated. See: What You Can (and Can’t) Do With Indexed Views. Other DBMSes should have similar features.

SQL Server有一个称为索引视图的功能。您的聚合查询是索引视图的经典用例。索引视图本质上是存储在磁盘上的数据的副本，由服务器自动维护，因为表中的基础数据发生了变化。它会减慢INSERTS，DELETES和UPDATES，但会使SELECT快速，因为摘要总是预先计算的。请参阅：索引视图可以（而且不能）执行的操作。其他DBMS应该具有类似的功能。

#4

If you don't need concurrency, multiple users connecting simultaneously, and your data can fit in a single disk file, then SQLite might be appropriate.

如果您不需要并发，多个用户同时连接，并且您的数据可以放在单个磁盘文件中，那么SQLite可能是合适的。

As they say, SQLite does not compete with client/server databases. SQLite competes with fopen().

正如他们所说，SQLite不会与客户端/服务器数据库竞争。 SQLite与fopen（）竞争。

http://www.sqlite.org/whentouse.html

#5

I think Microsoft SQL Server Analysis Services is a good option, I used myself, it is database behind the PowerBI service,which has a very good free tier option.

我认为Microsoft SQL Server Analysis Services是一个不错的选择，我自己使用，它是PowerBI服务背后的数据库，它有一个非常好的免费层选项。

if you want a free on premise solution, you can always use SQL Server express with the new columnstore technology, i did not use it myself, but i heard some very good results

如果你想要一个免费的内部部署解决方案，你可以随时使用SQL Server express和新的columnstore技术，我自己没有使用它，但我听到了一些非常好的结果

#6

If that is your only query, then this will make it run faster:

如果这是您唯一的查询，那么这将使其运行更快：

INDEX(studio, territory)  -- in either order.

If there are other variations, let's see them, plus SHOW CREATE TABLE.

如果还有其他变体，让我们看看它们，再加上SHOW CREATE TABLE。

Another thing to check: How much RAM do you have, ans what is the value of innodb_buffer_pool_size? That setting should be about 70% of RAM (if you have more than 4GB of ram).

要检查的另一件事是：你有多少RAM，和innodb_buffer_pool_size的值是多少？该设置应该是RAM的70％左右（如果你有超过4GB的RAM）。

#7

Do not use COUNT(*).

不要使用COUNT（*）。

Use COUNT() on a single column, preferably an indexed one like the PRIMARY KEY.

在单个列上使用COUNT（），最好是像PRIMARY KEY那样的索引。

#8

My Answer: Optimize query and table structure as previously addressed (1 sec or less). Read on below for further reasoning, because we all fall into this trap. Note: The above is not necessarily a big dataset.

我的答案：优化查询和表结构，如前所述（1秒或更短）。请继续阅读下面的进一步推理，因为我们都陷入了这个陷阱。注意：以上不一定是大数据集。

A great question. It is such a struggle to decipher what is the problem and what is a solution. Here's a shot coming from old school. In the old days, we use to say you ask a hardware, OS, or developer what the problem/solution is and you will get three different answers.

一个很好的问题。解读什么是问题以及什么是解决方案是如此艰难。这是来自旧学校的镜头。在过去，我们会说你问硬件，操作系统或开发人员问题/解决方案是什么，你会得到三个不同的答案。

My understanding is this question is asking to solve/compare a SQL performance problem with a cloud infrastructure solution. This question will have many different answers based on background. It is confusing, you have just old school database installations(Mysql, Oracle, MSsql), Database As A Service(DBAAS), Big Data Cloud solutions, Big Data Application Solutions(hadoop)

我的理解是这个问题要求解决/比较SQL性能问题与云基础架构解决方案。这个问题将根据背景有许多不同的答案。令人困惑的是，你只有旧的学校数据库安装（Mysql，Oracle，MSsql），数据库即服务（DBAAS），大数据云解决方案，大数据应用解决方案（hadoop）

It is so easy to get tangled up in all this technology. Maybe here is a little clarity.

所有这些技术都很容易纠缠在一起。也许这里有点清晰。

SQL performance problems can be solved in a variety of points of performance(POP) .

SQL性能问题可以在各种性能点（POP）中解决。

SQL Optimization and Tuning (Temp tables, In-Memory, OLAP functions, Sql Plan, Parallelization, analytics ) Tools (MySql Workbench, cmdline, Toad, etc)
SQL优化和调优（临时表，内存，OLAP函数，Sql计划，并行化，分析）工具（MySql Workbench，cmdline，Toad等）
Structure Optimization (Tables, Indexing, Partitioning, Pre-Ag Structures)
结构优化（表，索引，分区，预Ag结构）
Database configuration (Memory Size, Cache sizes, Parallelization, Block Size, etc..
数据库配置（内存大小，缓存大小，并行化，块大小等）。
OS memory, page size, Processes)
操作系统内存，页面大小，进程）
Hardware and Network - Mostly irrellivant now.
硬件和网络 - 现在大部分都是无关紧要的。
Server Provisioning.
服务器配置。
Cloud provisioning and clustering.
云配置和群集。
Infrastructure and Software decisions.
基础设施和软件决策。

Bottom Line: I will stop here, we have so many solutions for problems. Try to start with the most basic usage of a technology before incurring costs solving solutions with bigger technologies. Hopefully this will give the user a skeleton of a path to work through or terminology to use when asking a question. How do I get x query to run in time t?

一句话：我会在这里停下来，我们有很多问题的解决方案。在使用更大的技术产生成本解决方案之前，尝试从技术的最基本用途开始。希望这将为用户提供一个工作路径的骨架或在提问时使用的术语。如何让x查询在时间t运行？

#9

You don't talk much about the problem space you're in - but have you considered python pandas, or R? These are great tools for data analytics / development.

你不太谈论你所处的问题空间 - 但你考虑过python pandas还是R？这些是用于数据分析/开发的出色工具。

Assuming you have python and pandas handy pip install pandas you can get started with something like this:

假设你有python和pandas方便的pip安装pandas，你可以开始使用这样的东西：

import pandas as pd
import pyodbc

conn = pyodbc.connect(...) # You'll need to figure out the settings for your DB here
# this slow but only needs to be done once:
data = pd.read_sql_query('select * from mytable') # Load everything into memory 

# Now do the query:
data.groupby(['studio', 'territory']).count().sort_values(ascending=False)

I strongly recommend trying out pandas with Jupyter Notebooks

我强烈建议用Jupyter笔记本试用大熊猫

#10

If you are looking for sub-second OLAP query results then Druid (http://druid.io/) was built for that purpose. It is a beast to deploy and tune, but once you get it configured properly for your data it is very very fast. It has streaming support so you can ingest from Kafka with exactly once semantics which is awesome. It scales very well from small to massive amounts of data - although you will pay a cost as it does pre-aggregation so if you have a lot of dimensions the data size explodes. SQL support has only recently been added and is incomplete. Also it does not support joins so you have to structure your data properly to get your answers out.

如果您正在寻找亚秒级OLAP查询结果，那么德鲁伊（http://druid.io/）就是为此目的而构建的。它是一个部署和调整的野兽，但是一旦你为数据正确配置它就会非常快。它具有流媒体支持，因此您可以使用恰好一次的语义从Kafka中摄取，这非常棒。它可以很好地从小型数据扩展到大量数据 - 尽管您将支付预先聚合的成本，因此如果您有大量维度，则数据大小会爆炸。 SQL支持最近才被添加并且不完整。此外，它不支持连接，因此您必须正确构建数据以获得答案。

#11

-1

BigQuery is meant to perform best at the end of the Big Data pipeline. It has been designed so as to perform well with large data sets, not small ones, and is not meant as a replacement for existing technologies, but rather as an excellent complement in certain situations. An example can be read in the “Google Cloud Big Data and Machine Learning Blog” document.

BigQuery旨在在大数据管道的末尾表现最佳。它的设计是为了能够很好地处理大型数据集，而不是小型数据集，并不是现有技术的替代品，而是在某些情况下的优秀补充。可以在“Google Cloud大数据和机器学习博客”文档中阅读示例。

#1