In the Redshift FAQ under
Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?
It says the following:
Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
Why is this the case?
3 Answers
#1
5
It's a bit disingenuous, to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.
Materialised Views
I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.
I would dispute this: the product I work on maintains its own materialised views and shows significant performance gains from doing so. Perhaps AWS believes I must be doing something wrong in the first place?
Indexes
RedShift does not have indexes.
It does have SORT ORDER, which is very similar to a clustered index. It is simply a list of fields by which the data is ordered (like a composite clustered index).
It has even recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to have multiple independent sort orders. Instead of ordering by a THEN b THEN c, it effectively orders by each of them at the same time.
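To make "orders by each of them at the same time" a bit more concrete, here is a minimal sketch. It assumes the interleaving behaves like a Z-order (Morton) curve, which is one common way to get this effect; I am not claiming this is exactly what RedShift does internally. The function names and the tiny 4x4 data set are made up for illustration.

```python
def interleave_bits(a: int, b: int, bits: int = 8) -> int:
    """Build a Z-order (Morton) key by alternating the bits of a and b."""
    key = 0
    for i in range(bits):
        key |= ((a >> i) & 1) << (2 * i + 1)  # bits of a go to odd positions
        key |= ((b >> i) & 1) << (2 * i)      # bits of b go to even positions
    return key


def block_ranges(ordered, col, size=4):
    """Min/max of one column for each consecutive block of `size` rows."""
    ranges = []
    for i in range(0, len(ordered), size):
        values = [row[col] for row in ordered[i:i + size]]
        ranges.append((min(values), max(values)))
    return ranges


# Hypothetical table with two columns (a, b), every combination of 0..3.
rows = [(a, b) for a in range(4) for b in range(4)]

compound = sorted(rows)  # ORDER BY a THEN b
interleaved = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))

# With the compound order, every block spans the full range of b,
# so a filter on b alone cannot skip anything.
compound_b = block_ranges(compound, col=1)

# With the interleaved order, each block covers a narrow range of both a and b.
interleaved_a = block_ranges(interleaved, col=0)
interleaved_b = block_ranges(interleaved, col=1)
```

The trade-off is visible in the ranges: the compound order gives the tightest possible ranges for a and useless ones for b, while the interleaved order gives moderately tight ranges for both.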
That becomes possible because of how RedShift implements its column store.
- Each column is stored separately from every other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics
As well as being the storage pattern, this effectively becomes a set of pseudo indexes.
- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats give the minimum and maximum values stored in each block
- This allows RedShift to skip many of those blocks under certain conditions
- This in turn allows RedShift to identify which blocks to read from the other columns
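The steps above can be sketched in a few lines. This is only a toy model under stated assumptions: blocks here hold four values instead of 1MB, and I assume the blocks of different columns line up row for row, which a real column store does not guarantee. `Block`, `make_blocks` and `candidate_blocks` are names I made up for the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    values: List[int]

    @property
    def lo(self) -> int:
        return min(self.values)  # summary statistic: minimum in the block

    @property
    def hi(self) -> int:
        return max(self.values)  # summary statistic: maximum in the block


def make_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split one column into fixed-size blocks (stand-in for 1MB blocks)."""
    return [Block(column[i:i + block_size])
            for i in range(0, len(column), block_size)]


def candidate_blocks(blocks: List[Block], target: int) -> List[int]:
    """Indexes of blocks whose min/max range could contain the target."""
    return [i for i, b in enumerate(blocks) if b.lo <= target <= b.hi]


# Column z is not the sort key, but its values happen to be clustered.
z = [10, 12, 15, 11, 1200, 1234, 1250, 1210, 40, 42, 41, 45]
a = list(range(100, 112))  # another column of the same (hypothetical) table

z_blocks = make_blocks(z, 4)
a_blocks = make_blocks(a, 4)

# Step 1: use the block statistics of z to decide which blocks to read.
hits = candidate_blocks(z_blocks, 1234)

# Step 2: read only those blocks of z, and the matching blocks of column a.
matching_a = [a_blocks[i].values[j]
              for i in hits
              for j, v in enumerate(z_blocks[i].values)
              if v == 1234]
```

Only one of the three blocks of z is read, and the same block number tells you which block of a to fetch. If the values of z were spread evenly across all blocks, every block's range would contain 1234 and nothing could be skipped, which is the "certain conditions" caveat.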
#2
1
This is too long for a comment.
The simple answer is: because it can read the needed data really, really fast and in parallel.
One of the primary uses of indexes is "needle-in-the-haystack" queries. These are queries where only a relatively small number of rows are needed and these match a WHERE clause. Columnar datastores handle these differently. The entire column is read into memory, but only that column, not the rest of the row's data. This is sort of similar to having an index on each column, except the values need to be scanned for the match (that is where the parallelism comes in handy).
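A rough illustration of that idea, as a sketch rather than anything RedShift-specific: the table is a dict of columns, only the filtered column is scanned, and the scan is split into chunks handled by a thread pool. The table contents, `scan_chunk`, `parallel_scan` and `fetch` are all hypothetical names for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Columnar layout: one list per column instead of one record per row.
table = {
    "orderid": [1, 2, 3, 4, 5],
    "supplier": ["acme", "bolt", "acme", "core", "bolt"],
    "quantity": [10, 5, 7, 3, 9],
}


def scan_chunk(args):
    """Scan one chunk of a column and return matching row positions."""
    offset, chunk, wanted = args
    return [offset + i for i, v in enumerate(chunk) if v == wanted]


def parallel_scan(column, wanted, chunk_size=2):
    """Scan a single column in parallel chunks; other columns are untouched."""
    tasks = [(i, column[i:i + chunk_size], wanted)
             for i in range(0, len(column), chunk_size)]
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(scan_chunk, tasks))  # results keep chunk order
    return [pos for part in parts for pos in part]


def fetch(table, columns, positions):
    """Read only the requested columns at the matching positions."""
    return [tuple(table[c][i] for c in columns) for i in positions]


# WHERE supplier = 'core': scan one column, then fetch two others.
positions = parallel_scan(table["supplier"], "core")
result = fetch(table, ["orderid", "quantity"], positions)
```

There is no index lookup here at all; the cost is one full pass over a single narrow column, divided across workers.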
Other uses of indexes are for matching key pairs for joining or for aggregations. These can be handled by alternative hash-based algorithms.
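For readers who have not seen the hash-based alternatives, here is a minimal sketch of a hash join and a hash aggregation. It shows the general technique only; the row format (dicts), the function names and the sample data are assumptions for illustration, not a description of RedShift's executor.

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key):
    """Join two row lists by building a hash table on the right side."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)  # build phase
    # Probe phase: one hash lookup per left row, no index needed.
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


def hash_aggregate(rows, key, value):
    """SUM(value) GROUP BY key using a hash table of running totals."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


orders = [
    {"orderid": 1, "supplier_id": 10, "quantity": 5},
    {"orderid": 2, "supplier_id": 20, "quantity": 3},
    {"orderid": 3, "supplier_id": 10, "quantity": 2},
]
suppliers = [
    {"supplier_id": 10, "name": "acme"},
    {"supplier_id": 20, "name": "bolt"},
]

joined = hash_join(orders, suppliers, "supplier_id", "supplier_id")
totals = hash_aggregate(joined, "name", "quantity")
```

Both operations make a single pass over each input, which suits a system that is already scanning whole columns.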
As for materialized views, RedShift's strength is not updating data. Many such queries are fast enough without materialization. Materialization also incurs a lot of overhead for maintaining the data in a high-transaction environment. If you don't have a high-transaction environment, then you can incrementally update temporary tables after batch loads.
#3
1
Indexes are mainly used in OLTP systems to retrieve a specific value or a small group of values. OLAP systems, in contrast, retrieve a large set of values and perform aggregations on that set. Indexes are not a good fit for OLAP systems. Instead, Redshift uses a secondary structure called zone maps, together with sort keys.
Indexes are typically built on B-trees. The 'life without a btree' section in the blog post below explains, with examples, how a B-tree-based index affects OLAP workloads.
https://blog.chartio.com/blog/understanding-interleaved-sort-keys-in-amazon-redshift-part-1
The combination of columnar storage, compression encodings, data distribution, query compilation, optimization etc. is what makes Redshift faster.
Applying the above factors reduces IO operations on Redshift and ultimately provides better performance. Implementing an efficient solution requires a great deal of knowledge of the above topics, as well as of the queries that you would run on Amazon Redshift.
For example, Redshift supports sort keys, compound sort keys and interleaved sort keys. Suppose your table structure is lineitem(orderid,linenumber,supplier,quantity,price,discount,tax,returnflat,shipdate). If you select orderid as your sort key but your queries filter on shipdate, Redshift will not be operating efficiently. If you have a compound sort key on (orderid, shipdate) and your queries filter only on shipdate, Redshift will not be operating efficiently. If you have an interleaved sort key on (orderid, shipdate) and if your query
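To see why a compound sort key on (orderid, shipdate) does little for a query on shipdate alone, here is a small simulation. It is a sketch under assumptions: blocks hold four rows, shipdate is a plain day number, the shipdate values are deliberately scrambled relative to orderid, and block skipping is modelled as a simple min/max check. `blocks_to_read` is a made-up helper.

```python
def blocks_to_read(rows, sort_key, filter_col, wanted, block_size=4):
    """Count blocks whose min/max range on filter_col could contain `wanted`."""
    ordered = sorted(rows, key=sort_key)
    count = 0
    for i in range(0, len(ordered), block_size):
        values = [row[filter_col] for row in ordered[i:i + block_size]]
        if min(values) <= wanted <= max(values):
            count += 1  # this block cannot be skipped
    return count


# Hypothetical lineitem rows: shipdate is uncorrelated with orderid.
rows = [{"orderid": o, "shipdate": (o * 7) % 16} for o in range(16)]

# Compound sort key (orderid, shipdate), query: WHERE shipdate = 9
compound = blocks_to_read(
    rows, lambda r: (r["orderid"], r["shipdate"]), "shipdate", 9)

# Sort key on shipdate alone, same query
by_shipdate = blocks_to_read(rows, lambda r: r["shipdate"], "shipdate", 9)
```

With the compound key all four blocks have to be read, because orderid dominates the ordering and every block ends up with a wide range of shipdate values. Sorting on shipdate leaves a single candidate block.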
Redshift does not support materialized views, but it easily allows you to create (temporary/permanent) tables by running select queries on existing tables. This duplicates data, but in the format your queries need (similar to a materialized view). The blog post below gives you some information on this approach.
https://www.periscopedata.com/blog/faster-redshift-queries-with-materialized-views-lifetime-daily-arpu.html
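A minimal sketch of that approach follows. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; the CREATE TABLE ... AS SELECT statement is the part that carries over. The table daily_revenue, the helper refresh_daily_revenue and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineitem "
    "(orderid INTEGER, shipdate TEXT, quantity INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?, ?, ?)",
    [
        (1, "2016-01-01", 2, 10.0),
        (2, "2016-01-01", 1, 5.0),
        (3, "2016-01-02", 4, 2.5),
    ],
)


def refresh_daily_revenue(conn):
    """Rebuild the summary table from scratch, e.g. after each batch load."""
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT shipdate, SUM(quantity * price) AS revenue
        FROM lineitem
        GROUP BY shipdate
        """
    )
    conn.commit()


refresh_daily_revenue(conn)
before = conn.execute(
    "SELECT shipdate, revenue FROM daily_revenue ORDER BY shipdate"
).fetchall()

# A new batch arrives; the summary is stale until it is refreshed again.
conn.execute("INSERT INTO lineitem VALUES (4, '2016-01-02', 1, 10.0)")
refresh_daily_revenue(conn)
after = conn.execute(
    "SELECT shipdate, revenue FROM daily_revenue ORDER BY shipdate"
).fetchall()
```

Queries then read daily_revenue instead of aggregating lineitem every time. The cost is that you own the refresh schedule yourself, which is exactly the maintenance work a real materialized view would do for you.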
Redshift fared well against other systems like Hive, Impala, Spark, BQ etc. in one of our recent benchmarks.