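1. Introduction to Zstandard
Zstandard (or Zstd) is a lossless data compression algorithm developed by Yann Collet at Facebook. It is designed to deliver a compression ratio comparable to DEFLATE (.zip, gzip) with faster compression and decompression. In the benchmarks published on its project page (/facebook/zstd), Zstandard compares favorably with algorithms such as snappy and lzo: at level -1 it reaches a noticeably higher compression ratio at throughput in the same range.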
| Compressor name | Ratio | Compression | Decompression |
| --- | --- | --- | --- |
| zstd 1.4.5 -1 | 2.884 | 500 MB/s | 1660 MB/s |
| zlib 1.2.11 -1 | 2.743 | 90 MB/s | 400 MB/s |
| brotli 1.0.7 -0 | 2.703 | 400 MB/s | 450 MB/s |
| zstd 1.4.5 --fast=1 | 2.434 | 570 MB/s | 2200 MB/s |
| zstd 1.4.5 --fast=3 | 2.312 | 640 MB/s | 2300 MB/s |
| quicklz 1.5.0 -1 | 2.238 | 560 MB/s | 710 MB/s |
| zstd 1.4.5 --fast=5 | 2.178 | 700 MB/s | 2420 MB/s |
| lzo1x 2.10 -1 | 2.106 | 690 MB/s | 820 MB/s |
| lz4 1.9.2 | 2.101 | 740 MB/s | 4530 MB/s |
| zstd 1.4.5 --fast=7 | 2.096 | 750 MB/s | 2480 MB/s |
| lzf 3.6 -1 | 2.077 | 410 MB/s | 860 MB/s |
| snappy 1.1.8 | 2.073 | 560 MB/s | 1790 MB/s |
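Zstd can trade compression ratio for speed through the --fast parameter: the higher the speed, the lower the compression ratio. In Hive 3.1.1, ORC uses zlib as its default compression algorithm (specified by a parameter in the OrcConf class), while Parquet is uncompressed by default. At its highest-ratio setting in the table (zstd -1), Zstd compresses 5.56 times as fast as zlib (500 MB/s versus 90 MB/s) and decompresses 4.15 times as fast (1660 MB/s versus 400 MB/s). If Hive's ORC and Parquet formats used Zstd by default, the time spent decompressing data when map tasks read their input, and the time spent compressing data in the reduce phase, could therefore drop substantially, improving Hive's overall performance.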
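The two speedup figures quoted above follow directly from the benchmark table. As a quick check, the sketch below recomputes them; the dictionary layout and the `speedup` helper are illustrative names, not part of any Zstd or Hive API.

```python
# Throughput figures (MB/s) copied from the benchmark table, level -1 rows.
zstd = {"compress": 500, "decompress": 1660}   # zstd 1.4.5 -1
zlib = {"compress": 90, "decompress": 400}     # zlib 1.2.11 -1


def speedup(fast, slow):
    """Return how many times faster `fast` is than `slow`, rounded to 2 places."""
    return round(fast / slow, 2)


compress_speedup = speedup(zstd["compress"], zlib["compress"])
decompress_speedup = speedup(zstd["decompress"], zlib["decompress"])

print(f"compression: {compress_speedup}x, decompression: {decompress_speedup}x")
# prints "compression: 5.56x, decompression: 4.15x"
```

These ratios describe raw codec throughput on the benchmark corpus only; the end-to-end gain inside a Hive job will depend on how much of the job's time is actually spent in compression and decompression.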
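2. Enabling Zstd compression in Hadoop
HADOOP-13578 (/jira/browse/HADOOP-13578) added a native Zstd compression library to Hadoop 3, which depends on Facebook's Zstd library. The steps for enabling the Zstd native library when building Hadoop are as follows:
1. Download, build, and install the Zstd library.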
    wget /facebook/zstd/releases/download/v1.4.4/zstd-1.4.
    tar -xzf zstd-1.4.
    cd zstd-1.4.4
    make && make install
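2. Zstd support is not enabled by default when building Hadoop 3; the relevant options must be passed as Maven parameters.
    mvn clean package -=/usr/local/lib -=true
The first parameter (set to /usr/local/lib) points to the locally installed zstd library, and the second (set to true) turns on the zstd build. If the local zstd library cannot be found, the build fails.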
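Because the build fails when the zstd library is missing, it can be worth checking for it before starting a long Maven run. The following is a minimal sketch, assuming the library was installed under /usr/local/lib as in the step above; `find_zstd` is a hypothetical helper name, and the check only looks for files named `libzstd*` and asks the system linker for "zstd".

```python
import ctypes.util
import glob
import os


def find_zstd(prefix="/usr/local/lib"):
    """Look for an installed zstd library.

    Returns a tuple (files, linker_name):
      files        sorted list of libzstd* files found directly under `prefix`
      linker_name  name the system linker resolves for "zstd", or None
    """
    files = sorted(glob.glob(os.path.join(prefix, "libzstd*")))
    linker_name = ctypes.util.find_library("zstd")
    return files, linker_name


files, linker_name = find_zstd()
if files or linker_name:
    print("zstd library found:", files or linker_name)
else:
    print("zstd library not found; the Hadoop native build would likely fail")
```

This only confirms that a library file is present. It does not verify that the version is the one Hadoop expects or that the Maven build will pick it up.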
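3. Setting ZSTD as the default compression algorithm for Hive ORC
ORC-363 (/jira/browse/ORC-363) added the Zstandard compression codec, which is available from ORC 1.6. Hive 3.1.1 uses orc-1.5.1, so ORC must be upgraded to orc-1.6.3 (note that Hive does not currently support orc-1.6).
There are two ways to set the compression algorithm for the ORC format in Hive: 1. add the property ""="ZSTD" to TBLPROPERTIES when creating the table; 2. set the Hive parameter (hive.exec.orc.default.compress, as in the configuration below) to ZSTD. The first approach has to be repeated for every table, whereas the second applies to Hive globally and is more convenient. Adding the following to the Hive configuration therefore enables ZSTD compression for ORC.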
```xml
<property>
  <name>hive.exec.orc.default.compress</name>
  <value>ZSTD</value>
  <description>Allowed values in orc-1.6.0: NONE, ZLIB, SNAPPY, LZO, LZ4, ZSTD</description>
</property>
```
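A misspelled codec name in this property is easy to miss, so a small sanity check can help before deploying the file. The sketch below parses a Hadoop-style configuration fragment and verifies that the ORC default codec is one of the values listed in the description above. The `<configuration>` wrapper, the `SAMPLE` text, and the helper name `read_properties` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Codec names listed in the <description> of the property above (orc-1.6.0).
ALLOWED_CODECS = {"NONE", "ZLIB", "SNAPPY", "LZO", "LZ4", "ZSTD"}

# Illustrative fragment; a real file would contain many more properties.
SAMPLE = """
<configuration>
  <property>
    <name>hive.exec.orc.default.compress</name>
    <value>ZSTD</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Parse <property> elements into a {name: value} dict, trimming whitespace."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


props = read_properties(SAMPLE)
codec = props.get("hive.exec.orc.default.compress")
if codec not in ALLOWED_CODECS:
    raise ValueError(f"unexpected ORC codec: {codec!r}")
print("ORC default codec:", codec)
```

Trimming whitespace around names and values is a deliberate choice here, since stray spaces inside `<name>` or `<value>` are a common copy-and-paste problem in these files.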
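4. Setting ZSTD as the default compression algorithm for Hive Parquet
Hive Parquet is uncompressed by default. There are two ways to change the compression algorithm:
1. Set the property ""="zstd" in TBLPROPERTIES;
2. Set the following Hadoop parameters to specify the Parquet compression algorithm: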
```xml
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.ZStandardCodec</value>
</property>
```