Hive random sampling

Date: 2022-10-19 11:11:59

hive> select * from account limit 10;
OK
account.accountname     account.accid   account.platid  account.dateid  account.createtime
1004210 1004210 6       20180116        2018-01-16 10:39:50.0
20946754        20946754        0       20170913        2017-09-13 10:02:37.0
20946766        20946766        0       20170901        2017-09-01 16:51:30.0
20946793        20946793        0       20171117        2017-11-17 16:51:07.0
20946796        20946796        0       20180110        2018-01-10 13:30:46.0
20946962        20946962        0       20171219        2017-12-19 15:43:14.0
20957641        20957641        0       20171117        2017-11-17 17:44:58.0
20957642        20957642        0       20171220        2017-12-20 15:32:21.0
20963649        20963649        6       20171220        2017-12-20 10:13:57.0
20963674        20963674        33      20171219        2017-12-19 22:59:39.0

LIMIT alone can hardly guarantee that rows come back at random; in practice Hive returns them in file order, which defeats the purpose of random sampling.

scala> hivcon.sql("select * from gamedw.account order by rand() limit 10").show
+-----------+--------+------+--------+--------------------+                    
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22780583|22780583|    18|20180117|2018-01-17 07:27:...|
|   22777237|22777237|     6|20180116|2018-01-16 20:52:...|
|   22868751|22868751|     1|20180208|2018-02-08 10:20:...|
|   22993835|22993835|    83|20180327|2018-03-27 19:48:...|
|   22779551|22779551|     9|20180117|2018-01-17 00:33:...|
|   22796970|22796970|     1|20180119|2018-01-19 17:53:...|
|   22713995|22713995|    33|20171227|2017-12-27 21:05:...|
|   22811076|22811076|     6|20180123|2018-01-23 13:16:...|
|   22715294|22715294|    15|20171228|2017-12-28 01:59:...|
|   22843107|22843107|    33|20180129|2018-01-29 20:13:...|
+-----------+--------+------+--------+--------------------+

scala> hivcon.sql("select * from gamedw.account order by rand() limit 10").show
+-----------+--------+------+--------+--------------------+                    
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22809368|22809368|    35|20180122|2018-01-22 20:55:...|
|   22810321|22810321|    18|20180123|2018-01-23 03:42:...|
|   22850089|22850089|    15|20180201|2018-02-01 07:16:...|
|   23040545|23040545|    27|20180415|2018-04-15 10:45:...|
|   21095611|21095611|     3|20180118|2018-01-18 19:02:...|
|   22787794|22787794|     9|20180117|2018-01-17 22:10:...|
|   22701928|22701928|     6|20171226|2017-12-26 14:42:...|
|   23123308|23123308|     6|20180526|2018-05-26 02:02:...|
|   23037031|23037031|    27|20180413|2018-04-13 23:52:...|
|   22798775|22798775|    27|20180120|2018-01-20 00:32:...|
+-----------+--------+------+--------+--------------------+

ORDER BY rand() does produce a genuinely random sample, but performance is poor: to produce a total ordering, Hive must force all of the data through a single reducer, which then sorts the entire dataset. That does not scale.
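Conceptually, ORDER BY rand() LIMIT 10 attaches a fresh random key to every row, fully sorts by that key, and keeps the first rows. A minimal Python simulation of that idea (the toy table is illustrative, not the real account data):

```python
import random

random.seed(2022)

# toy stand-in for the account table
rows = [{"accid": i, "platid": i % 40} for i in range(100_000)]

# ORDER BY rand() LIMIT 10: full sort by a random key, then take the head
sample = sorted(rows, key=lambda _: random.random())[:10]

print([r["accid"] for r in sample])
```

The full sort is the expensive part; on a cluster it is exactly this step that funnels every row through one reducer.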

scala> hivcon.sql("select * from gamedw.account sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22708141|22708141|    15|20171226|2017-12-26 22:13:...|
|   22780855|22780855|    15|20180117|2018-01-17 08:25:...|
|   22781579|22781579|    35|20180117|2018-01-17 09:55:...|
|   22775456|22775456|     1|20180116|2018-01-16 18:35:...|
|   22705922|22705922|    18|20171226|2017-12-26 20:18:...|
|   22773860|22773860|   138|20180116|2018-01-16 16:09:...|
|   22771885|22771885|     4|20180116|2018-01-16 14:05:...|
|   22633182|22633182|     0|20171202|2017-12-02 04:07:...|
|   22228137|22228137|     0|20170828|2017-08-28 17:55:...|
|   22593212|22593212|     0|20171121|2017-11-21 18:38:...|
+-----------+--------+------+--------+--------------------+

scala> hivcon.sql("select * from gamedw.account sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22703713|22703713|    15|20171226|2017-12-26 18:02:...|
|   22766790|22766790|     4|20180116|2018-01-16 10:11:...|
|   22728340|22728340|    27|20171231|2017-12-31 19:59:...|
|   22770064|22770064|     1|20180116|2018-01-16 12:13:...|
|   22705604|22705604|    15|20171226|2017-12-26 19:58:...|
|   22771835|22771835|    21|20180116|2018-01-16 14:01:...|
|   22720562|22720562|    33|20171229|2017-12-29 11:56:...|
|   22779785|22779785|    83|20180117|2018-01-17 01:36:...|
|   22705319|22705319|    33|20171226|2017-12-26 19:46:...|
|   22717315|22717315|    27|20171228|2017-12-28 17:04:...|
+-----------+--------+------+--------+--------------------+

Hive has a non-standard SQL clause, SORT BY, which sorts only within each reducer and gives no ordering guarantee across reducers.

How Hive splits the data across reducers is undefined. It might be genuinely random, it might follow file order, or it might depend on some value in the data. How Hive implements the LIMIT clause across reducers is also undefined: perhaps it drains reducers in order (all of reducer 0, then all of reducer 1, and so on), or perhaps it cycles through them and interleaves everything.

Now suppose the reduce key is derived from a data column while LIMIT drains reducers in order. The sample would then be badly skewed.
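That skew is easy to demonstrate. The sketch below is a simulation of the worst case, not Hive's actual partitioning code: rows are partitioned by a data column, each reducer sorts randomly on its own, and LIMIT drains reducers in order, so the sample is drawn from a single partition:

```python
import random

random.seed(7)

rows = [{"accid": i, "platid": random.choice([0, 1, 6, 33])} for i in range(10_000)]

# bad reduce key: a data column (platid) decides the reducer
num_reducers = 4
reducers = [[] for _ in range(num_reducers)]
for r in rows:
    reducers[r["platid"] % num_reducers].append(r)

# SORT BY rand(): each reducer shuffles its own rows independently
for part in reducers:
    random.shuffle(part)

# LIMIT 10 draining reducers in order: everything comes from reducer 0
sample = [r for part in reducers for r in part][:10]

# reducer 0 only ever held platid == 0, so the "random" sample is pure skew
print({r["platid"] for r in sample})  # -> {0}
```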

The fix is another non-standard Hive feature: DISTRIBUTE BY. For queries whose reduce key is not dictated by the query structure (no GROUP BY, no JOIN), it lets you specify exactly what the reduce key is.

scala> hivcon.sql("select * from gamedw.account distribute by rand() sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+                    
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22808048|22808048|    27|20180122|2018-01-22 14:15:...|
|   22984607|22984607|    60|20180324|2018-03-24 00:56:...|
|   22775548|22775548|    15|20180116|2018-01-16 18:40:...|
|   22386073|22386073|     4|20170914|2017-09-14 22:33:...|
|   22709109|22709109|    18|20171226|2017-12-26 23:06:...|
|   22713147|22713147|     6|20171227|2017-12-27 18:55:...|
|   22847395|22847395|    15|20180131|2018-01-31 09:21:...|
|   22863523|22863523|    50|20180206|2018-02-06 10:06:...|
|   22772443|22772443|     3|20180116|2018-01-16 14:41:...|
|   23063884|23063884|    27|20180424|2018-04-24 19:25:...|
+-----------+--------+------+--------+--------------------+

scala> hivcon.sql("select * from gamedw.account distribute by rand() sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+                    
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22702272|22702272|     6|20171226|2017-12-26 15:32:...|
|   22588521|22588521|     0|20171121|2017-11-21 17:00:...|
|   23045191|23045191|    27|20180416|2018-04-16 22:31:...|
|   22997042|22997042|    83|20180328|2018-03-28 21:17:...|
|   22801335|22801335|    18|20180120|2018-01-20 17:59:...|
|   22789961|22789961|     6|20180118|2018-01-18 09:51:...|
|   22884362|22884362|     3|20180214|2018-02-14 00:01:...|
|   22786025|22786025|     1|20180117|2018-01-17 18:45:...|
|   22769132|22769132|    19|20180116|2018-01-16 11:32:...|
|   23045125|23045125|    27|20180416|2018-04-16 23:12:...|
+-----------+--------+------+--------+--------------------+

As a final optimization, some filtering can be pushed to the map side. If the total size of the table is known, it is easy to set a random threshold condition to pre-filter the data, as below:

scala> hivcon.sql("select * from gamedw.account where rand()<0.01 distribute by rand() sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+                    
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   23120261|23120261|    15|20180524|2018-05-24 12:05:...|
|   22726833|22726833|     4|20171231|2017-12-31 09:53:...|
|   22731460|22731460|     4|20180101|2018-01-01 22:15:...|
|   22379995|22379995|     4|20170913|2017-09-13 10:01:...|
|   22767301|22767301|    21|20180116|2018-01-16 10:25:...|
|   22701994|22701994|    21|20171226|2017-12-26 14:52:...|
|   22792831|22792831|    60|20180118|2018-01-18 20:09:...|
|   23145016|23145016|    15|20180608|2018-06-08 05:04:...|
|   22702216|22702216|    21|20171226|2017-12-26 15:28:...|
|   22858386|22858386|    33|20180204|2018-02-04 10:19:...|
+-----------+--------+------+--------+--------------------+
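The threshold in the query above trades scan cost against sample safety: with rand() < 0.01 each mapper keeps roughly 1% of rows, and DISTRIBUTE BY / SORT BY / LIMIT then picks 10 of the survivors. A rough Python sketch of choosing such a threshold; the 100x oversampling factor is an assumption of this sketch, not a Hive rule:

```python
import random

random.seed(42)

total_rows = 1_000_000  # assumed known table size
target = 10

# oversample ~100x past the target so the random filter almost never
# leaves fewer than `target` rows behind
threshold = 100 * target / total_rows  # 0.001

survivors = [i for i in range(total_rows) if random.random() < threshold]
sample = random.sample(survivors, target)

print(len(survivors), len(sample))
```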

Sampling lets one estimate and infer the properties of a whole population from the data drawn, and it is an economical, effective working and research method widely used in scientific experiments, quality inspection, and social surveys. It becomes especially important when the data volume is so large that processing the full dataset is impractical.

Hive supports bucket-table sampling and block sampling.

The syntax of bucket-table sampling is:

   table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

The TABLESAMPLE clause lets a query sample the data instead of reading the whole table. It appears in the FROM clause and can be applied to any table. Buckets are numbered starting from 1. colname names the column to sample on; it may be any non-partition column, or rand() to sample over entire rows rather than an individual column. Rows bucketed on colname are randomly distributed into y buckets, and the rows belonging to bucket x are returned.
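The bucket assignment can be sketched in a few lines of Python; here a plain modulo stands in for Hive's hash function (for integer columns Hive's hash is effectively the value itself):

```python
# simulate TABLESAMPLE(BUCKET x OUT OF y ON col): hash each row's column
# value into one of y buckets (numbered from 1) and keep only bucket x
def bucket_sample(rows, x, y, key):
    return [r for r in rows if key(r) % y == x - 1]

accounts = [{"accid": i} for i in range(100)]

# BUCKET 2 OUT OF 10 ON accid
sample = bucket_sample(accounts, 2, 10, key=lambda r: r["accid"])
print([r["accid"] for r in sample])  # -> [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
```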

The following example returns the rows falling in bucket 2 out of 3000:

hive> select * from gamedw.account TABLESAMPLE(BUCKET 2 OUT OF 3000 on rand()) s;
OK
s.accountname   s.accid s.platid        s.dateid        s.createtime
22701274        22701274        33      20171226        2017-12-26 13:17:31.0
22702663        22702663        33      20171226        2017-12-26 16:30:12.0
22714480        22714480        27      20171227        2017-12-27 22:20:26.0
22729211        22729211        4       20180101        2018-01-01 00:37:46.0
22772179        22772179        18      20180116        2018-01-16 14:23:18.0
22774547        22774547        4       20180116        2018-01-16 17:12:40.0
22779739        22779739        1       20180117        2018-01-17 01:20:23.0
22787395        22787395        4       20180117        2018-01-17 21:27:29.0
22789721        22789721        15      20180118        2018-01-18 08:27:52.0
22793281        22793281        35      20180118        2018-01-18 21:14:48.0
22795002        22795002        15      20180119        2018-01-19 10:01:58.0
22797054        22797054        3       20180119        2018-01-19 18:12:43.0
22804855        22804855        138     20180121        2018-01-21 15:38:33.0
22813246        22813246        1       20180124        2018-01-24 00:09:43.0
22877117        22877117        3       20180211        2018-02-11 13:33:44.0
23045597        23045597        21      20180417        2018-04-17 04:02:17.0
23060405        23060405        15      20180423        2018-04-23 07:07:09.0
23075002        23075002        191     20180430        2018-04-30 17:24:44.0
23076553        23076553        158     20180501        2018-05-01 15:45:16.0
23085391        23085391        18      20180506        2018-05-06 10:53:59.0
Time taken: 0.938 seconds, Fetched: 20 row(s)

Normally TABLESAMPLE scans the entire table and then draws the sample, which is clearly not very efficient.

The alternative: since CLUSTERED BY names the bucketing columns at table-creation time, if the column given in the TABLESAMPLE clause matches the one in CLUSTERED BY, TABLESAMPLE scans only the buckets the query requires.

Also, unlike rand(), which returns different rows on every execution, sampling on the declared bucketing column returns the same rows every time.

TABLESAMPLE(BUCKET 3 OUT OF 16 ON accid): since the account table was created with CLUSTERED BY accid INTO 32 BUCKETS, this clause returns the rows of clusters 3 and 19, because each sample bucket consists of 32/16 = 2 clusters. Why 3 and 19? The sample asks for bucket 3, each sample bucket is built from 2 of the original clusters, and under the simple modulo hash those are clusters 3 and 19 (3 % 16 == 19 % 16).
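The bucket-to-cluster mapping can be computed directly; a small illustrative Python helper (not Hive source code):

```python
def sampled_clusters(x, y, total_buckets):
    """Clusters (numbered from 1) whose rows fall into sample bucket x of y.

    Assumes total_buckets is a multiple of y, so each sample bucket is
    built from total_buckets // y of the original clusters.
    """
    assert total_buckets % y == 0
    return [c for c in range(1, total_buckets + 1) if c % y == x % y]

# BUCKET 3 OUT OF 16 on a table clustered into 32 buckets
print(sampled_clusters(3, 16, 32))  # -> [3, 19]
```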

hive> select * from gamedw.account TABLESAMPLE(BUCKET 5 OUT OF 3000 on accid) s;
OK
s.accountname   s.accid s.platid        s.dateid        s.createtime
21078004        21078004        3       20180116        2018-01-16 20:21:32.0
22380004        22380004        4       20170913        2017-09-13 10:02:13.0
22386004        22386004        27      20170914        2017-09-14 22:15:23.0
22392004        22392004        4       20170917        2017-09-17 00:36:06.0
22671004        22671004        0       20171215        2017-12-15 11:58:57.0
22701004        22701004        33      20171226        2017-12-26 12:49:32.0
22704004        22704004        4       20171226        2017-12-26 18:22:16.0
22707004        22707004        15      20171226        2017-12-26 21:20:56.0
22710004        22710004        15      20171227        2017-12-27 01:45:43.0
22722004        22722004        27      20171229        2017-12-29 19:49:04.0
22767004        22767004        1       20180116        2018-01-16 10:17:55.0
22770004        22770004        19      20180116        2018-01-16 12:10:56.0
22773004        22773004        138     20180116        2018-01-16 15:11:39.0
22779004        22779004        3       20180116        2018-01-16 23:19:48.0
22782004        22782004        18      20180117        2018-01-17 10:38:42.0
22803004        22803004        1       20180120        2018-01-20 23:59:56.0
22812004        22812004        6       20180123        2018-01-23 18:36:55.0
22821004        22821004        21      20180124        2018-01-24 20:38:20.0
22830004        22830004        4       20180126        2018-01-26 03:44:23.0
22887004        22887004        15      20180214        2018-02-14 23:16:07.0
22911004        22911004        83      20180223        2018-02-23 18:11:57.0
22971004        22971004        1       20180318        2018-03-18 17:01:03.0
22977004        22977004        19      20180320        2018-03-20 22:22:38.0
23043004        23043004        27      20180416        2018-04-16 09:23:30.0
Time taken: 0.926 seconds, Fetched: 24 row(s)

Using block sampling. TABLESAMPLE(n percent) picks at least n percent of the input size, TABLESAMPLE with a byte count (e.g. 1k) picks at least that many bytes, and TABLESAMPLE(n rows) takes the first n rows from each input split; for the size-based forms the sampling granularity is the HDFS block:

hive> select * from gamedw.account TABLESAMPLE(0.01 percent) s;
OK
s.accountname   s.accid s.platid        s.dateid        s.createtime
1004210 1004210 6       20180116        2018-01-16 10:39:50.0
20946754        20946754        0       20170913        2017-09-13 10:02:37.0
20946766        20946766        0       20170901        2017-09-01 16:51:30.0
20946793        20946793        0       20171117        2017-11-17 16:51:07.0
20946796        20946796        0       20180110        2018-01-10 13:30:46.0
20946962        20946962        0       20171219        2017-12-19 15:43:14.0
20957641        20957641        0       20171117        2017-11-17 17:44:58.0
20957642        20957642        0       20171220        2017-12-20 15:32:21.0
Time taken: 0.873 seconds, Fetched: 8 row(s)

hive> select * from gamedw.account TABLESAMPLE(1k) s;
OK
s.accountname   s.accid s.platid        s.dateid        s.createtime
1004210 1004210 6       20180116        2018-01-16 10:39:50.0
20946754        20946754        0       20170913        2017-09-13 10:02:37.0
20946766        20946766        0       20170901        2017-09-01 16:51:30.0
20946793        20946793        0       20171117        2017-11-17 16:51:07.0
20946796        20946796        0       20180110        2018-01-10 13:30:46.0
20946962        20946962        0       20171219        2017-12-19 15:43:14.0
20957641        20957641        0       20171117        2017-11-17 17:44:58.0
20957642        20957642        0       20171220        2017-12-20 15:32:21.0
20963649        20963649        6       20171220        2017-12-20 10:13:57.0
20963674        20963674        33      20171219        2017-12-19 22:59:39.0
20964032        20964032        15      20171221        2017-12-21 17:04:03.0
20964042        20964042        33      20171221        2017-12-21 20:06:33.0
20964052        20964052        3       20171221        2017-12-21 14:17:28.0
20964128        20964128        0       20171123        2017-11-23 15:52:38.0
20964153        20964153        3       20170906        2017-09-06 16:39:35.0
20964307        20964307        18      20180101        2018-01-01 17:10:39.0
20964461        20964461        18      20180116        2018-01-16 13:14:22.0
20964769        20964769        0       20170825        2017-08-25 17:03:35.0
20965027        20965027        18      20180116        2018-01-16 11:16:07.0
20965038        20965038        33      20171226        2017-12-26 20:02:26.0
Time taken: 0.588 seconds, Fetched: 20 row(s)

hive> select * from gamedw.account TABLESAMPLE(30 rows) s;
OK
s.accountname   s.accid s.platid        s.dateid        s.createtime
1004210 1004210 6       20180116        2018-01-16 10:39:50.0
20946754        20946754        0       20170913        2017-09-13 10:02:37.0
20946766        20946766        0       20170901        2017-09-01 16:51:30.0
20946793        20946793        0       20171117        2017-11-17 16:51:07.0
20946796        20946796        0       20180110        2018-01-10 13:30:46.0
20946962        20946962        0       20171219        2017-12-19 15:43:14.0
20957641        20957641        0       20171117        2017-11-17 17:44:58.0
20957642        20957642        0       20171220        2017-12-20 15:32:21.0
20963649        20963649        6       20171220        2017-12-20 10:13:57.0
20963674        20963674        33      20171219        2017-12-19 22:59:39.0
20964032        20964032        15      20171221        2017-12-21 17:04:03.0
20964042        20964042        33      20171221        2017-12-21 20:06:33.0
20964052        20964052        3       20171221        2017-12-21 14:17:28.0
20964128        20964128        0       20171123        2017-11-23 15:52:38.0
20964153        20964153        3       20170906        2017-09-06 16:39:35.0
20964307        20964307        18      20180101        2018-01-01 17:10:39.0
20964461        20964461        18      20180116        2018-01-16 13:14:22.0
20964769        20964769        0       20170825        2017-08-25 17:03:35.0
20965027        20965027        18      20180116        2018-01-16 11:16:07.0
20965038        20965038        33      20171226        2017-12-26 20:02:26.0
20965124        20965124        18      20171226        2017-12-26 11:10:57.0
20965187        20965187        33      20171226        2017-12-26 22:35:32.0
20965282        20965282        33      20171226        2017-12-26 11:50:23.0
20965301        20965301        33      20180121        2018-01-21 17:34:58.0
20965326        20965326        33      20171228        2017-12-28 10:22:44.0
20965646        20965646        18      20180119        2018-01-19 21:13:33.0
20965650        20965650        18      20180116        2018-01-16 11:08:33.0
20965815        20965815        18      20180116        2018-01-16 13:59:43.0
20965938        20965938        18      20180116        2018-01-16 10:01:19.0
20966100        20966100        18      20180122        2018-01-22 14:25:31.0
Time taken: 0.898 seconds, Fetched: 30 row(s)
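For the size-based forms, Hive takes whole blocks until the requested fraction of bytes is covered, which is why small tables come back in full. A rough Python sketch of that block-granularity idea (a simulation, not Hive's block-selection code):

```python
def sample_blocks(block_sizes, percent):
    """Pick whole blocks, in order, until >= `percent` of total bytes is covered."""
    goal = sum(block_sizes) * percent / 100
    taken, covered = [], 0
    for i, size in enumerate(block_sizes):
        if covered >= goal:
            break
        taken.append(i)
        covered += size
    return taken

# ten 128-byte "blocks": asking for 25 percent takes blocks 0-2, because
# block granularity overshoots the exact fraction
print(sample_blocks([128] * 10, 25))  # -> [0, 1, 2]
```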