hive> select * from account limit 10;
OK
account.accountname account.accid account.platid account.dateid account.createtime
1004210 1004210 6 20180116 2018-01-16 10:39:50.0
20946754 20946754 0 20170913 2017-09-13 10:02:37.0
20946766 20946766 0 20170901 2017-09-01 16:51:30.0
20946793 20946793 0 20171117 2017-11-17 16:51:07.0
20946796 20946796 0 20180110 2018-01-10 13:30:46.0
20946962 20946962 0 20171219 2017-12-19 15:43:14.0
20957641 20957641 0 20171117 2017-11-17 17:44:58.0
20957642 20957642 0 20171220 2017-12-20 15:32:21.0
20963649 20963649 6 20171220 2017-12-20 10:13:57.0
20963674 20963674 33 20171219 2017-12-19 22:59:39.0
Using limit alone can hardly guarantee that rows are returned at random; in practice rows usually come back in file order, so this does not achieve the goal of random sampling.
scala> hivcon.sql("select * from gamedw.account order by rand() limit 10").show
+-----------+--------+------+--------+--------------------+
|accountname| accid|platid| dateid| createtime|
+-----------+--------+------+--------+--------------------+
| 22780583|22780583| 18|20180117|2018-01-17 07:27:...|
| 22777237|22777237| 6|20180116|2018-01-16 20:52:...|
| 22868751|22868751| 1|20180208|2018-02-08 10:20:...|
| 22993835|22993835| 83|20180327|2018-03-27 19:48:...|
| 22779551|22779551| 9|20180117|2018-01-17 00:33:...|
| 22796970|22796970| 1|20180119|2018-01-19 17:53:...|
| 22713995|22713995| 33|20171227|2017-12-27 21:05:...|
| 22811076|22811076| 6|20180123|2018-01-23 13:16:...|
| 22715294|22715294| 15|20171228|2017-12-28 01:59:...|
| 22843107|22843107| 33|20180129|2018-01-29 20:13:...|
+-----------+--------+------+--------+--------------------+
scala> hivcon.sql("select * from gamedw.account order by rand() limit 10").show
+-----------+--------+------+--------+--------------------+
|accountname| accid|platid| dateid| createtime|
+-----------+--------+------+--------+--------------------+
| 22809368|22809368| 35|20180122|2018-01-22 20:55:...|
| 22810321|22810321| 18|20180123|2018-01-23 03:42:...|
| 22850089|22850089| 15|20180201|2018-02-01 07:16:...|
| 23040545|23040545| 27|20180415|2018-04-15 10:45:...|
| 21095611|21095611| 3|20180118|2018-01-18 19:02:...|
| 22787794|22787794| 9|20180117|2018-01-17 22:10:...|
| 22701928|22701928| 6|20171226|2017-12-26 14:42:...|
| 23123308|23123308| 6|20180526|2018-05-26 02:02:...|
| 23037031|23037031| 27|20180413|2018-04-13 23:52:...|
| 22798775|22798775| 27|20180120|2018-01-20 00:32:...|
+-----------+--------+------+--------+--------------------+
Using order by rand() does achieve genuine randomness, but performance is poor: to produce a total ordering, Hive must force all of the data through a single reducer, which then sorts the entire data set. That is bad.
scala> hivcon.sql("select * from gamedw.account sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+
|accountname| accid|platid| dateid| createtime|
+-----------+--------+------+--------+--------------------+
| 22708141|22708141| 15|20171226|2017-12-26 22:13:...|
| 22780855|22780855| 15|20180117|2018-01-17 08:25:...|
| 22781579|22781579| 35|20180117|2018-01-17 09:55:...|
| 22775456|22775456| 1|20180116|2018-01-16 18:35:...|
| 22705922|22705922| 18|20171226|2017-12-26 20:18:...|
| 22773860|22773860| 138|20180116|2018-01-16 16:09:...|
| 22771885|22771885| 4|20180116|2018-01-16 14:05:...|
| 22633182|22633182| 0|20171202|2017-12-02 04:07:...|
| 22228137|22228137| 0|20170828|2017-08-28 17:55:...|
| 22593212|22593212| 0|20171121|2017-11-21 18:38:...|
+-----------+--------+------+--------+--------------------+
scala> hivcon.sql("select * from gamedw.account sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+
|accountname| accid|platid| dateid| createtime|
+-----------+--------+------+--------+--------------------+
| 22703713|22703713| 15|20171226|2017-12-26 18:02:...|
| 22766790|22766790| 4|20180116|2018-01-16 10:11:...|
| 22728340|22728340| 27|20171231|2017-12-31 19:59:...|
| 22770064|22770064| 1|20180116|2018-01-16 12:13:...|
| 22705604|22705604| 15|20171226|2017-12-26 19:58:...|
| 22771835|22771835| 21|20180116|2018-01-16 14:01:...|
| 22720562|22720562| 33|20171229|2017-12-29 11:56:...|
| 22779785|22779785| 83|20180117|2018-01-17 01:36:...|
| 22705319|22705319| 33|20171226|2017-12-26 19:46:...|
| 22717315|22717315| 27|20171228|2017-12-28 17:04:...|
+-----------+--------+------+--------+--------------------+
Hive has a non-standard SQL "sort by" clause that sorts only within each reducer and makes no guarantee about ordering across reducers.
How Hive splits data across reducers is undefined. It might be genuinely random, it might be based on file order, or it might be based on some value in the data. How Hive implements the limit clause across reducers is equally undefined: perhaps it pulls data from the reducers in order (all of reducer 0's data, then all of reducer 1's, and so on), or perhaps it cycles through them and mixes everything together.
Suppose the reduce key is derived from a data column, and the limit clause takes reducers in order. The sample would then be badly skewed.
The solution is another non-standard Hive feature: "distribute by". For queries whose reduce key is not determined by the query's structure (no "group by", no join), you can specify exactly what the reduce key should be.
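The skew risk above can be illustrated with a small standalone simulation (not Spark code; the dataset, reducer count, and key functions are made up for illustration). Rows are routed to reducers by a key, and a limit is taken in reducer order:

```python
import random

# A hypothetical mini-dataset standing in for gamedw.account: (accid, platid)
# pairs, where platid plays the role of the data column that accidentally
# becomes the reduce key.
random.seed(42)
rows = [(i, i % 4) for i in range(1000)]

def sample(rows, key_fn, n_reducers=4, limit=100):
    """Route each row to reducer key_fn(row) % n_reducers, then take `limit`
    rows starting from reducer 0 -- mimicking a limit applied in reducer order."""
    reducers = [[] for _ in range(n_reducers)]
    for r in rows:
        reducers[key_fn(r) % n_reducers].append(r)
    flat = [r for red in reducers for r in red]
    return flat[:limit]

# Reduce key derived from a data column: every sampled row has platid == 0.
skewed = sample(rows, key_fn=lambda r: r[1])
print({p for _, p in skewed})        # {0}

# Reduce key derived from rand(): all platid values show up in the sample.
mixed = sample(rows, key_fn=lambda r: random.randrange(4))
print(len({p for _, p in mixed}))    # 4
```

This is exactly why "distribute by rand()" is paired with "sort by rand()": the random distribution key ensures no reducer's contents are correlated with any data column.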
scala> hivcon.sql("select * from gamedw.account distribute by rand() sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+
|accountname| accid|platid| dateid| createtime|
+-----------+--------+------+--------+--------------------+
| 22808048|22808048| 27|20180122|2018-01-22 14:15:...|
| 22984607|22984607| 60|20180324|2018-03-24 00:56:...|
| 22775548|22775548| 15|20180116|2018-01-16 18:40:...|
| 22386073|22386073| 4|20170914|2017-09-14 22:33:...|
| 22709109|22709109| 18|20171226|2017-12-26 23:06:...|
| 22713147|22713147| 6|20171227|2017-12-27 18:55:...|
| 22847395|22847395| 15|20180131|2018-01-31 09:21:...|
| 22863523|22863523| 50|20180206|2018-02-06 10:06:...|
| 22772443|22772443| 3|20180116|2018-01-16 14:41:...|
| 23063884|23063884| 27|20180424|2018-04-24 19:25:...|
+-----------+--------+------+--------+--------------------+
scala> hivcon.sql("select * from gamedw.account distribute by rand() sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+
|accountname| accid|platid| dateid| createtime|
+-----------+--------+------+--------+--------------------+
| 22702272|22702272| 6|20171226|2017-12-26 15:32:...|
| 22588521|22588521| 0|20171121|2017-11-21 17:00:...|
| 23045191|23045191| 27|20180416|2018-04-16 22:31:...|
| 22997042|22997042| 83|20180328|2018-03-28 21:17:...|
| 22801335|22801335| 18|20180120|2018-01-20 17:59:...|
| 22789961|22789961| 6|20180118|2018-01-18 09:51:...|
| 22884362|22884362| 3|20180214|2018-02-14 00:01:...|
| 22786025|22786025| 1|20180117|2018-01-17 18:45:...|
| 22769132|22769132| 19|20180116|2018-01-16 11:32:...|
| 23045125|23045125| 27|20180416|2018-04-16 23:12:...|
+-----------+--------+------+--------+--------------------+
As a final optimization, some filtering can be done on the map side. If the table's total size is known, it is easy to add a random-threshold condition to pre-filter the data, as follows:
scala> hivcon.sql("select * from gamedw.account where rand()<0.01 distribute by rand() sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+
|accountname| accid|platid| dateid| createtime|
+-----------+--------+------+--------+--------------------+
| 23120261|23120261| 15|20180524|2018-05-24 12:05:...|
| 22726833|22726833| 4|20171231|2017-12-31 09:53:...|
| 22731460|22731460| 4|20180101|2018-01-01 22:15:...|
| 22379995|22379995| 4|20170913|2017-09-13 10:01:...|
| 22767301|22767301| 21|20180116|2018-01-16 10:25:...|
| 22701994|22701994| 21|20171226|2017-12-26 14:52:...|
| 22792831|22792831| 60|20180118|2018-01-18 20:09:...|
| 23145016|23145016| 15|20180608|2018-06-08 05:04:...|
| 22702216|22702216| 21|20171226|2017-12-26 15:28:...|
| 22858386|22858386| 33|20180204|2018-02-04 10:19:...|
+-----------+--------+------+--------+--------------------+
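The arithmetic behind the threshold trick can be sketched as follows: with rand() < p, roughly p * N rows survive the map-side filter, so p only has to be chosen so that the expected survivor count comfortably exceeds the limit (a standalone simulation, not Spark code; the table size N is an assumption):

```python
import random

random.seed(7)
N = 1_000_000   # hypothetical row count for gamedw.account
p = 0.01        # the threshold used in `where rand() < 0.01`
limit = 10

# Simulate the map-side filter: each row survives independently with prob. p.
survivors = sum(1 for _ in range(N) if random.random() < p)
print(survivors)  # close to p * N = 10_000

# The filter is safe as long as the expected survivor count far exceeds the
# limit; here the margin is about three orders of magnitude.
assert survivors > 100 * limit
```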
Sampling estimates and infers properties of the whole from the data drawn, and is an economical and effective method widely used in scientific experiments, quality inspection, and social surveys. It becomes especially important when the data set is so large that processing it in full is impractical.
Hive supports bucket table sampling and block sampling.
The syntax for bucket table sampling is:
table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
The TABLESAMPLE clause lets a query sample the data instead of reading the whole table. It appears in the FROM clause and can be applied to any table. Bucket numbers start at 1. colname names the column to sample on and can be any non-partition column; alternatively, rand() samples over entire rows rather than an individual column. Rows bucketed on colname fall randomly into buckets 1 through y, and the rows belonging to bucket x are returned.
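A simplified model of that bucket assignment is hash-then-modulo (the real implementation uses Hive's own hash function; the key_fn here is an assumption for illustration):

```python
def tablesample(rows, x, y, key_fn):
    """Toy model of TABLESAMPLE(BUCKET x OUT OF y ON col): each row lands in
    bucket (key_fn(row) % y) + 1, and only rows in bucket x are returned."""
    return [r for r in rows if key_fn(r) % y + 1 == x]

accids = list(range(1, 101))
# Identity "hash" on integers keeps the example deterministic.
picked = tablesample(accids, x=2, y=10, key_fn=lambda a: a)
print(picked)   # [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
```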
The following example returns the rows of bucket 2 out of 3000 buckets:
hive> select * from gamedw.account TABLESAMPLE(BUCKET 2 OUT OF 3000 on rand()) s;
OK
s.accountname s.accid s.platid s.dateid s.createtime
22701274 22701274 33 20171226 2017-12-26 13:17:31.0
22702663 22702663 33 20171226 2017-12-26 16:30:12.0
22714480 22714480 27 20171227 2017-12-27 22:20:26.0
22729211 22729211 4 20180101 2018-01-01 00:37:46.0
22772179 22772179 18 20180116 2018-01-16 14:23:18.0
22774547 22774547 4 20180116 2018-01-16 17:12:40.0
22779739 22779739 1 20180117 2018-01-17 01:20:23.0
22787395 22787395 4 20180117 2018-01-17 21:27:29.0
22789721 22789721 15 20180118 2018-01-18 08:27:52.0
22793281 22793281 35 20180118 2018-01-18 21:14:48.0
22795002 22795002 15 20180119 2018-01-19 10:01:58.0
22797054 22797054 3 20180119 2018-01-19 18:12:43.0
22804855 22804855 138 20180121 2018-01-21 15:38:33.0
22813246 22813246 1 20180124 2018-01-24 00:09:43.0
22877117 22877117 3 20180211 2018-02-11 13:33:44.0
23045597 23045597 21 20180417 2018-04-17 04:02:17.0
23060405 23060405 15 20180423 2018-04-23 07:07:09.0
23075002 23075002 191 20180430 2018-04-30 17:24:44.0
23076553 23076553 158 20180501 2018-05-01 15:45:16.0
23085391 23085391 18 20180506 2018-05-06 10:53:59.0
Time taken: 0.938 seconds, Fetched: 20 row(s)
Normally TABLESAMPLE scans the entire table and then draws the sample, which is clearly not very efficient.
The alternative: since the bucketing columns are fixed with CLUSTERED BY at table creation, if the column named in the TABLESAMPLE clause matches the CLUSTERED BY column, TABLESAMPLE only needs to scan the relevant portion of the table.
Also, unlike rand(), which returns different rows on every execution, sampling on the bucketing column returns the same rows each time.
For example, if the account table was created with CLUSTERED BY (accid) INTO 32 BUCKETS, then TABLESAMPLE(BUCKET 3 OUT OF 16 ON accid) returns the rows of the 3rd and 19th clusters, because each sampled bucket is composed of 32/16 = 2 clusters. Why 3 and 19? The query asks for the 3rd bucket, and each bucket merges two of the original clusters: by simple modular hashing, clusters 3 and 19 fall into the same bucket (3 % 16 = 19 % 16).
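That cluster-to-bucket arithmetic can be checked in a few lines (1-based cluster numbering assumed):

```python
def clusters_for_bucket(x, y, num_clusters):
    """For a table clustered into num_clusters buckets, TABLESAMPLE(BUCKET x
    OUT OF y) draws from every cluster c with c % y == x % y."""
    return [c for c in range(1, num_clusters + 1) if c % y == x % y]

# BUCKET 3 OUT OF 16 on a 32-bucket table -> clusters 3 and 19.
print(clusters_for_bucket(3, 16, 32))   # [3, 19]

# When x OUT OF y matches the table's bucket count, exactly one cluster is read.
print(clusters_for_bucket(5, 32, 32))   # [5]
```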
hive> select * from gamedw.account TABLESAMPLE(BUCKET 5 OUT OF 3000 on accid) s;
OK
s.accountname s.accid s.platid s.dateid s.createtime
21078004 21078004 3 20180116 2018-01-16 20:21:32.0
22380004 22380004 4 20170913 2017-09-13 10:02:13.0
22386004 22386004 27 20170914 2017-09-14 22:15:23.0
22392004 22392004 4 20170917 2017-09-17 00:36:06.0
22671004 22671004 0 20171215 2017-12-15 11:58:57.0
22701004 22701004 33 20171226 2017-12-26 12:49:32.0
22704004 22704004 4 20171226 2017-12-26 18:22:16.0
22707004 22707004 15 20171226 2017-12-26 21:20:56.0
22710004 22710004 15 20171227 2017-12-27 01:45:43.0
22722004 22722004 27 20171229 2017-12-29 19:49:04.0
22767004 22767004 1 20180116 2018-01-16 10:17:55.0
22770004 22770004 19 20180116 2018-01-16 12:10:56.0
22773004 22773004 138 20180116 2018-01-16 15:11:39.0
22779004 22779004 3 20180116 2018-01-16 23:19:48.0
22782004 22782004 18 20180117 2018-01-17 10:38:42.0
22803004 22803004 1 20180120 2018-01-20 23:59:56.0
22812004 22812004 6 20180123 2018-01-23 18:36:55.0
22821004 22821004 21 20180124 2018-01-24 20:38:20.0
22830004 22830004 4 20180126 2018-01-26 03:44:23.0
22887004 22887004 15 20180214 2018-02-14 23:16:07.0
22911004 22911004 83 20180223 2018-02-23 18:11:57.0
22971004 22971004 1 20180318 2018-03-18 17:01:03.0
22977004 22977004 19 20180320 2018-03-20 22:22:38.0
23043004 23043004 27 20180416 2018-04-16 09:23:30.0
Time taken: 0.926 seconds, Fetched: 24 row(s)
Using block sampling:
hive> select * from gamedw.account TABLESAMPLE(0.01 percent) s;
OK
s.accountname s.accid s.platid s.dateid s.createtime
1004210 1004210 6 20180116 2018-01-16 10:39:50.0
20946754 20946754 0 20170913 2017-09-13 10:02:37.0
20946766 20946766 0 20170901 2017-09-01 16:51:30.0
20946793 20946793 0 20171117 2017-11-17 16:51:07.0
20946796 20946796 0 20180110 2018-01-10 13:30:46.0
20946962 20946962 0 20171219 2017-12-19 15:43:14.0
20957641 20957641 0 20171117 2017-11-17 17:44:58.0
20957642 20957642 0 20171220 2017-12-20 15:32:21.0
Time taken: 0.873 seconds, Fetched: 8 row(s)
hive> select * from gamedw.account TABLESAMPLE(1k) s;
OK
s.accountname s.accid s.platid s.dateid s.createtime
1004210 1004210 6 20180116 2018-01-16 10:39:50.0
20946754 20946754 0 20170913 2017-09-13 10:02:37.0
20946766 20946766 0 20170901 2017-09-01 16:51:30.0
20946793 20946793 0 20171117 2017-11-17 16:51:07.0
20946796 20946796 0 20180110 2018-01-10 13:30:46.0
20946962 20946962 0 20171219 2017-12-19 15:43:14.0
20957641 20957641 0 20171117 2017-11-17 17:44:58.0
20957642 20957642 0 20171220 2017-12-20 15:32:21.0
20963649 20963649 6 20171220 2017-12-20 10:13:57.0
20963674 20963674 33 20171219 2017-12-19 22:59:39.0
20964032 20964032 15 20171221 2017-12-21 17:04:03.0
20964042 20964042 33 20171221 2017-12-21 20:06:33.0
20964052 20964052 3 20171221 2017-12-21 14:17:28.0
20964128 20964128 0 20171123 2017-11-23 15:52:38.0
20964153 20964153 3 20170906 2017-09-06 16:39:35.0
20964307 20964307 18 20180101 2018-01-01 17:10:39.0
20964461 20964461 18 20180116 2018-01-16 13:14:22.0
20964769 20964769 0 20170825 2017-08-25 17:03:35.0
20965027 20965027 18 20180116 2018-01-16 11:16:07.0
20965038 20965038 33 20171226 2017-12-26 20:02:26.0
Time taken: 0.588 seconds, Fetched: 20 row(s)
hive> select * from gamedw.account TABLESAMPLE(30 rows) s;
OK
s.accountname s.accid s.platid s.dateid s.createtime
1004210 1004210 6 20180116 2018-01-16 10:39:50.0
20946754 20946754 0 20170913 2017-09-13 10:02:37.0
20946766 20946766 0 20170901 2017-09-01 16:51:30.0
20946793 20946793 0 20171117 2017-11-17 16:51:07.0
20946796 20946796 0 20180110 2018-01-10 13:30:46.0
20946962 20946962 0 20171219 2017-12-19 15:43:14.0
20957641 20957641 0 20171117 2017-11-17 17:44:58.0
20957642 20957642 0 20171220 2017-12-20 15:32:21.0
20963649 20963649 6 20171220 2017-12-20 10:13:57.0
20963674 20963674 33 20171219 2017-12-19 22:59:39.0
20964032 20964032 15 20171221 2017-12-21 17:04:03.0
20964042 20964042 33 20171221 2017-12-21 20:06:33.0
20964052 20964052 3 20171221 2017-12-21 14:17:28.0
20964128 20964128 0 20171123 2017-11-23 15:52:38.0
20964153 20964153 3 20170906 2017-09-06 16:39:35.0
20964307 20964307 18 20180101 2018-01-01 17:10:39.0
20964461 20964461 18 20180116 2018-01-16 13:14:22.0
20964769 20964769 0 20170825 2017-08-25 17:03:35.0
20965027 20965027 18 20180116 2018-01-16 11:16:07.0
20965038 20965038 33 20171226 2017-12-26 20:02:26.0
20965124 20965124 18 20171226 2017-12-26 11:10:57.0
20965187 20965187 33 20171226 2017-12-26 22:35:32.0
20965282 20965282 33 20171226 2017-12-26 11:50:23.0
20965301 20965301 33 20180121 2018-01-21 17:34:58.0
20965326 20965326 33 20171228 2017-12-28 10:22:44.0
20965646 20965646 18 20180119 2018-01-19 21:13:33.0
20965650 20965650 18 20180116 2018-01-16 11:08:33.0
20965815 20965815 18 20180116 2018-01-16 13:59:43.0
20965938 20965938 18 20180116 2018-01-16 10:01:19.0
20966100 20966100 18 20180122 2018-01-22 14:25:31.0
Time taken: 0.898 seconds, Fetched: 30 row(s)