hive查询表数据总数(大数据之Hivegroupingsets分组查询)(1)

group by

sql 查询时,我们常将聚合函数和group by 结合起来对某一个或多个字段进行分组查询,例如:

select addcode,count(distinct sbtid)uv from tb_hive_window group by addcode; ---------- ----- | addcode | uv | ---------- ----- | 0002 | 2 | | 000201 | 1 | | 000202 | 1 | | 000205 | 1 | | 000206 | 1 | | 000208 | 2 | ---------- -----

group by fields ... grouping sets ()

有时候因业务需要,我们需要group by 多个字段,例如:

select addcode,count(distinct sbtid)uv from tb_hive_window group by addcode; select rscode,count(distinct sbtid)uv from tb_hive_window group by rscode; select addcode,rscode,count(distinct sbtid)uv from tb_hive_window group by addcode,rscode; //这种情况我们需要写3条sql 语句

通过grouping sets (), 我们可以通过一条sql 完成,相当于是将上面三条语句执行的结果通过union all 组合起来。

select addcode,rscode,count(*)pv,count(distinct sbtid)uv,grouping__id from tb_hive_window group by addcode,rscode grouping sets ((addcode,rscode),(addcode),(rscode)); ---------- --------- ----- ----- --------------- | addcode | rscode | pv | uv | grouping__id | ---------- --------- ----- ----- --------------- | NULL | 34 | 1 | 1 | 2 | | NULL | 35 | 5 | 3 | 2 | | NULL | 40 | 1 | 1 | 2 | | NULL | 65 | 1 | 1 | 2 | | NULL | 351 | 1 | 1 | 2 | | NULL | 352 | 1 | 1 | 2 | | NULL | 395 | 1 | 1 | 2 | | 0002 | NULL | 2 | 2 | 1 | | 0002 | 34 | 1 | 1 | 3 | | 0002 | 352 | 1 | 1 | 3 | | 000201 | NULL | 1 | 1 | 1 | | 000201 | 35 | 1 | 1 | 3 | | 000202 | NULL | 1 | 1 | 1 | | 000202 | 35 | 1 | 1 | 3 | | 000205 | NULL | 3 | 1 | 1 | | 000205 | 35 | 2 | 1 | 3 | | 000205 | 395 | 1 | 1 | 3 | | 000206 | NULL | 2 | 1 | 1 | | 000206 | 40 | 1 | 1 | 3 | | 000206 | 65 | 1 | 1 | 3 | | 000208 | NULL | 2 | 2 | 1 | | 000208 | 35 | 1 | 1 | 3 | | 000208 | 351 | 1 | 1 | 3 | ---------- --------- ----- ----- ---------------

注意,

grouping by 后面的字段排序处理成二进制数,靠近group by 的是低位,远离group by 的是高位。查询出的每一行结果中,如果有统计此列即此列值不为null,二进制位用1表示,否则用0表示。

---------- --------- ----- ----- --------------- | addcode | rscode | pv | uv | grouping__id | ---------- --------- ----- ----- --------------- | NULL | 34 | 1 | 1 | 2 | --->> 转换成二进制数:01 -->2 (注意远离group by 的字段是高位) | 0002 | NULL | 2 | 2 | 1 |--->> 转换成二进制数:10 -->1 | 0002 | 34 | 1 | 1 | 3 | --->> 转换成二进制数:11 -->1 2=3 | 000201 | NULL | 1 | 1 | 1 | | 000201 | 35 | 1 | 1 | 3 |

注意:此处开源版hive 的grouping__id 计算方式与华为云的MRS 集群hive的grouping__id计算方式不同。mrs 中的grouping__id 计算时是以靠近group by 的字段为高位,远离的为低位,并且字段值为null 时 二进制位为1,否则为0

| 20200114 | 2020011406 | fengmizhibonew | NULL | NULL | NULL | NULL | NULL | NULL | 836 | 613 | 63

group by fields ... with cube

通过group by fields ... with cube 可是让hive 实现所有组合维度的查询,例如

select a,b,c,d,count(e) pv,count(distinct e) uv, grouping__id from tb_test group by a,b,c,d with cube;

根据排列组合计算,最终的组合有 C(4,1) C(4,2) C(4,3) C(4,4) 1 = 4 6 4 1 1 = 16 种组合,如果是通过单条group by 来实现,需要写16 条sql 才能完成,因此这种方案极大地降低了程序复杂度。

select addcode,rscode,count(*)pv,count(distinct sbtid)uv,grouping__id from tb_hive_window group by addcode,rscode with cube; ---------- --------- ----- ----- --------------- | addcode | rscode | pv | uv | grouping__id | ---------- --------- ----- ----- --------------- | NULL | NULL | 11 | 6 | 0 | | NULL | 34 | 1 | 1 | 2 | | NULL | 35 | 5 | 3 | 2 | | NULL | 40 | 1 | 1 | 2 | | NULL | 65 | 1 | 1 | 2 | | NULL | 351 | 1 | 1 | 2 | | NULL | 352 | 1 | 1 | 2 | | NULL | 395 | 1 | 1 | 2 | | 0002 | NULL | 2 | 2 | 1 | | 0002 | 34 | 1 | 1 | 3 | | 0002 | 352 | 1 | 1 | 3 | | 000201 | NULL | 1 | 1 | 1 | | 000201 | 35 | 1 | 1 | 3 | | 000202 | NULL | 1 | 1 | 1 | | 000202 | 35 | 1 | 1 | 3 | | 000205 | NULL | 3 | 1 | 1 | | 000205 | 35 | 2 | 1 | 3 | | 000205 | 395 | 1 | 1 | 3 | | 000206 | NULL | 2 | 1 | 1 | | 000206 | 40 | 1 | 1 | 3 | | 000206 | 65 | 1 | 1 | 3 | | 000208 | NULL | 2 | 2 | 1 | | 000208 | 35 | 1 | 1 | 3 | | 000208 | 351 | 1 | 1 | 3 | ---------- --------- ----- ----- ---------------

group by fields ... with rollup

rollup 是cube 的子集,通过group by fields ... with rollup 可以实现以左侧维度为准,计算某一层次维度的聚合

select a,b,c,d,count(e) pv,count(distinct e) uv, grouping__id from tb_test group by a,b,c,d with rollup; //等效于 select a,b,c,d,count(e) pv,count(distinct e) uv, grouping__id from tb_test group by a,b,c,d grouping sets ((a,b,c,d),(a,b,c),(a,b),(a),());

select addcode,rscode,count(*)pv,count(distinct sbtid)uv,grouping__id from tb_hive_window group by addcode,rscode with rollup; ---------- --------- ----- ----- --------------- | addcode | rscode | pv | uv | grouping__id | ---------- --------- ----- ----- --------------- | NULL | NULL | 11 | 6 | 0 | | 0002 | NULL | 2 | 2 | 1 | | 0002 | 34 | 1 | 1 | 3 | | 0002 | 352 | 1 | 1 | 3 | | 000201 | NULL | 1 | 1 | 1 | | 000201 | 35 | 1 | 1 | 3 | | 000202 | NULL | 1 | 1 | 1 | | 000202 | 35 | 1 | 1 | 3 | | 000205 | NULL | 3 | 1 | 1 | | 000205 | 35 | 2 | 1 | 3 | | 000205 | 395 | 1 | 1 | 3 | | 000206 | NULL | 2 | 1 | 1 | | 000206 | 40 | 1 | 1 | 3 | | 000206 | 65 | 1 | 1 | 3 | | 000208 | NULL | 2 | 2 | 1 | | 000208 | 35 | 1 | 1 | 3 | | 000208 | 351 | 1 | 1 | 3 | ---------- --------- ----- ----- ---------------

//通过grouping sets 实现得到同样的结果 select addcode,rscode,count(*)pv,count(distinct sbtid)uv,grouping__id from tb_hive_window group by addcode,rscode grouping sets((addcode,rscode),(addcode),()); ---------- --------- ----- ----- --------------- | addcode | rscode | pv | uv | grouping__id | ---------- --------- ----- ----- --------------- | NULL | NULL | 11 | 6 | 0 | | 0002 | NULL | 2 | 2 | 1 | | 0002 | 34 | 1 | 1 | 3 | | 0002 | 352 | 1 | 1 | 3 | | 000201 | NULL | 1 | 1 | 1 | | 000201 | 35 | 1 | 1 | 3 | | 000202 | NULL | 1 | 1 | 1 | | 000202 | 35 | 1 | 1 | 3 | | 000205 | NULL | 3 | 1 | 1 | | 000205 | 35 | 2 | 1 | 3 | | 000205 | 395 | 1 | 1 | 3 | | 000206 | NULL | 2 | 1 | 1 | | 000206 | 40 | 1 | 1 | 3 | | 000206 | 65 | 1 | 1 | 3 | | 000208 | NULL | 2 | 2 | 1 | | 000208 | 35 | 1 | 1 | 3 | | 000208 | 351 | 1 | 1 | 3 | ---------- --------- ----- ----- ---------------

,