Impalarow_number()使⽤
1、需求:到每个分组中pv最⼤的第⼀条数据,取10万条;
row_number 格式:ROW_NUMBER() OVER (partition BY COLUMN_A ORDER BY COLUMN_B ASC/DESC) rn
其中:
partition by:类似hive的建表,分区的意思;COLUMN_A 是分组字段 order by :排序,默认是升序,加desc降序;COLUMN_B 是排序字段
注解:其实这个排序具体理解如下:
1、按照sql中的group 操作之后会得到很多的分类;
2、这个分类不变,按照 partition by 后的字段COLUMN_A 作为ROW_NUMBER要分组字段,如果字段相同,则会产⽣1到N的⾏号;
3 、orderby 的作⽤就是按照那个字段值来排序,以此产⽣是从到⼤到⼩还是从⼩到⼤的排序;
本⽂查询⽤例:
select md5(stt.msisdn),stt.os_code from (select ro.msisdn,ro.imei,ro.pv,ro.os_code,ROW_NUMBER() OVER(PARTITION BY ro.msisdn ORDER BY ro.pv d
***********************************************************************************
[evercloud17:21000] > select md5(stt.msisdn),stt.os_code from (select ro.msisdn,ro.imei,ro.pv,ro.os_code,ROW_NUMBER() OVER(PARTITION BY ro.msisdn OR Query: select md5(stt.msisdn),stt.os_code from (select ro.msisdn,ro.imei,ro.pv,ro.os_code,ROW_NUMBER() OVER(PARTITION BY ro.msisdn ORDER BY ro.pv d +----------------------------------+---------+
| default.md5(stt.msisdn) | os_code |
+----------------------------------+---------+
| f2f15925d59d38675612bcf70dc114e9 | 1 |
| 38b63fee0f74c0ffdf3572fd4a9d5e64 | 1 |
| 94b03cf77619a0e9e0309ad1c0a1a46b | 1 |
| a5ff2002635605e8554ee324f7051932 | 3 |
| d7a83461d3c12ebb7b1ed17eb5f64dd2 | 1 |
| f037d0028f8da6d83695ec6e54229db2 | 1 |
| d2c955d29caba4bba41f280f0aee88b7 | 1 |
| 448f4cc33a538a2c6c9fc003f34ca569 | 1 |
| 337c229108d579fea706b57b3212d690 | 1 |
| 39c2352f194d5b2cf077f8708d4a6665 | 1 |
+----------------------------------+---------+
Fetched 10 row(s) in 3.54s
2.1 聚合函数直接写:
select label,ecid,eid,findedname,ppv from (select ap.id,st.eid,ap.findedname,sum(pv) ppv,ROW_NUMBER() OVER(PARTITION BY ap.label OR
其中rownum 的order by 字段⽤聚合函数:就是⽤的分组后聚合---按照原来的分组聚合,不是按照COLUMN_A 分组聚合;
2.2 为了测试正确性--分开写,先统计后再⽤row_number;
select * from (select label,ecid,eid,findedname,ppv,ROW_NUMBER() OVER(PARTITION BY label ORDER BY ppv desc) AS rn from( select ap.id
2.3、测试结果:
(1)、整体写的结果:
Query: select label,ecid,eid,findedname,ppv from (select ap.id,st.eid,ap.findedname,sum(pv) ppv,ROW_NUMBER() OVER(PARTITION BY ap.la +----------+------+------+-----------------------------------------------+-------------+----+
| label | ecid | eid | findedname | ppv | rn |
+----------+------+------+-----------------------------------------------+-------------+----+
| ⾦融理财 | 9 | 3 | ⽀付宝 | 1274317627 | 1 |
| ⾦融理财 | 9 | 6 | 财付通 | 258357291 | 2 |
| ⾦融理财 | 21 | 8108 | 同花顺炒股票 | 242589336 | 3 |
| ⾦融理财 | 9 | 5 | 招商银⾏ | 203848183 | 4 |
| ⾦融理财 | 16 | 7128 | 平安普惠 | 178164508 | 5 |
| ⾦融理财 | 16 | 3 | 同花顺股票开户 | 142253306 | 6 |
| ⾦融理财 | 16 | 51 | 财富 | 131791277 | 7 |
| ⾦融理财 | 21 | 7512 | 百度钱包 | 104820340 | 8 |
| ⾦融理财 | 9 | 9 | 中国建设银⾏ | 77952913 | 9 |
| ⾦融理财 | 9 | 12 | 农⾏掌上银⾏ | 74952353 | 10 |
(2)、分开写结果:
Query: select * from (select label,ecid,eid,findedname,ppv,ROW_NUMBER() OVER(PARTITION BY label ORDER BY ppv desc) AS rn from( select ap.labe +----------+------+------+-----------------------------------------------+-------------+----+
| label | ecid | eid | findedname | ppv | rn |
+----------+------+------+-----------------------------------------------+-------------+----+
| ⾦融理财 | 9 | 3 | ⽀付宝 | 1274317627 | 1 |
| ⾦融理财 | 9 | 6 | 财付通 | 258357291 | 2 |
| ⾦融理财 | 21 | 8108 | 同花顺炒股票 | 242589336 | 3 |
| ⾦融理财 | 9 | 5 | 招商银⾏ | 203848183 | 4 |
| ⾦融理财 | 16 | 7128 | 平安普惠 | 178164508 | 5 |
| ⾦融理财 | 16 | 3 | 同花顺股票开户 | 142253306 | 6 |
| ⾦融理财 | 16 | 51 | 财富 | 131791277 | 7 |
| ⾦融理财 | 21 | 7512 | 百度钱包 | 104820340 | 8 |
| ⾦融理财 | 9 | 9 | 中国建设银⾏ | 77952913 | 9 |
| ⾦融理财 | 9 | 12 | 农⾏掌上银⾏ | 74952353 | 10 |
结论:两种写法均正确;