hive窗口函数(二)

序列函数:NTILE,ROW_NUMBER,RANK,DENSE_RANK

数据准备

创建数据表

1
2
3
4
5
6
CREATE EXTERNAL TABLE supertest2(
cookieid string,
createtime string, --day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ;

插入数据

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie1','2019-07-16',4) 
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie1','2019-07-15',4)
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie1','2019-07-14',2)
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie1','2019-07-13',3)
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie1','2019-07-12',7)
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie1','2019-07-11',5)
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie1','2019-07-10',1)

INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie2','2019-07-14',3)
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie2','2019-07-16',7)
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie2','2019-07-15',9)
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie2','2019-07-13',6)
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie2','2019-07-12',5)
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie2','2019-07-11',3)
INSERT INTO supertest2(cookieid,createtime,pv) VALUES('cookie2','2019-07-10',2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
select * from supertest2;
cookieid createtime pv
1 cookie1 2019-07-12 7
2 cookie2 2019-07-11 3
3 cookie2 2019-07-14 3
4 cookie1 2019-07-14 2
5 cookie2 2019-07-10 2
6 cookie1 2019-07-11 5
7 cookie1 2019-07-16 4
8 cookie2 2019-07-15 9
9 cookie2 2019-07-16 7
10 cookie2 2019-07-13 6
11 cookie2 2019-07-12 5
12 cookie1 2019-07-13 3
13 cookie1 2019-07-15 4
14 cookie1 2019-07-10 2

NTILE

NTILE(n):用于将分组数据按照顺序切分成n片
如果切片不均匀,默认增加第一个切片的分布

1
2
3
4
5
6
7
8
9
10
11
12
SELECT 
cookieid,
createtime,
pv,
--分组内将数据分成2片
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,
--分组内将数据分成3片
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,
--将所有数据分成4片
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM supertest2
ORDER BY cookieid,createtime;

结果如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
cookieid	createtime	pv	rn1	rn2	rn3
1 cookie1 2019-07-10 1 1 1 1
2 cookie1 2019-07-11 5 1 1 1
3 cookie1 2019-07-12 7 1 1 2
4 cookie1 2019-07-13 3 1 2 2
5 cookie1 2019-07-14 2 2 2 3
6 cookie1 2019-07-15 4 2 3 3
7 cookie1 2019-07-16 4 2 3 4
8 cookie2 2019-07-10 2 1 1 1
9 cookie2 2019-07-11 3 1 1 1
10 cookie2 2019-07-12 5 1 1 2
11 cookie2 2019-07-13 6 1 2 3
12 cookie2 2019-07-14 3 2 2 3
13 cookie2 2019-07-15 9 2 3 4
14 cookie2 2019-07-16 7 2 3 4

ROW_NUMBER()

ROW_NUMBER():从1开始,按照顺序,生成分组内记录的序列

1
2
3
4
5
6
SELECT 
cookieid,
createtime,
pv,
row_number() OVER(PARTITION BY cookieid ORDER BY createtime) as rn
from supertest2

结果如下,按照cookie分成两组,每组按照createtime进行排序,然后进行编号:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
cookieid	createtime	pv	rn
1 cookie2 2019-07-10 2 1
2 cookie2 2019-07-11 3 2
3 cookie2 2019-07-12 5 3
4 cookie2 2019-07-13 6 4
5 cookie2 2019-07-14 3 5
6 cookie2 2019-07-15 9 6
7 cookie2 2019-07-16 7 7
8 cookie1 2019-07-10 1 1
9 cookie1 2019-07-11 5 2
10 cookie1 2019-07-12 7 3
11 cookie1 2019-07-13 3 4
12 cookie1 2019-07-14 2 5
13 cookie1 2019-07-15 4 6
14 cookie1 2019-07-16 4 7

RANK 和 DENSE_RANK

RANK():生成数据项在分组中的排名,排名相等会在名次中留下空位
DENSE_RANK():生成数据项在分组中的排名,排名相等在名次中不会留下空位

1
2
3
4
5
6
7
8
9
SELECT 
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM supertest2
WHERE cookieid = 'cookie1';
1
2
3
4
5
6
7
8
cookieid	createtime	pv	rn1	rn2	rn3
1 cookie1 2019-07-12 7 1 1 1
2 cookie1 2019-07-11 5 2 2 2
3 cookie1 2019-07-16 4 3 3 3
4 cookie1 2019-07-15 4 3 3 4
5 cookie1 2019-07-13 3 5 4 5
6 cookie1 2019-07-14 2 6 5 6
7 cookie1 2019-07-10 1 7 6 7

在rn1中,第三行和第四行值相等均排第三,第五行序列为第五
在rn2中,第三行和第四行值相等均排第三,第五行序列为第四

0%