hive窗口函数(三)

分析函数:LAG,LEAD,FIRST_VALUE,LAST_VALUE

数据准备

创建数据表

1
2
3
4
5
6
CREATE EXTERNAL TABLE supertest3 (
cookieid STRING,
createtime STRING,
url STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

插入数据

1
2
3
4
5
6
7
8
9
10
11
12
INSERT INTO supertest3 VALUES('cookie1','2019-07-10 10:00:02','url1')
INSERT INTO supertest3 VALUES('cookie2','2019-07-10 10:32:13','url5')
INSERT INTO supertest3 VALUES('cookie2','2019-07-10 10:23:55','url4')
INSERT INTO supertest3 VALUES('cookie2','2019-07-10 10:13:05','url3')
INSERT INTO supertest3 VALUES('cookie2','2019-07-10 10:10:45','url2')
INSERT INTO supertest3 VALUES('cookie2','2019-07-10 10:10:25','url1')
INSERT INTO supertest3 VALUES('cookie2','2019-07-10 10:00:02','url11')
INSERT INTO supertest3 VALUES('cookie1','2019-07-10 10:00:02','url11')
INSERT INTO supertest3 VALUES('cookie1','2019-07-10 11:00:05','url7')
INSERT INTO supertest3 VALUES('cookie1','2019-07-10 10:50:05','url6')
INSERT INTO supertest3 VALUES('cookie1','2019-07-10 10:06:39','url3')
INSERT INTO supertest3 VALUES('cookie1','2019-07-10 10:02:31','url2')
1
2
3
4
5
6
7
8
9
10
11
12
13
14
select * from supertest3
>> cookieid createtime url
1 cookie2 2019-07-10 10:00:02 url11
2 cookie1 2019-07-10 11:00:05 url7
3 cookie2 2019-07-10 10:13:05 url3
4 cookie2 2019-07-10 10:32:13 url5
5 cookie1 2019-07-10 10:06:39 url3
6 cookie1 2019-07-10 10:02:31 url2
7 cookie1 2019-07-10 10:00:02 url11
8 cookie1 2019-07-10 10:00:02 url1
9 cookie2 2019-07-10 10:10:45 url2
10 cookie2 2019-07-10 10:10:25 url1
11 cookie1 2019-07-10 10:50:05 url6
12 cookie2 2019-07-10 10:23:55 url4

LAG

LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

1
2
3
4
5
SELECT *,
row_number() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
lag(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) as time1,
lag(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) as time2
FROM supertest3
1
2
3
4
5
6
7
8
9
10
11
12
13
 	cookieid	createtime	url	rn	time1	time2
1 cookie2 2019-07-10 10:00:02 url11 1 1970-01-01 00:00:00 NULL
2 cookie2 2019-07-10 10:10:25 url1 2 2019-07-10 10:00:02 NULL
3 cookie2 2019-07-10 10:10:45 url2 3 2019-07-10 10:10:25 2019-07-10 10:00:02
4 cookie2 2019-07-10 10:13:05 url3 4 2019-07-10 10:10:45 2019-07-10 10:10:25
5 cookie2 2019-07-10 10:23:55 url4 5 2019-07-10 10:13:05 2019-07-10 10:10:45
6 cookie2 2019-07-10 10:32:13 url5 6 2019-07-10 10:23:55 2019-07-10 10:13:05
7 cookie1 2019-07-10 10:00:02 url1 1 1970-01-01 00:00:00 NULL
8 cookie1 2019-07-10 10:00:02 url11 2 2019-07-10 10:00:02 NULL
9 cookie1 2019-07-10 10:02:31 url2 3 2019-07-10 10:00:02 2019-07-10 10:00:02
10 cookie1 2019-07-10 10:06:39 url3 4 2019-07-10 10:02:31 2019-07-10 10:00:02
11 cookie1 2019-07-10 10:50:05 url6 5 2019-07-10 10:06:39 2019-07-10 10:02:31
12 cookie1 2019-07-10 11:00:05 url7 6 2019-07-10 10:50:05 2019-07-10 10:06:39

LEAD

与LAG相反,LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

1
2
3
4
5
SELECT *,
row_number() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
lead(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) as time1,
lead(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) as time2
FROM supertest3
1
2
3
4
5
6
7
8
9
10
11
12
13
 	cookieid	createtime	url	rn	time1	time2
1 cookie2 2019-07-10 10:00:02 url11 1 2019-07-10 10:10:25 2019-07-10 10:10:45
2 cookie2 2019-07-10 10:10:25 url1 2 2019-07-10 10:10:45 2019-07-10 10:13:05
3 cookie2 2019-07-10 10:10:45 url2 3 2019-07-10 10:13:05 2019-07-10 10:23:55
4 cookie2 2019-07-10 10:13:05 url3 4 2019-07-10 10:23:55 2019-07-10 10:32:13
5 cookie2 2019-07-10 10:23:55 url4 5 2019-07-10 10:32:13 NULL
6 cookie2 2019-07-10 10:32:13 url5 6 1970-01-01 00:00:00 NULL
7 cookie1 2019-07-10 10:00:02 url1 1 2019-07-10 10:00:02 2019-07-10 10:02:31
8 cookie1 2019-07-10 10:00:02 url11 2 2019-07-10 10:02:31 2019-07-10 10:06:39
9 cookie1 2019-07-10 10:02:31 url2 3 2019-07-10 10:06:39 2019-07-10 10:50:05
10 cookie1 2019-07-10 10:06:39 url3 4 2019-07-10 10:50:05 2019-07-10 11:00:05
11 cookie1 2019-07-10 10:50:05 url6 5 2019-07-10 11:00:05 NULL
12 cookie1 2019-07-10 11:00:05 url7 6 1970-01-01 00:00:00 NULL

FIRST_VALUE

取分组内排序后,截止到当前行,第一个值

1
2
3
4
SELECT *,
row_number() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
first_value(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM supertest3
1
2
3
4
5
6
7
8
9
10
11
12
13
 	cookieid	createtime	url	rn	first1
1 cookie1 2019-07-10 10:00:02 url1 1 url1
2 cookie1 2019-07-10 10:00:02 url11 2 url1
3 cookie1 2019-07-10 10:02:31 url2 3 url1
4 cookie1 2019-07-10 10:06:39 url3 4 url1
5 cookie1 2019-07-10 10:50:05 url6 5 url1
6 cookie1 2019-07-10 11:00:05 url7 6 url1
7 cookie2 2019-07-10 10:00:02 url11 1 url11
8 cookie2 2019-07-10 10:10:25 url1 2 url11
9 cookie2 2019-07-10 10:10:45 url2 3 url11
10 cookie2 2019-07-10 10:13:05 url3 4 url11
11 cookie2 2019-07-10 10:23:55 url4 5 url11
12 cookie2 2019-07-10 10:32:13 url5 6 url11

LAST_VALUE

取分组内排序后,截止到当前行,最后一个值

1
2
3
4
SELECT *,
row_number() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
last_value(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM supertest3
1
2
3
4
5
6
7
8
9
10
11
12
13
 	cookieid	createtime	url	rn	first1
1 cookie1 2019-07-10 10:00:02 url11 1 url11
2 cookie1 2019-07-10 10:00:02 url1 2 url1
3 cookie1 2019-07-10 10:02:31 url2 3 url2
4 cookie1 2019-07-10 10:06:39 url3 4 url3
5 cookie1 2019-07-10 10:50:05 url6 5 url6
6 cookie1 2019-07-10 11:00:05 url7 6 url7
7 cookie2 2019-07-10 10:00:02 url11 1 url11
8 cookie2 2019-07-10 10:10:25 url1 2 url1
9 cookie2 2019-07-10 10:10:45 url2 3 url2
10 cookie2 2019-07-10 10:13:05 url3 4 url3
11 cookie2 2019-07-10 10:23:55 url4 5 url4
12 cookie2 2019-07-10 10:32:13 url5 6 url5
0%