精选——Hive十道面试题(上)

Hive 专栏收录该内容
21 篇文章 0 订阅

1.现有这么一批数据,现要求出:每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数

三个字段的意思:
用户名uid,月份month,访问次数count

A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,11

最后结果展示:

用户	月份		最大访问次数	总访问次数		当月访问次数
A	2015-01		33			33		  33
A	2015-02		33			43		  10
A	2015-03		38			81		  38
B	2015-01		30			30		  30
B	2015-02		30			45                15
B	2015-03		44			89		  44

建表与导入数据

create table if not exists interview01(userId string,month string,count int) row format delimited fields terminated by ",";
load data local inpath "/home/hadoop/hive_data/interview01" into table interview01;

辅助SQL:

先按userid和month分组,求出总次数counts

select userid,month,sum(count)counts 
from interview01 
group by userid,month

最终SQL:

使用窗口函数求截止到当月的最大访问次数,直接从上表里面查

select userid,month,
max(counts) over(distribute by userid sort by month)maxcount,
sum(counts) over(distribute by userid sort by month)sumcount,counts
from 
(select userid,month,sum(count)counts 
from interview01 
group by userid,month)tmp;

2.求:所有数学课程成绩 大于 语文课程成绩的学生的学号

// 建表语句:
CREATE TABLE `course` (
  `id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `sid` int(11) DEFAULT NULL,
  `course` varchar(255) DEFAULT NULL,
  `score` int(11) DEFAULT NULL 
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

// 插入数据
// 字段解释:id, 学号, 课程, 成绩
INSERT INTO `course` VALUES (1, 1, 'yuwen', 43);
INSERT INTO `course` VALUES (2, 1, 'shuxue', 55);
INSERT INTO `course` VALUES (3, 2, 'yuwen', 77);
INSERT INTO `course` VALUES (4, 2, 'shuxue', 88);
INSERT INTO `course` VALUES (5, 3, 'yuwen', 98);
INSERT INTO `course` VALUES (6, 3, 'shuxue', 65);

辅助SQL:

这题可以看做为一到mysql题,因为每个字段都只有一个成绩,所以这里我们要对没有的字段进行补0操作

select
sid,case course
when "yuwen" then score else 0 end yuwen,
case course when "shuxue" then score else 0 end shuxue
from course

然后我们从上表中操作语文和数学的最大值,然后作为一个表,直接进行查询

select
sid,shuxue,yuwen
from
(select
sid,sum(yuwen) yuwen,max(shuxue) shuxue
from
(select
sid,case course
when "yuwen" then score else 0 end yuwen,
case course when "shuxue" then score else 0 end shuxue
from course) a
group by sid) b where b.shuxue>b.yuwen;

结果

+------+--------+-------+
| sid  | shuxue | yuwen |
+------+--------+-------+
|    1 |     55 |    43 |
|    2 |     88 |    77 |
+------+--------+-------+

3.比如:2010012325表示在2010年01月23日的气温为25度。现在要求使用hive,
计算每一年出现过的最大气温的日期+温度。

2014010114
2014010216
2014010317
2014010410
2014010506
2012010609
2012010732
2012010812
2012010919
2012011023
2001010116
2001010212
2001010310
2001010411
2001010529
2013010619
2013010722
2013010812
2013010929
2013011023
2008010105
2008010216
2008010337
2008010414
2008010516
2007010619
2007010712
2007010812
2007010999
2007011023
2010010114
2010010216
2010010317
2010010410
2010010506
2015010649
2015010722
2015010812
2015010999
2015011023

建表与导入数据

create table if not exists interview03(data string);
load data local inpath "/home/hadoop/hive_data/interview03" into table interview03;

辅助SQL:

先求每一年的最大温度

select
substr(data,1,4) year,max(cast(substr(data,-2,2) as int))
max_temp
from interview03
group by substr(data,1,4)

然后将上表与原表连接,连接条件是年份与气温符合的,然后再找出那天是哪一天

select
a.year year,a.max_temp max_temp,substr(b.data,1,8) day
from
(select
substr(data,1,4) year,max(cast(substr(data,-2,2) as int))
max_temp
from interview03
group by substr(data,1,4)
)a join
interview03 b on
a.year=substr(b.data,1,4) and a.max_temp=cast(substr(b.data,-2,2) as int);

结果

year    max_temp        day
2001    29      20010105
2007    99      20070109
2008    37      20080103
2010    17      20100103
2012    32      20120107
2013    29      20130109
2014    17      20140103
2015    99      20150109

4.现有一份以下格式的数据:表示有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门

id course 
1,a 
1,b 
1,c 
1,e 
2,a 
2,c 
2,d 
2,f 
3,a 
3,b 
3,c 
3,e

编写Hive的HQL语句来实现以下结果:
表中的1表示选修,表中的0表示未选修

id	a	b	c	d	e	f
1	1	1	1	0	1	0
2	1	0	1	1	0	1
3	1	1	1	0	1	0

建表与导入数据

create table if not exists interview04(id int,course string) row format delimited fields terminated by ",";

load data local inpath "/home/hadoop/hive_data/interview04" into table interview04;

首先得到所有备选课程

select collect_set(trim(course))courses from interview04;

然后求每一个学员选修的课程

select
id,collect_set(trim(course)) id_course	
from interview04
group by id;

这里要使用笛卡尔积关联(a表中的每一天数据关联b表所有数据),从原表中查询course字段放入set集合中,然后与上表关联,得到已选课程与备选课程

首先修改笛卡尔积参数

set hive.strict.checks.cartesian.product=false;
set hive.mapred.mode=nonstrict;

辅助SQL:

select
b.id id,b.id_course id_course,a.courses courses
from
(select collect_set(trim(course))courses from interview04
) a join
(select
id,collect_set(trim(course)) id_course	
from interview04
group by id
) b;
id      id_course       courses
1       ["a","b","c","e"]       ["a","b","c","e","d","f"]
2       ["a","c","d","f"]       ["a","b","c","e","d","f"]
3       ["a","b","c","e"]       ["a","b","c","e","d","f"]

最终SQL:

判断a表与b表中元素的包含关系即可

select
id,
if(array_contains(id_course,courses[0]),1,0) a,
if(array_contains(id_course,courses[1]),1,0) b,
if(array_contains(id_course,courses[2]),1,0) c,
if(array_contains(id_course,courses[3]),1,0) e,
if(array_contains(id_course,courses[4]),1,0) d,
if(array_contains(id_course,courses[5]),1,0) f
from
(select
b.id id,b.id_course id_course,a.courses courses
from
(select collect_set(trim(course))courses from interview04
) a join
(select
id,collect_set(trim(course)) id_course	
from interview04
group by id
) b
) c;

最终结果

id      a       b       c       e       d       f
1       1       1       1       1       0       0
2       1       0       1       0       1       1
3       1       1       1       1       0       0

5.现有如下格式的一份数据,需求:编写Hive的HQL语句求出每个店铺的当月销售额和累计到当月的总销售额

uid month sale

a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250

建表与导入数据

create table if not exists interview05(uid string,month string,sale int) row format delimited fields terminated by ",";
load data local inpath "/home/hadoop/hive_data/interview05" into table interview05;

最终SQL:

与第一题类似,先求总销售额,再进行开窗算出累计到当月的总销售额

select uid,month,
sum(sales) over(distribute by uid sort by month)sumsales,sales
from 
(select uid,month,sum(sale)sales 
from interview05 
group by uid,month)tmp;

结果:

uid     month   sumsales        sales
a       01      350     350
a       02      5350    5000
a       03      5950    600
b       01      7800    7800
b       02      10300   2500
c       01      470     470
c       02      1100    630
  • 12
    点赞
  • 1
    评论
  • 28
    收藏
  • 一键三连
    一键三连
  • 扫一扫,分享海报

相关推荐
©️2020 CSDN 皮肤主题: 博客之星2020 设计师:CY__ 返回首页
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、C币套餐、付费专栏及课程。

余额充值