Hive tables: joins, aggregation, where vs. having filtering, and subqueries

First, create two text files:
a.txt
a,1
b,2
c,3
d,4

b.txt
a,xx
b,yy
d,zz
e,pp

I. Basic table joins: creating the tables and loading data

create table t_a(name string,numb int)
row format delimited
fields terminated by ',';

create table t_b(name string,nick string)
row format delimited
fields terminated by ',';

load data local inpath '/home/had/a.txt' into table t_a;
load data local inpath '/home/had/b.txt' into table t_b;

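-- The various joins
-- Inner join: returns only the rows whose name appears in both tables
-- (without the on condition, this query would produce the Cartesian product of the two tables)

select a.*,b.*
from t_a a join t_b b
on a.name=b.name;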

-- Left outer join (left outer join): every row of the left table is kept, and right-table columns with no match are filled with null

select a.*,b.*
from
t_a a left outer join t_b b
on a.name=b.name;

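-- Right outer join (right outer join): every row of the right table is kept, and left-table columns with no match are filled with null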

select a.*,b.*
from
t_a a right outer join t_b b
on a.name=b.name;

-- Full outer join: rows from both tables are kept, and columns with no match on either side are filled with null

select a.*,b.*
from 
t_a a full outer join t_b b
on a.name = b.name;
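
One practical use of the full outer join is finding the names that exist in only one of the two tables. The following is a minimal sketch against the t_a and t_b tables defined above; the is null filter is an addition for illustration and not part of the original walkthrough.

```sql
-- Keep only the rows that found no partner on the other side.
select a.*,b.*
from
t_a a full outer join t_b b
on a.name = b.name
where a.name is null or b.name is null;
-- With the sample a.txt and b.txt this leaves two rows (order not guaranteed):
--   c     3     NULL  NULL
--   NULL  NULL  e     pp
```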

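-- Left semi join: returns the left-table rows that have a match in the right table. The select clause of a left semi join cannot reference columns of the right table. This is a Hive extension and is not part of standard SQL.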

select a.*
from
t_a a left semi join t_b b
on a.name=b.name;
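
For comparison, the same rows can be requested with an in subquery, which is how this filter is usually written in standard SQL. This is a sketch that assumes a Hive version with subquery support in the where clause (0.13 or later).

```sql
-- Equivalent of the left semi join above: keep the t_a rows whose name also appears in t_b.
select a.*
from t_a a
where a.name in (select b.name from t_b b);
-- With the sample files this returns the rows for a, b and d.
```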

II. Grouped aggregation: group by queries and the difference between where and having filtering

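-- 1. Grouped aggregation queries
-- Try select upper("abc"); and run show functions; to list the built-in functions, which are the essence of SQL
select ip,upper(url),access_time -- converts url to upper case on every row; this expression is evaluated row by row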
from t_pv_log;
-- 2. Total number of visits for each URL

select url,count(1) as cnts  -- this expression is evaluated once per group, over the grouped data
from t_pv_log
group by url;

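-- For each URL, find the largest ip among its visitors (the sample data is not ideal, so treat this as illustrative; if ip is stored as a string, max compares it lexicographically)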

select url,max(ip)
from t_pv_log
group by url;


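-- For each user (ip) and page (url), find the latest access time among all of that user's records for the page
select ip,url,max(access_time)  -- in this data the value is only precise to the date, not to minutes or seconds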
from t_pv_log
group by ip,url;


-- Create a partitioned table mapped onto the data above
create table t_access(ip string,url string,access_time string)
partitioned by (dt string)
row format delimited
fields terminated by ',';

-- Load the data, one log file per dt partition
load data local inpath '/home/had/access.log.0804' into table t_access partition(dt='2017-08-04');
load data local inpath '/home/had/access.log.0805' into table t_access partition(dt='2017-08-05');
load data local inpath '/home/had/access.log.0806' into table t_access partition(dt='2017-08-06');

-- View the table's partitions
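
The partitions created by the three load statements can be listed with show partitions. A minimal sketch, assuming the loads into t_access succeeded:

```sql
-- List the partitions of t_access, one line per partition.
show partitions t_access;
-- Expected lines for the three loads above:
--   dt=2017-08-04
--   dt=2017-08-05
--   dt=2017-08-06
```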

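-- For the days after August 4, find the total number of visits to http://www.edu360.cn/job and the largest ip among its visitors
Approach 1: filter on url in where, then filter on dt in having after grouping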

select dt,count(1),max(ip)
from t_access
where url='http://www.edu360.cn/job'
group by dt having dt>'2017-08-04';

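Approach 2: same as approach 1, but also output the URL; since url is not in the group by, it has to be wrapped in an aggregate such as max()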

select dt,max(url),count(1),max(ip)
from t_access
where url='http://www.edu360.cn/job'
group by dt having dt>'2017-08-04';

Approach 3: move the dt condition into where and group by dt,url, so url can be selected directly

select dt,url,count(1),max(ip)
from t_access
where url='http://www.edu360.cn/job' and dt>'2017-08-04'
group by dt,url;

-- For the days after August 4, find the total number of visits per day and per page, and the largest visitor ip

select dt,url,count(1),max(ip)
from t_access
where dt>'2017-08-04'
group by dt,url;

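-- For the days after August 4, find the total number of visits per day and per page, and the largest visitor ip, returning only the records whose total visit count is greater than 2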

select dt,url,count(1) as cnts,max(ip)
from t_access
where dt>'2017-08-04'
group by dt,url having cnts>2; -- having in action: the condition is applied after the aggregates have been computed

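The difference between where filtering and having filtering:
The where clause specifies conditions on rows, while the having clause specifies conditions on groups. That is:
where clause = conditions on rows: the rows are filtered first and then grouped
having clause = conditions on groups: the rows are grouped first and then the groups are filtered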
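
Because where runs before grouping, it cannot see aggregate values, so a condition on count(1) has to go in having. The sketch below illustrates this on the t_access table; the threshold of 2 is only an example value.

```sql
-- Not valid: an aggregate cannot appear in where, because no groups exist yet at that stage.
-- select url,count(1) from t_access where count(1)>2 group by url;

-- Valid: row-level condition in where, group-level condition in having.
select url,count(1) as cnts
from t_access
where dt>'2017-08-04'   -- evaluated per row, before grouping
group by url
having count(1)>2;      -- evaluated per group, after aggregation
```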

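Method 2: subquery (the same result as the having query, with the condition on cnts moved to the outer where)
select dt,url,cnts,max_ip
from
-- the subquery result acts as a table here
(select dt,url,count(1) as cnts,max(ip) as max_ip
from t_access
where dt>'2017-08-04'
group by dt,url) tmp
where tmp.cnts>2;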
