Hive data types (joins, functions, HQL)
Join operations
1. Preparing the join data
(1) Create two student tables (studenta, studentb)
CREATE TABLE studenta(
id INT,
name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
CREATE TABLE studentb(
id INT,
age INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
(2) Sample data
studenta.txt
10001 shiny
10002 mark
10003 angel
10005 ella
10009 jack
10014 eva
10018 judy
10020 cendy
studentb.txt
10001 23
10004 22
10007 24
10008 21
10009 25
10012 25
10015 20
10018 19
10020 26
(3) Load the data
LOAD DATA LOCAL INPATH '/home/tyhawk/studenta.txt' INTO TABLE studenta;
LOAD DATA LOCAL INPATH '/home/tyhawk/studentb.txt' INTO TABLE studentb;
2. Inner join (JOIN)
Purpose: returns only the rows that satisfy the join condition on both sides.
Example
SELECT * FROM studenta a JOIN studentb b ON a.id=b.id;
Total MapReduce CPU Time Spent: 3 seconds 690 msec
OK
10001 shiny 10001 23
10009 jack 10009 25
10018 judy 10018 19
10020 cendy 10020 26
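The inner-join result above can be reproduced with a small Python sketch (the dicts mirror the two sample tables; the join itself is plain set intersection, not Hive):

```python
# Sample tables from this article, keyed by id
studenta = {10001: "shiny", 10002: "mark", 10003: "angel", 10005: "ella",
            10009: "jack", 10014: "eva", 10018: "judy", 10020: "cendy"}
studentb = {10001: 23, 10004: 22, 10007: 24, 10008: 21, 10009: 25,
            10012: 25, 10015: 20, 10018: 19, 10020: 26}

# INNER JOIN: only ids present in both tables survive
inner = [(i, studenta[i], i, studentb[i])
         for i in sorted(studenta.keys() & studentb.keys())]
for row in inner:
    print(*row, sep="\t")   # 4 matched rows, same as the Hive output
```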
3. Outer joins (OUTER JOIN)
3.1 Left outer join (LEFT JOIN)
Purpose
(1) The left table drives the match: every row of the left table appears in the result.
(2) Right-side columns with no match are filled with NULL.
(3) The number of rows returned equals the row count of the left table.
Example
SELECT * FROM studenta a LEFT JOIN studentb b on a.id=b.id;
Total MapReduce CPU Time Spent: 2 seconds 570 msec
OK
10001 shiny 10001 23
10002 mark NULL NULL
10003 angel NULL NULL
10005 ella NULL NULL
10009 jack 10009 25
10014 eva NULL NULL
10018 judy 10018 19
10020 cendy 10020 26
3.2 Right outer join (RIGHT JOIN)
Purpose
(1) The right table drives the match: every row of the right table appears in the result.
(2) Left-side columns with no match are filled with NULL.
(3) The number of rows returned equals the row count of the right table.
Example
SELECT * FROM studenta a RIGHT JOIN studentb b ON a.id=b.id;
Total MapReduce CPU Time Spent: 2 seconds 670 msec
OK
10001 shiny 10001 23
NULL NULL 10004 22
NULL NULL 10007 24
NULL NULL 10008 21
10009 jack 10009 25
NULL NULL 10012 25
NULL NULL 10015 20
10018 judy 10018 19
10020 cendy 10020 26
3.3 Full outer join (FULL JOIN)
Purpose
(1) Both tables drive the match.
(2) Unmatched columns on either side are filled with NULL.
(3) The number of rows returned equals the number of distinct join keys across both tables.
Example
SELECT * FROM studenta a FULL JOIN studentb b ON a.id=b.id;
Total MapReduce CPU Time Spent: 8 seconds 50 msec
OK
10001 shiny 10001 23
10002 mark NULL NULL
10003 angel NULL NULL
NULL NULL 10004 22
10005 ella NULL NULL
NULL NULL 10007 24
NULL NULL 10008 21
10009 jack 10009 25
NULL NULL 10012 25
10014 eva NULL NULL
NULL NULL 10015 20
10018 judy 10018 19
10020 cendy 10020 26
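A quick Python sketch of the FULL JOIN semantics (the dicts mirror the sample tables; `None` stands in for Hive's NULL):

```python
studenta = {10001: "shiny", 10002: "mark", 10003: "angel", 10005: "ella",
            10009: "jack", 10014: "eva", 10018: "judy", 10020: "cendy"}
studentb = {10001: 23, 10004: 22, 10007: 24, 10008: 21, 10009: 25,
            10012: 25, 10015: 20, 10018: 19, 10020: 26}

# FULL JOIN: every id from either side appears exactly once;
# the missing side contributes None (Hive's NULL)
full = [(i if i in studenta else None, studenta.get(i),
         i if i in studentb else None, studentb.get(i))
        for i in sorted(studenta.keys() | studentb.keys())]
print(len(full))   # 13 rows: 4 matched, 4 left-only, 5 right-only
```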
4. Left semi join (LEFT SEMI JOIN)
Purpose
(1) Returns the left table's rows (and only the left table's columns) that satisfy the join condition.
(2) Filters on the right table may only appear in the ON clause; they cannot go in the WHERE clause, the SELECT list, or anywhere else,
because when a join statement has a WHERE clause, the JOIN runs first and the WHERE clause is applied afterwards.
Example
SELECT * FROM studenta a LEFT SEMI JOIN studentb b ON a.id=b.id;
Total MapReduce CPU Time Spent: 3 seconds 250 msec
OK
10001 shiny
10009 jack
10018 judy
10020 cendy
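LEFT SEMI JOIN behaves like an IN/EXISTS filter. A Python sketch over the same sample ids:

```python
studenta = {10001: "shiny", 10002: "mark", 10003: "angel", 10005: "ella",
            10009: "jack", 10014: "eva", 10018: "judy", 10020: "cendy"}
b_ids = {10001, 10004, 10007, 10008, 10009, 10012, 10015, 10018, 10020}

# LEFT SEMI JOIN: keep left rows that have a match, but emit only the
# left table's columns, each left row at most once
semi = [(i, studenta[i]) for i in sorted(studenta) if i in b_ids]
for row in semi:
    print(*row, sep="\t")
```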
Data types
1. Primitive data types
Numeric, Boolean, string, and date types
2. Complex data types
For more on primitive versus complex types, see: https://www.cnblogs.com/linzhong/p/8294980.html
In short, the difference between the two is the kind of data they store.
Arrays, maps, and structs
2.1 Example table using complex types
CREATE TABLE student(
name STRING,
favors ARRAY<STRING>,
scores MAP<STRING, FLOAT>,
address STRUCT<province:STRING,city:STRING,detail:STRING,zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ';'
MAP KEYS TERMINATED BY ':' ;
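The three delimiters work together to carve each line of the data file into nested values. A Python sketch of that parsing (the sample line below is hypothetical, shaped to match the student DDL above):

```python
# '\t' separates top-level fields, ';' separates collection items
# (ARRAY elements, MAP entries, and STRUCT members alike), and ':'
# separates a MAP key from its value.
line = "shiny\treading;coding\tchinese:90;math:100\tshandong;qingdao;licang;266000"
name, favors_s, scores_s, address_s = line.split("\t")
favors = favors_s.split(";")                               # ARRAY<STRING>
scores = {k: float(v) for k, v in
          (kv.split(":") for kv in scores_s.split(";"))}   # MAP<STRING,FLOAT>
province, city, detail, zip_code = address_s.split(";")    # STRUCT members
```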
2.2 Complex data type examples
2.2.1 ARRAY
- Create the table
CREATE TABLE employee(
name STRING,
age INT,
work_locations ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','; -- separator between ARRAY elements
- Sample data
array.txt
shiny 23 beijing,tianjin,qingdao
jack 34 shanghai,guangzhou
mark 26 beijing,xian
ella 21 beijing
judy 30 shanghai,hangzhou,chongqing
cendy 28 beijing,shanghai,dalian,chengdu
- Load the data
LOAD DATA LOCAL INPATH '/home/tyhawk/array.txt'
INTO TABLE employee;
- Query the data
SELECT * FROM employee;
SELECT work_locations FROM employee;
SELECT work_locations[0] FROM employee;
2.2.2 MAP
- Create the table
CREATE TABLE score(
name STRING,
scores MAP<STRING,INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',' -- separator between key/value pairs in the MAP
MAP KEYS TERMINATED BY ':'; -- separator between a MAP key and its value
- Sample data
scores.txt
shiny chinese:90,math:100,english:99
mark chinese:89,math:56,english:87
judy chinese:94,math:78,english:81
ella chinese:54,math:23,english:48
jack chinese:100,math:95,english:69
cendy chinese:67,math:83,english:45
- Load the data
LOAD DATA LOCAL INPATH '/home/tyhawk/scores.txt'
INTO TABLE score;
- Query the data
SELECT * FROM score;
SELECT scores FROM score;
SELECT scores['math'] FROM score;
2.2.3 STRUCT
- Create the table
CREATE TABLE coursescore(
id INT,
course STRUCT<name:STRING,score:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','; -- separator between STRUCT members
- Sample data
coursescore.txt
1 chinese,100
2 math,98
3 english,99
4 computer,78
- Load the data
LOAD DATA LOCAL INPATH '/home/lan/data/coursescore.txt'
INTO TABLE coursescore;
- Query the data
SELECT * FROM coursescore;
SELECT course FROM coursescore;
SELECT course.name FROM coursescore;
SELECT course.score FROM coursescore;
Functions
1. Built-in functions
- List the built-in functions
SHOW FUNCTIONS;
- Show a function's description
DESC FUNCTION trim;
- Show a function's extended description
DESC FUNCTION EXTENDED trim;
2. User-defined functions
- When to use them
When Hive's built-in functions cannot satisfy a business requirement, consider writing a user-defined function.
- What kinds are there?
(1) UDF (user-defined function): operates on a single row and produces a single row as output (e.g., mathematical and string functions).
(2) UDAF (user-defined aggregation function): accepts multiple input rows and produces one output row (e.g., COUNT, MAX).
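As a loose analogy (plain Python, not Hive), the difference is one output per row versus one output per group:

```python
rows = [3, -1, 4, -1, 5]

# UDF-like: a per-row function; one output value for each input row
udf_like = [abs(x) for x in rows]

# UDAF-like: an aggregate; many input rows collapse into one output value
udaf_like = max(rows)
```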
3. Use case: parsing JSON data
Put simply, JSON parsing here is a kind of string parsing: given one long string of data, when we only want some of the fields inside it, we parse the JSON to pull them out.
See also: https://www.cnblogs.com/BlueSkyyj/p/9720346.html
https://www.cnblogs.com/qiaoyihang/p/8729368.html
3.1 Using the built-in function
(1) Create the table
CREATE TABLE rat_json(line STRING)
ROW FORMAT DELIMITED;
(2) Sample data
rating.json
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
(3) Load the data
LOAD DATA LOCAL INPATH '/home/tyhawk/rating.json'
INTO TABLE rat_json;
(4) Parse the JSON into four fields and insert them into a new table rate (which holds the processed data, so it needs four columns)
CREATE TABLE rate(
movie INT,
rate INT,
ts INT,
userid INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
(5) Parse the JSON and insert the results into rate
INSERT INTO TABLE rate
SELECT
get_json_object(line,'$.movie') AS movie,
get_json_object(line,'$.rate') AS rate,
get_json_object(line,'$.timeStamp') AS ts,
get_json_object(line,'$.uid') AS userid
FROM rat_json;
(6) Query the parsed JSON data
SELECT * FROM rate;
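For intuition, `get_json_object(line, '$.movie')` behaves roughly like this Python stand-in (it only handles simple top-level `$.key` paths; the real function supports a much richer JSONPath syntax):

```python
import json

def get_json_object_sketch(line, path):
    # Rough stand-in for Hive's get_json_object, not its real implementation:
    # strip the leading '$.' and look up the top-level key
    assert path.startswith("$.")
    return json.loads(line).get(path[2:])

line = '{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}'
print(get_json_object_sketch(line, "$.movie"))   # prints 1193
```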
3.2 Using a custom UDF
(with Eclipse)
(1) Create a project HiveDemo and add every JAR under apache-hive-1.2.1-bin\lib (from the Hive distribution) to the build path
- Create a Java project
- Add the JARs
(2) Write a simple Java class that extends org.apache.hadoop.hive.ql.exec.UDF and overloads the evaluate method
package com.tyhawk.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;

// Parses one field out of a JSON-formatted string
public class JsonUDF extends UDF {
    // evaluate must be public, or Hive cannot find it
    public String evaluate(String jsonStr, String field) throws JSONException {
        JSONObject json = new JSONObject(jsonStr);
        return json.getString(field);
    }
}
(3) Package it as json.jar and upload it to the Linux server
After building the JAR, transfer it with rz or over SFTP
(4) Add the JAR to Hive's classpath
hive> add jar /home/tyhawk/json.jar;
Added [/home/tyhawk/json.jar] to class path
Added resources: [/home/tyhawk/json.jar]
hive> list jar;
/home/tyhawk/json.jar //confirm the JAR was added successfully
(5) Create a temporary function tied to the compiled class
hive> create temporary function jsontostring as 'com.tyhawk.hive.JsonUDF';
OK
Time taken: 0.011 seconds
//jsontostring: the name of the temporary function
//'com.tyhawk.hive.JsonUDF' is the fully qualified 'package.class' name
(6) Parse the JSON and store the results in a rates table (rates holds the processed data, so it needs four columns)
CREATE TABLE rates AS SELECT
jsontostring(line,'movie') AS movie,
jsontostring(line,'rate') AS rate,
jsontostring(line,'timeStamp') AS ts,
jsontostring(line,'uid') AS userid
FROM rat_json;
(7) The custom function is now ready to use
hive> select * from rates limit 10; //LIMIT caps the number of rows shown
OK
1193 5 978300760 1
661 3 978302109 1
914 3 978301968 1
3408 4 978300275 1
2355 5 978824291 1
1197 3 978302268 1
1287 5 978302039 1
2804 5 978300719 1
594 4 978302268 1
919 4 978301368 1
Time taken: 0.063 seconds, Fetched: 10 row(s)
Hive shell
- Options
(1) -i: initialize the session from a file of HQL
(2) -e: execute HQL given on the command line
(3) -f: execute an HQL script file
(4) -v: echo the executed HQL statements to the console
(5) -p: connect to Hive Server on the given port number
(6) -hiveconf x=y: set Hive/Hadoop configuration variables
(7) -S: silent mode; run without printing logs
- Executing HQL from the command line
[[email protected] ~]$ cd
[[email protected] data]$ touch studenta
//write table studenta from database tyhawk_test into the newly created empty file
[[email protected] ~]$ hive -e 'select * from tyhawk_test.studenta' > studenta
//if studentaaa does not exist in the current directory, it is created automatically
[[email protected] ~]$ hive -e 'select * from tyhawk_test.studenta' > studentaaa
[[email protected] ~]$ cat studentaaa
10001 shiny
10002 mark
10003 angel
10005 ella
10009 jack
10014 eva
10018 judy
10020 cendy
- Running a script file
The file:
hive.hql
[[email protected] ~]$ vim hive.hql
[[email protected] ~]$ cat hive.hql
create database if not exists tyhawk_0322;
use tyhawk_0322;
create table if not exists stu(id int,name string)
row format delimited fields terminated by ','
stored as textfile;
load data local inpath '/home/tyhawk/stu.txt' into table stu;
insert overwrite directory '/tyhawk_0322/stu' row format delimited fields terminated by '\t' select * from stu;
OVERWRITE replaces the existing contents; without it, the data would be appended instead.
stu.txt
1,张三
2,李四
3,rose
4,jack
5,haha
6,hehe
7,aaaaaaaaaaaaa
- Execute the HQL script
[[email protected] ~]$ hive -f hive.hql
//view the result on HDFS
[[email protected] ~]$ hadoop fs -ls /tyhawk_0322/stu
Found 1 items
-rwxr-xr-x 1 tyhawk supergroup 52 2019-03-22 17:25 /tyhawk_0322/stu/000000_0
[[email protected] ~]$ hadoop fs -cat /tyhawk_0322/stu/000000_0
1 张三
2 李四
3 rose
4 jack
5 haha
6 hehe
7 aaaaaaaaaaaaa
//view the result in Hive
hive> use tyhawk_0322;
OK
Time taken: 0.026 seconds
hive> select * from stu;
OK
1 张三
2 李四
3 rose
4 jack
5 haha
6 hehe
7 aaaaaaaaaaaaa
Time taken: 0.188 seconds, Fetched: 7 row(s)