Preface
- Build a Docker-based cluster image to provide a base environment for big data development.
- Implement a Hive UDF in Python.
- Tips
- Goals
P1: Set up the Docker-based cluster image
docker pull limengjiao029/hive:v0.1
docker run --privileged -tid -p 8000:8000 -p 8088:8088 -p 8042:8042 -p 50070:50070 limengjiao029/hive:v0.1
docker exec -ti containerID /bin/bash
$HADOOP_HOME/sbin/start-all.sh
hive --service metastore &
hive
P2: Develop and test a Python UDF
create external table person(
name string,
idcard string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED as TEXTFILE;
neil 411326199402110030
pony 41132519950911004x
jcak 12312423454556561
tony 412345671234908
hdfs dfs -put person.txt /user/hive/warehouse/person
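Because the table is declared with FIELDS TERMINATED BY '\t', the fields in person.txt have to be separated by real tab characters, which are easy to lose when copying the sample rows. Here is a minimal sketch that writes the four sample rows with explicit tabs. The helper name `write_person_file` is hypothetical and not part of the original setup.

```python
import csv

# Sample rows from this post: (name, idcard).
ROWS = [
    ("neil", "411326199402110030"),
    ("pony", "41132519950911004x"),
    ("jcak", "12312423454556561"),
    ("tony", "412345671234908"),
]


def write_person_file(path="person.txt", rows=ROWS):
    """Write rows as tab-separated text, one record per line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


if __name__ == "__main__":
    write_person_file()
```

Running the script once in the directory from which `hdfs dfs -put` is issued should leave a person.txt that matches the table's delimiter.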
select idcard,
  case when length(idcard) = 18 then
         case when substring(idcard,-2,1) % 2 = 1 then 'male'
              when substring(idcard,-2,1) % 2 = 0 then 'female'
              else 'unknown' end
       when length(idcard) = 15 then
         case when substring(idcard,-1,1) % 2 = 1 then 'male'
              when substring(idcard,-1,1) % 2 = 0 then 'female'
              else 'unknown' end
       else 'invalid' end
from person;
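The nested CASE expression encodes one rule: an 18-character ID carries the gender in its second-to-last digit, a 15-character ID carries it in its last digit, odd means male and even means female. The same rule written as a small Python function can be checked against sample values before running the query. This is a sketch that mirrors the query using English labels; the function name `gender_from_idcard` is made up for illustration.

```python
def gender_from_idcard(idcard):
    """Mirror the CASE expression: pick the gender digit by ID length."""
    if len(idcard) == 18:
        digit = idcard[-2]  # second-to-last character, like substring(idcard,-2,1)
    elif len(idcard) == 15:
        digit = idcard[-1]  # last character, like substring(idcard,-1,1)
    else:
        return "invalid"
    if digit not in "0123456789":
        # Corresponds to the inner ELSE, where neither parity test matches.
        return "unknown"
    return "male" if int(digit) % 2 == 1 else "female"
```

For the sample rows, this gives male for neil, female for pony and tony, and invalid for the 17-character ID.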
import sys

# Hive streams each row to stdin as one tab-separated line.
for line in sys.stdin:
    detail = line.strip().split("\t")
    if len(detail) != 2:
        # Skip rows that do not have exactly two fields.
        continue
    name, idcard = detail
    if len(idcard) == 15:
        # 15-digit IDs: the last digit encodes gender (even = female).
        gender = "female" if int(idcard[-1]) % 2 == 0 else "male"
    elif len(idcard) == 18:
        # 18-digit IDs: the second-to-last digit encodes gender.
        gender = "female" if int(idcard[-2]) % 2 == 0 else "male"
    else:
        gender = "invalid ID number"
    # Emit the three output columns as one tab-separated line.
    print("\t".join([name, idcard, gender]))
add file /home/kngines/hive_test/person.py;
In Hive, a UDF written in Python has to be run through the TRANSFORM clause:
select
transform(name,idcard) USING 'python person.py' AS (name,idcard,gender)
from person;
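TRANSFORM serializes the selected columns as tab-separated lines, pipes them to the script's stdin, and splits each line the script prints on tabs to fill the AS columns. The sketch below imitates that round trip locally so a script can be tried without a cluster. It is an approximation under assumptions: the helper `simulate_transform` is hypothetical, and it ignores Hive details such as NULL handling and type conversion.

```python
import subprocess
import sys


def simulate_transform(script_path, rows):
    """Feed rows to a script roughly the way TRANSFORM does and parse its output.

    rows: iterable of tuples of strings (the TRANSFORM input columns).
    Returns a list of tuples (the columns the script printed).
    """
    # One tab-separated line per input row.
    stdin_text = "".join("\t".join(row) + "\n" for row in rows)
    result = subprocess.run(
        [sys.executable, script_path],
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,  # raise if the script exits with an error
    )
    # Split each output line back into columns.
    return [tuple(line.split("\t")) for line in result.stdout.splitlines()]
```

Calling `simulate_transform("person.py", [("neil", "411326199402110030")])` in the directory that holds the script would be expected to return one three-column tuple per valid input row.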
P3: Results and miscellany
- Result details
- TRANSFORM syntax
- Note that the number of columns passed to TRANSFORM and the number listed in AS need not match.
SELECT TRANSFORM (<columns>)
USING 'python <python_script>'
AS (<columns>)
FROM <table>;
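To make the template concrete, here is a small sketch that fills in the placeholders from Python lists. It only formats a string. The function `build_transform_query` and its parameters are illustrative assumptions, and it does no quoting or validation of identifiers.

```python
def build_transform_query(table, in_cols, script, out_cols):
    """Fill the SELECT TRANSFORM template; in_cols and out_cols may differ in length."""
    return (
        "SELECT TRANSFORM ({})\n"
        "USING 'python {}'\n"
        "AS ({})\n"
        "FROM {};"
    ).format(", ".join(in_cols), script, ", ".join(out_cols), table)


# Two input columns, three output columns, as in the person example.
print(build_transform_query("person", ["name", "idcard"], "person.py",
                            ["name", "idcard", "gender"]))
```

Keeping the input and output column lists as separate arguments reflects that the script alone decides how many columns come back.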
# free shows how much memory on the system is free and in use
free -h
# Show disk usage details
df -h
set hive.cli.print.header=true;
Other Refs
- Resource file path issues when referencing files from a Hive UDF
- big-data-europe
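- A side note
- I originally planned to run this experiment on a personal server, since starting and stopping a virtual machine every time is tedious. On an ECS instance with a single core and 2 GB of memory, however, the HQL that uses the Python UDF was inexplicably Killed; the cause turned out to be insufficient memory.
- So the experiment had to be done in a local virtual machine. Fortunately my laptop has 32 GB of memory and a 1 TB SSD, so that was no trouble.
- Bragging aside, coming up next:
- Experiments on referencing resources from a custom function, such as table resources, file resources, and data resources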
- xj.wlmq.20190401.01:23