Hive Basics: Exporting Data

After analyzing data in Hive, we sometimes need to export the results to a local file for other uses. Hive provides several ways to export results; the sections below walk through each of them in detail.

1. Exporting to a local directory

hive (default)> insert overwrite local directory '/opt/datas/hive_exp_emp'
              > select * from default.emp ;
Query ID = hive_20190217171414_39007610-8451-4c28-86a1-928f6f3aec5c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1550060164760_0005, Tracking URL = http://node1:8088/proxy/application_1550060164760_0005/
Kill Command = /opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib/hadoop/bin/hadoop job  -kill job_1550060164760_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-02-17 17:15:01,010 Stage-1 map = 0%,  reduce = 0%
2019-02-17 17:15:26,922 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.91 sec
MapReduce Total cumulative CPU time: 2 seconds 910 msec
Ended Job = job_1550060164760_0005
Copying data to local directory /opt/datas/hive_exp_emp
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.91 sec   HDFS Read: 4776 HDFS Write: 661 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 910 msec
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
Time taken: 76.089 seconds

Check the file saved in the local directory:

[[email protected] datas]# cd /opt/datas/hive_exp_emp
[[email protected] hive_exp_emp]# ll
total 4
-rw-r--r-- 1 hive hive 661 Feb 17 17:15 000000_0
[[email protected] hive_exp_emp]# cat 000000_0 
7369SMITHCLERK79021980-12-17800.0\N20
7499ALLENSALESMAN76981981-2-201600.0300.030
7521WARDSALESMAN76981981-2-221250.0500.030
7566JONESMANAGER78391981-4-22975.0\N20
7654MARTINSALESMAN76981981-9-281250.01400.030
7698BLAKEMANAGER78391981-5-12850.0\N30
7782CLARKMANAGER78391981-6-92450.0\N10
7788SCOTTANALYST75661987-4-193000.0\N20
7839KINGPRESIDENT\N1981-11-175000.0\N10
7844TURNERSALESMAN76981981-9-81500.00.030
7876ADAMSCLERK77881987-5-231100.0\N20
7900JAMESCLERK76981981-12-3950.0\N30
7902FORDANALYST75661981-12-33000.0\N20
7934MILLERCLERK77821982-1-231300.0\N10

As shown, the exported file is named 000000_0 and the data is separated by Hive's default field delimiter, the non-printing control character \001 (Ctrl-A), which is why the fields appear to run together in the `cat` output.
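A minimal Python sketch of reading such a default-delimited export (`parse_hive_default_row` is just an illustrative helper name; the sample row is taken from the emp data above, with the invisible \001 written out explicitly):

```python
# Hive's default field delimiter is the non-printing \x01 (Ctrl-A),
# and SQL NULL is serialized as the literal two-character string "\N".
def parse_hive_default_row(line):
    """Split one line of a default-delimited Hive export into fields,
    mapping the \\N marker back to None."""
    fields = line.rstrip("\n").split("\x01")
    return [None if f == r"\N" else f for f in fields]

# One row from the export above, with the invisible \x01 made explicit:
row = "7369\x01SMITH\x01CLERK\x017902\x011980-12-17\x01800.0\x01\\N\x0120"
print(parse_hive_default_row(row))
# → ['7369', 'SMITH', 'CLERK', '7902', '1980-12-17', '800.0', None, '20']
```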
The export above did not specify a format. We can use the same row-format syntax as in CREATE TABLE to define the format of the exported data:

insert overwrite local directory '/opt/datas/hive_exp_emp2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
COLLECTION ITEMS TERMINATED BY '\n'
select * from default.emp ;

The execution result is as follows:

hive (default)> insert overwrite local directory '/opt/datas/hive_exp_emp2'
              > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
              > COLLECTION ITEMS TERMINATED BY '\n'
              > select * from default.emp ;
Query ID = hive_20190217172424_622c001a-7ca4-4a49-94dc-468cdd6a0475
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1550060164760_0006, Tracking URL = http://node1:8088/proxy/application_1550060164760_0006/
Kill Command = /opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib/hadoop/bin/hadoop job  -kill job_1550060164760_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-02-17 17:24:43,307 Stage-1 map = 0%,  reduce = 0%
2019-02-17 17:25:38,391 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.55 sec
MapReduce Total cumulative CPU time: 1 seconds 550 msec
Ended Job = job_1550060164760_0006
Copying data to local directory /opt/datas/hive_exp_emp2
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.55 sec   HDFS Read: 4814 HDFS Write: 661 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 550 msec
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
Time taken: 95.151 seconds

Check the result:

[[email protected] hive_exp_emp]# cd /opt/datas/hive_exp_emp2
[[email protected] hive_exp_emp2]# ls
000000_0
[[email protected] hive_exp_emp2]# cat 000000_0 
7369    SMITH   CLERK   7902    1980-12-17      800.0   \N      20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  \N      20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  \N      30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  \N      10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  \N      20
7839    KING    PRESIDENT       \N      1981-11-17      5000.0  \N      10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  \N      20
7900    JAMES   CLERK   7698    1981-12-3       950.0   \N      30
7902    FORD    ANALYST 7566    1981-12-3       3000.0  \N      20
7934    MILLER  CLERK   7782    1982-1-23       1300.0  \N      10

As shown, the exported query results are now separated by tabs.
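Because this second export is tab-delimited, it can be loaded directly with standard tooling. A minimal Python sketch using the csv module (the sample here simulates two lines of the file; in practice you would open `/opt/datas/hive_exp_emp2/000000_0`):

```python
import csv
import io

# Two lines copied from the tab-delimited export above.
sample = (
    "7369\tSMITH\tCLERK\t7902\t1980-12-17\t800.0\t\\N\t20\n"
    "7839\tKING\tPRESIDENT\t\\N\t1981-11-17\t5000.0\t\\N\t10\n"
)

rows = []
for record in csv.reader(io.StringIO(sample), delimiter="\t"):
    # Hive serializes SQL NULL as the literal string "\N"
    rows.append([None if col == r"\N" else col for col in record])

print(rows[1])  # KING's mgr and comm fields come back as None
```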

2. Exporting from the command line

Another approach is to invoke the hive command directly with the -e option and redirect the output to a local file.

[[email protected] datas]# sudo -u hive hive -e "select * from default.emp ;" > /opt/datas/exp_res.txt

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/jars/hive-common-1.1.0-cdh5.15.2.jar!/hive-log4j.properties
OK
Time taken: 3.743 seconds, Fetched: 14 row(s)
[[email protected] datas]# ls
dept.txt  emp.txt  exp_res.txt  hive_exp_emp  hive_exp_emp2
[[email protected] datas]# cat exp_res.txt 
empno   ename   job     mgr     hiredate        sal     comm    deptno
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.

As shown, the results of a Hive query can also be exported to a local file this way. Note that the output file begins with a header row, because header printing was enabled in this session.
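If you drive this export from a script, the invocation can be composed programmatically. A sketch in Python (`build_hive_export_cmd` is a hypothetical helper; `hive.cli.print.header` is the real Hive setting that controls the header row seen in exp_res.txt above):

```python
import subprocess

def build_hive_export_cmd(sql, header=False):
    """Compose a `hive -e` invocation; the caller redirects stdout to a file.
    Illustrative helper only."""
    return [
        "hive",
        "--hiveconf", f"hive.cli.print.header={'true' if header else 'false'}",
        "-e", sql,
    ]

cmd = build_hive_export_cmd("select * from default.emp")
print(cmd)
# To actually run it on a node with the Hive CLI installed:
# with open("/opt/datas/exp_res.txt", "w") as out:
#     subprocess.run(cmd, stdout=out, check=True)
```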

3. Exporting to HDFS

The third approach saves the query results in HDFS; simply remove the local keyword.

hive (default)> insert overwrite directory '/user/hive/user1/hive_exp_emp'
              > select * from default.emp ;
Query ID = hive_20190217173939_19c69a15-2435-409f-b4d4-b12fa12def18
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1550060164760_0007, Tracking URL = http://node1:8088/proxy/application_1550060164760_0007/
Kill Command = /opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib/hadoop/bin/hadoop job  -kill job_1550060164760_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-02-17 17:40:09,266 Stage-1 map = 0%,  reduce = 0%
2019-02-17 17:40:34,981 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.51 sec
MapReduce Total cumulative CPU time: 2 seconds 510 msec
Ended Job = job_1550060164760_0007
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://node1:8020/user/hive/user1/hive_exp_emp/.hive-staging_hive_2019-02-17_17-39-22_500_7423280990058179421-1/-ext-10000
Moving data to: /user/hive/user1/hive_exp_emp
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.51 sec   HDFS Read: 4693 HDFS Write: 661 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 510 msec
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
Time taken: 75.012 seconds
hive (default)> dfs -ls /user/hive/user1/hive_exp_emp ;
Found 1 items
-rwxrwxr-t   3 hive hive        661 2019-02-17 17:40 /user/hive/user1/hive_exp_emp/000000_0

The exported file needs to be downloaded from HDFS to the local filesystem before it can be viewed:

[[email protected] datas]# sudo -u hive hdfs dfs -get /user/hive/user1/hive_exp_emp/000000_0   
[[email protected] datas]# ls
000000_0  dept.txt  emp.txt  exp_res.txt  hive_exp_emp  hive_exp_emp2
[[email protected] datas]# cat 000000_0 
7369SMITHCLERK79021980-12-17800.0\N20
7499ALLENSALESMAN76981981-2-201600.0300.030
7521WARDSALESMAN76981981-2-221250.0500.030
7566JONESMANAGER78391981-4-22975.0\N20
7654MARTINSALESMAN76981981-9-281250.01400.030
7698BLAKEMANAGER78391981-5-12850.0\N30
7782CLARKMANAGER78391981-6-92450.0\N10
7788SCOTTANALYST75661987-4-193000.0\N20
7839KINGPRESIDENT\N1981-11-175000.0\N10
7844TURNERSALESMAN76981981-9-81500.00.030
7876ADAMSCLERK77881987-5-231100.0\N20
7900JAMESCLERK76981981-12-3950.0\N30
7902FORDANALYST75661981-12-33000.0\N20
7934MILLERCLERK77821982-1-231300.0\N10
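The file fetched from HDFS again uses the invisible \001 delimiter, since no row format was specified. If a tab-separated copy is needed after the fact, a small post-processing sketch (file names here are illustrative, standing in for 000000_0):

```python
import os
import tempfile

def ctrl_a_to_tsv(src_path, dst_path):
    """Rewrite a default-delimited (\\x01) Hive export as tab-separated."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.replace("\x01", "\t"))

# Example round trip with a temp file standing in for 000000_0:
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "000000_0")
    dst = os.path.join(d, "000000_0.tsv")
    with open(src, "w", encoding="utf-8") as f:
        f.write("7369\x01SMITH\x01CLERK\n")
    ctrl_a_to_tsv(src, dst)
    print(open(dst).read())  # prints: 7369<TAB>SMITH<TAB>CLERK
```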

4. Using the Sqoop tool

Alternatively, the Sqoop tool can export data from Hive into a relational database; it can also import relational data into Hive or HDFS. Sqoop will be covered in detail in a later article.
