Extracting fields from JSON using Hive

Question:

I have a problem with my Hive code. I want to extract fields from JSON data using Hive. The following is a sample of the JSON format:

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"versionModified"{"machine":"123.dfer","founder":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}

I want to get the following fields:

ver
type
vehicle
ts
founder
state

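The problem is that founder and state are inside the array "Version". Can anyone help me get these values out? Also, sometimes a different key appears in place of versionModified.

For example, sometimes my data looks like this: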

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"anotherCriteria":{"engine":"123.dfer","developer":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}

Adding some more sample data below:

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC"{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}} 


{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP"{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}} 


{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX"{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}

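I need to put this data into different tables based on the key inside Version: if it is "BOX", the record goes into one table; if it is "GAP", it goes into another table...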

Refer to this question on using get_json_object in Hive: http://stackoverflow.com/questions/24447428/parse-json-arrays-using-hive –
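As a rough illustration of the get_json_object approach this comment points to, here is a minimal sketch. It assumes the raw JSON lines sit in a hypothetical single-column table raw_json(line STRING), and that the records are valid JSON with versionModified as the wrapper key. The table and column names are mine, not from the question.

```sql
-- Hypothetical staging table: one JSON document per row, kept as plain text.
CREATE TABLE raw_json (line STRING);

-- Each call walks a JSON path from the root ($) down to one scalar value.
-- Version[0] selects the first element of the Version array.
SELECT
  get_json_object(line, '$.Rtype.ver')     AS ver,
  get_json_object(line, '$.Rtype.type')    AS type,
  get_json_object(line, '$.Rtype.vehicle') AS vehicle,
  get_json_object(line, '$.Rtype.MOD.Version[0].versionModified.ts')      AS ts,
  get_json_object(line, '$.Rtype.MOD.Version[0].versionModified.founder') AS founder,
  get_json_object(line, '$.Rtype.MOD.Version[0].versionModified.state')   AS state
FROM raw_json;
```

get_json_object returns NULL when a path does not match, so rows whose wrapper key is something other than versionModified would come back with NULL for ts, founder and state. The path has to name the key explicitly, which is the limitation the question runs into.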

Please show your table schema –

Don't mix separate issues in one question. Ask about the INSERT issue separately –

Answer

You can use a JSON SerDe to get all the fields.

Just follow the steps below.

1. Download the JSON SerDe from http://www.congiu.net/hive-json-serde/1.3/

2. Add the JSON SerDe jar

hive> ADD jar /root/json-serde-1.3-jar-with-dependencies.jar;
Added [/root/json-serde-1.3-jar-with-dependencies.jar] to class path
Added resources: [/root/json-serde-1.3-jar-with-dependencies.jar]

3. Create the table

CREATE TABLE json_serde_table (
    Rtype struct<ver:int, os:string,type:string,vehicle:string,MOD: struct<Version:Array<struct<versionModified:struct<machine:string,founder:string,state:string,fashion:string,cdc:string,dof:string,ts:string>>>>> 
) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

4. Load the JSON file into the table with the query below

hive> load data local inpath '/root/json.txt' INTO TABLE json_serde_table; 
Loading data to table default.json_serde_table 
Table default.json_serde_table stats: [numFiles=1, totalSize=234] 
OK 
Time taken: 0.877 seconds

5. Run the query to get the result

hive> select Rtype.ver ver ,Rtype.type type ,Rtype.vehicle vehicle ,Rtype.MOD.version[0].versionModified.ts ts,Rtype.MOD.version[0].versionModified.founder founder,Rtype.MOD.version[0].versionModified.state state from json_serde_table; 
Query ID = root_20170412170606_a674d31b-31d7-477b-b9ff-3ebd76636cf8 
Total jobs = 1 
Launching Job 1 out of 1 
Number of reduce tasks is set to 0 since there's no reduce operator 
Starting Job = job_1491484583384_0018, Tracking URL = http://mac127:8088/proxy/application_1491484583384_0018/ 
Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1491484583384_0018 
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 
2017-04-12 17:06:44,990 Stage-1 map = 0%, reduce = 0% 
2017-04-12 17:06:53,361 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.8 sec 
MapReduce Total cumulative CPU time: 1 seconds 800 msec 
Ended Job = job_1491484583384_0018 
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1 Cumulative CPU: 1.8 sec HDFS Read: 4891 HDFS Write: 50 SUCCESS 
Total MapReduce CPU Time Spent: 1 seconds 800 msec 
OK 
1  ns  Mh-3412 2000-04-01T00:00:00.171Z  3.0  Florida 
Time taken: 19.745 seconds, Fetched: 1 row(s)
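The schema above hard-codes versionModified, so it does not cover records where that key is anotherCriteria, BOX, GAP and so on. One possible variation, offered as an untested sketch, is to declare each element of Version as a map, so the wrapper key becomes data rather than part of the schema. This assumes the SerDe can read a JSON object with arbitrary keys into a Hive map column, and that every element of Version holds exactly one wrapper key. The names json_serde_map_table, box_records and gap_records are hypothetical.

```sql
-- Each element of Version is read as: wrapper key -> (attribute name -> string value).
CREATE TABLE json_serde_map_table (
  Rtype struct<ver:int, os:string, type:string, vehicle:string,
               MOD:struct<Version:array<map<string, map<string,string>>>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

LOAD DATA LOCAL INPATH '/root/json.txt' INTO TABLE json_serde_map_table;

-- One output row per element of Version. map_keys/map_values expose the single
-- entry of each element; a missing attribute (e.g. no 'founder') yields NULL.
SELECT
  Rtype.ver                   AS ver,
  Rtype.type                  AS type,
  Rtype.vehicle               AS vehicle,
  map_keys(v)[0]              AS wrapper_key,
  map_values(v)[0]['ts']      AS ts,
  map_values(v)[0]['founder'] AS founder,
  map_values(v)[0]['state']   AS state
FROM json_serde_map_table
LATERAL VIEW explode(Rtype.MOD.Version) t AS v;

-- Hypothetical target tables for routing by wrapper key.
CREATE TABLE box_records (ver int, vehicle string, ts string);
CREATE TABLE gap_records (ver int, vehicle string, ts string);

-- Multi-table insert: the source is scanned once and each row is routed
-- to the table whose WHERE clause matches its wrapper key.
FROM (
  SELECT Rtype.ver           AS ver,
         Rtype.vehicle       AS vehicle,
         map_keys(v)[0]      AS wrapper_key,
         map_values(v)[0]    AS attrs
  FROM json_serde_map_table
  LATERAL VIEW explode(Rtype.MOD.Version) t AS v
) r
INSERT INTO TABLE box_records
  SELECT r.ver, r.vehicle, r.attrs['ts'] WHERE r.wrapper_key = 'BOX'
INSERT INTO TABLE gap_records
  SELECT r.ver, r.vehicle, r.attrs['ts'] WHERE r.wrapper_key = 'GAP';
```

Using a map for the inner attributes as well means records with field names such as XYZ, GHT or FAH load without a schema change, at the cost of every attribute being a string.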

The colon (:) after the "versionModified" field is missing in your JSON data –
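If some records in the file really are malformed like this, the OpenX SerDe documents a table property for tolerating them instead of failing the query. A small sketch, assuming the SerDe version in use supports the property and the table is the json_serde_table created in the answer:

```sql
-- Ask the SerDe to tolerate rows it cannot parse rather than abort the query.
ALTER TABLE json_serde_table
  SET SERDEPROPERTIES ("ignore.malformed.json" = "true");

-- As I understand the SerDe's README, unparseable rows then come back with
-- NULL columns, so this counts how many input lines were skipped that way.
SELECT count(*) AS malformed_rows
FROM json_serde_table
WHERE Rtype IS NULL;
```

This only hides the bad rows. Fixing the missing colon in the source data is still needed to get their fields out.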

Why download a JSON SerDe when one is already part of the distribution? –

You are right, it can be done with org.apache.hadoop.hive.contrib.serde2.JsonSerde. I just tried it and it gave me the same result. I will edit my answer. Thanks –
