Extracting fields from JSON using Hive
I have a problem in my Hive code. I want to extract JSON data using Hive. The following is a sample of the JSON format:
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"versionModified"{"machine":"123.dfer","founder":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
I want to get the following fields:
- ver
- type
- vehicle
- ts
- founder
- state
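The problem is that founder and state are inside the "Version" array. Can anyone help with how to get around this? Also, sometimes a different key may appear in place of versionModified.
For example, sometimes my data will look like this: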
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"anotherCriteria":{"engine":"123.dfer","developer":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
Adding some more sample data below:
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC"{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP"{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX"{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
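I need to put this data into different tables based on the key inside Version: if it is "BOX", put it in one table; if it is "GAP", put it in another table...
You can use a JSON SerDe to get all the fields.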
Just follow the steps below.
1. Download the JSON SerDe from http://www.congiu.net/hive-json-serde/1.3/
2. Add the JSON SerDe jar
hive> ADD jar /root/json-serde-1.3-jar-with-dependencies.jar;
Added [/root/json-serde-1.3-jar-with-dependencies.jar] to class path
Added resources: [/root/json-serde-1.3-jar-with-dependencies.jar]
3. Create the table
CREATE TABLE json_serde_table (
  Rtype struct<
    ver:int,
    os:string,
    type:string,
    vehicle:string,
    MOD:struct<
      Version:array<struct<
        versionModified:struct<machine:string,founder:string,state:string,fashion:string,cdc:string,dof:string,ts:string>
      >>
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
4. Load the JSON file into the table with the query below
hive> load data local inpath '/root/json.txt' INTO TABLE json_serde_table;
Loading data to table default.json_serde_table
Table default.json_serde_table stats: [numFiles=1, totalSize=234]
OK
Time taken: 0.877 seconds
5. Run the query to get the result
hive> select Rtype.ver ver,
             Rtype.type type,
             Rtype.vehicle vehicle,
             Rtype.MOD.version[0].versionModified.ts ts,
             Rtype.MOD.version[0].versionModified.founder founder,
             Rtype.MOD.version[0].versionModified.state state
      from json_serde_table;
Query ID = root_20170412170606_a674d31b-31d7-477b-b9ff-3ebd76636cf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1491484583384_0018, Tracking URL = http://mac127:8088/proxy/application_1491484583384_0018/
Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1491484583384_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-12 17:06:44,990 Stage-1 map = 0%, reduce = 0%
2017-04-12 17:06:53,361 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.8 sec
MapReduce Total cumulative CPU time: 1 seconds 800 msec
Ended Job = job_1491484583384_0018
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.8 sec HDFS Read: 4891 HDFS Write: 50 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 800 msec
OK
1 ns Mh-3412 2000-04-01T00:00:00.171Z 3.0 Florida
Time taken: 19.745 seconds, Fetched: 1 row(s)
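The schema above hard-codes versionModified, so it does not cover records where that key changes (anotherCriteria, BOX, GAP, ...). One possible variation, sketched here as an untested assumption, is to declare each element of the Version array as a map instead of a struct, so that the varying key becomes a map key. The table name json_serde_map_table is hypothetical, and the sketch assumes the openx SerDe maps nested JSON objects onto Hive map types and that the input is valid JSON (with the colon after the variable key).

```sql
-- Hypothetical table: each Version element is a map from the varying key
-- (versionModified, BOX, GAP, ...) to a map of its string attributes.
CREATE TABLE json_serde_map_table (
  Rtype struct<
    ver:string,
    os:string,
    type:string,
    vehicle:string,
    MOD:struct<Version:array<map<string,map<string,string>>>>
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

-- map_keys/map_values return arrays; each element here holds a single entry,
-- so index 0 picks the varying key and its attribute map.
SELECT
  Rtype.ver                                         AS ver,
  Rtype.type                                        AS rec_type,
  Rtype.vehicle                                     AS vehicle,
  map_keys(Rtype.MOD.Version[0])[0]                 AS version_key,
  map_values(Rtype.MOD.Version[0])[0]['ts']         AS ts,
  map_values(Rtype.MOD.Version[0])[0]['founder']    AS founder,
  map_values(Rtype.MOD.Version[0])[0]['state']      AS state
FROM json_serde_map_table;
```

Attributes that are absent in a given record (for example founder in the GAP sample) would simply come back as NULL, since a lookup on a missing map key returns NULL in Hive.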
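The colon (:) after the "versionModified" field is missing in your JSON data.
Why download a JSON SerDe when one is already part of the distribution?
You are right, it can be done with org.apache.hadoop.hive.contrib.serde2.JsonSerde. I just tried it and it gave me the same result. I will edit my answer. Thanks.
Refer to this for using get_json_object in Hive: http://stackoverflow.com/questions/24447428/parse-json-arrays-using-hive
Please show your table schema.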
Don't mix questions together. For the INSERT issue –
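As a rough illustration of the get_json_object suggestion, combined with the requirement to route records into different tables by the key inside Version, the following sketch keeps each JSON document as a raw string and uses Hive's multi-table insert. The tables json_raw, box_table and gap_table and their columns are hypothetical, and the approach assumes the JSON is valid (get_json_object returns NULL for malformed input, such as the samples missing the colon) and that get_json_object accepts a path built at run time with concat, which has not been verified against every Hive version.

```sql
-- Hypothetical staging table: one raw JSON document per line.
CREATE TABLE json_raw (line string);
LOAD DATA LOCAL INPATH '/root/json.txt' INTO TABLE json_raw;

-- Hypothetical target tables, one per Version key.
CREATE TABLE box_table (ver string, rec_type string, vehicle string, ts string);
CREATE TABLE gap_table (ver string, rec_type string, vehicle string, ts string);

FROM (
  SELECT
    line,
    get_json_object(line, '$.Rtype.ver')     AS ver,
    get_json_object(line, '$.Rtype.type')    AS rec_type,
    get_json_object(line, '$.Rtype.vehicle') AS vehicle,
    -- The first array element comes back as a JSON string such as
    -- {"BOX":{...}}; the regex captures its single top-level key.
    regexp_extract(
      get_json_object(line, '$.Rtype.MOD.Version[0]'),
      '^\\{"([^"]+)"', 1)                    AS version_key
  FROM json_raw
) t
-- Multi-table insert: the source is scanned once and rows are routed by key.
INSERT INTO TABLE box_table
  SELECT ver, rec_type, vehicle,
         get_json_object(line, concat('$.Rtype.MOD.Version[0].', version_key, '.ts'))
  WHERE version_key = 'BOX'
INSERT INTO TABLE gap_table
  SELECT ver, rec_type, vehicle,
         get_json_object(line, concat('$.Rtype.MOD.Version[0].', version_key, '.ts'))
  WHERE version_key = 'GAP';
```

Further keys would need one more INSERT clause each; if the set of keys is large or unknown in advance, a single table partitioned by version_key may be easier to maintain than one table per key.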