HIVE内关联顺序影响返回结果?求助。。。
HIVE内关联顺序影响返回结果?求助。。。
执行代码
1.idl_dim_ssj_user_device_pool 作为主表
select count(pool.udid) from dt.idl_dim_ssj_user_device_pool pool
inner join dt.idl_dim_ssj_user usr
on pool.fname=usr.fname
;
1.1 执行计划如下:
| STAGE DEPENDENCIES: |
| Stage-6 is a root stage , consists of Stage-1 |
| Stage-1 |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-6 |
| Conditional Operator |
| |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: pool |
| Statistics: Num rows: 76926919 Data size: 769269190 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: fname is not null (type: boolean) |
| Statistics: Num rows: 38463460 Data size: 384634600 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: fname (type: string) |
| sort order: + |
| Map-reduce partition columns: fname (type: string) |
| Statistics: Num rows: 38463460 Data size: 384634600 Basic stats: COMPLETE Column stats: NONE |
| value expressions: udid (type: string) |
| TableScan |
| alias: usr |
| Statistics: Num rows: 31374455 Data size: 470616825 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: fname is not null (type: boolean) |
| Statistics: Num rows: 15687228 Data size: 235308419 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: fname (type: string) |
| sort order: + |
| Map-reduce partition columns: fname (type: string) |
| Statistics: Num rows: 15687228 Data size: 235308419 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 fname (type: string) |
| 1 fname (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 42309806 Data size: 423098069 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(_col0) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: true |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
|
1.2 执行日志和结果:
INFO : Compiling command(queryId=hive_20190111095656_9cf4a75d-2b3c-4f83-8b0b-01e7cb00b392): select count(pool.udid) from dt.idl_dim_ssj_user_device_pool pool
inner join dt.idl_dim_ssj_user usr
on pool.fname=usr.fname
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190111095656_9cf4a75d-2b3c-4f83-8b0b-01e7cb00b392); Time taken: 0.336 seconds
INFO : Executing command(queryId=hive_20190111095656_9cf4a75d-2b3c-4f83-8b0b-01e7cb00b392): select count(pool.udid) from dt.idl_dim_ssj_user_device_pool pool
inner join dt.idl_dim_ssj_user usr
on pool.fname=usr.fname
INFO : Query ID = hive_20190111095656_9cf4a75d-2b3c-4f83-8b0b-01e7cb00b392
INFO : Total jobs = 2
INFO : Starting task [Stage-6:CONDITIONAL] in serial mode
INFO : Stage-1 is selected by condition resolver.
INFO : Launching Job 1 out of 2
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Estimated from input data size: 51
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:75
INFO : Submitting tokens for job: job_1541645032015_2447580
......
INFO : MapReduce Total cumulative CPU time: 47 seconds 580 msec
INFO : Ended Job = job_1541645032015_2447599
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 75 Reduce: 51 Cumulative CPU: 1180.49 sec HDFS Read: 1160032550 HDFS Write: 10965 SUCCESS
INFO : Stage-Stage-2: Map: 34 Reduce: 1 Cumulative CPU: 47.58 sec HDFS Read: 86838 HDFS Write: 2 SUCCESS
INFO : Total MapReduce CPU Time Spent: 20 minutes 28 seconds 70 msec
INFO : Completed executing command(queryId=hive_20190111095656_9cf4a75d-2b3c-4f83-8b0b-01e7cb00b392); Time taken: 121.192 seconds
INFO : OK
+------+--+
| _c0 |
+------+--+
| 0 |
+------+--+
1 row selected (121.668 seconds)
2.idl_dim_ssj_user作为主表
select count(pool.udid) from dt.idl_dim_ssj_user usr
inner join dt.idl_dim_ssj_user_device_pool pool
on usr.fname=pool.fname
;
2.1 执行计划:
| STAGE DEPENDENCIES: |
| Stage-6 is a root stage , consists of Stage-1 |
| Stage-1 |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-6 |
| Conditional Operator |
| |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: usr |
| Statistics: Num rows: 31374455 Data size: 470616825 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: fname is not null (type: boolean) |
| Statistics: Num rows: 15687228 Data size: 235308419 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: fname (type: string) |
| sort order: + |
| Map-reduce partition columns: fname (type: string) |
| Statistics: Num rows: 15687228 Data size: 235308419 Basic stats: COMPLETE Column stats: NONE |
| TableScan |
| alias: pool |
| Statistics: Num rows: 76926919 Data size: 769269190 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: fname is not null (type: boolean) |
| Statistics: Num rows: 38463460 Data size: 384634600 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: fname (type: string) |
| sort order: + |
| Map-reduce partition columns: fname (type: string) |
| Statistics: Num rows: 38463460 Data size: 384634600 Basic stats: COMPLETE Column stats: NONE |
| value expressions: udid (type: string) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 fname (type: string) |
| 1 fname (type: string) |
| outputColumnNames: _col18 |
| Statistics: Num rows: 42309806 Data size: 423098069 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(_col18) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: true |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
2.2 执行日志和结果:
INFO : Compiling command(queryId=hive_20190111142020_c6710c56-cd4d-4919-8cea-b06ed2218502): select count(pool.udid) from dt.idl_dim_ssj_user usr
inner join dt.idl_dim_ssj_user_device_pool pool
on usr.fname=pool.fname
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190111142020_c6710c56-cd4d-4919-8cea-b06ed2218502); Time taken: 1.286 seconds
INFO : Executing command(queryId=hive_20190111142020_c6710c56-cd4d-4919-8cea-b06ed2218502): select count(pool.udid) from dt.idl_dim_ssj_user usr
inner join dt.idl_dim_ssj_user_device_pool pool
on usr.fname=pool.fname
INFO : Query ID = hive_20190111142020_c6710c56-cd4d-4919-8cea-b06ed2218502
INFO : Total jobs = 2
INFO : Starting task [Stage-6:CONDITIONAL] in serial mode
INFO : Stage-1 is selected by condition resolver.
INFO : Launching Job 1 out of 2
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Estimated from input data size: 51
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:75
INFO : Submitting tokens for job: job_1541645032015_2453904
INFO : Hadoop job information for Stage-1: number of mappers: 75; number of reducers: 51
......
INFO : MapReduce Total cumulative CPU time: 20 minutes 50 seconds 230 msec
INFO : Ended Job = job_1541645032015_2453904
INFO : Launching Job 2 out of 2
INFO : Starting task [Stage-2:MAPRED] in serial mode
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:34
INFO : Submitting tokens for job: job_1541645032015_2453919
INFO : Hadoop job information for Stage-2: number of mappers: 34; number of reducers: 1
......
INFO : MapReduce Total cumulative CPU time: 42 seconds 330 msec
INFO : Ended Job = job_1541645032015_2453919
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 75 Reduce: 51 Cumulative CPU: 1250.23 sec HDFS Read: 4787835672 HDFS Write: 11118 SUCCESS
INFO : Stage-Stage-2: Map: 34 Reduce: 1 Cumulative CPU: 42.33 sec HDFS Read: 86991 HDFS Write: 9 SUCCESS
INFO : Total MapReduce CPU Time Spent: 21 minutes 32 seconds 560 msec
INFO : Completed executing command(queryId=hive_20190111142020_c6710c56-cd4d-4919-8cea-b06ed2218502); Time taken: 112.435 seconds
INFO : OK
+-----------+--+
| _c0 |
+-----------+--+
| 37683117 |
+-----------+--+
1 row selected (113.862 seconds)
3. 疑问???
内关联,两个表先后顺序不同,最多是影响MP执行效率,但不应该执行的返回结果不同啊!求大佬帮忙看下,谢谢
另外附上两边执行计划对比差异图:上图是对比差异 也看不出有啥异常。。。。求助,感谢