Spark partitioning: creates RDD partitions but not Hive partitions
Question:
This is a follow-up to Save Spark dataframe as dynamic partitioned table in Hive. I tried to use the suggestions from the answers there, but could not make them work in Spark 1.6.1.
I am trying to create partitions programmatically from a `DataFrame`. Here is the relevant code (adapted from a Spark test):
hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
// hc.setConf("hive.exec.dynamic.partition", "true")
// hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hc.sql("create database if not exists tmp")
hc.sql("drop table if exists tmp.partitiontest1")
Seq(2012 -> "a").toDF("year", "val")
.write
.partitionBy("year")
.mode(SaveMode.Append)
.saveAsTable("tmp.partitiontest1")
hc.sql("show partitions tmp.partitiontest1").show
The full file is here: https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
The partition files are created fine on the file system, but Hive complains that the table is not partitioned:
======================
HIVE FAILURE OUTPUT
======================
SET hive.support.sql11.reserved.keywords=false
SET hive.metastore.warehouse.dir=tmp/tests
OK
OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a partitioned table
======================
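For what it's worth, the on-disk layout does look like a partitioned table. Under the warehouse directory configured above, the writer produces something like this (a sketch; the exact file name and compression suffix depend on the Spark version and the output format, parquet by default):

tmp/tests/tmp.db/partitiontest1/year=2012/part-r-00000-<uuid>.gz.parquet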
The root cause appears to be that org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable always creates the table with an empty list of partition columns.
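One way to confirm this from the shell is to inspect what actually landed in the metastore (a diagnostic sketch; the spark.sql.sources.* property names are Spark internals and vary by version):

// Hive's view of the table: no partition columns are listed
hc.sql("describe formatted tmp.partitiontest1").collect().foreach(println)
// Spark datasource tables keep the schema, including partition columns,
// in spark.sql.sources.* table properties rather than in Hive's partition
// metadata, which is why Hive sees an unpartitioned table
hc.sql("show tblproperties tmp.partitiontest1").show(100, false)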
Any help in moving this forward is appreciated.
EDIT: also created SPARK-14927.
Answer:
I found a workaround: if you pre-create the table, then saveAsTable() won't mess it up. So the following works:
hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
// hc.setConf("hive.exec.dynamic.partition", "true")
// hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hc.sql("create database if not exists tmp")
hc.sql("drop table if exists tmp.partitiontest1")
// Added line:
hc.sql("create table tmp.partitiontest1(val string) partitioned by (year int)")
Seq(2012 -> "a").toDF("year", "val")
.write
.partitionBy("year")
.mode(SaveMode.Append)
.saveAsTable("tmp.partitiontest1")
hc.sql("show partitions tmp.partitiontest1").show
This workaround works in 1.6.1 but not in 1.5.1.
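As a sanity check, appending a second year to the pre-created table should show up as a new Hive partition (a sketch building on the code above; if Hive rejects the dynamic insert, the two commented-out dynamic-partition settings may need to be enabled):

Seq(2013 -> "b").toDF("year", "val")
  .write
  .partitionBy("year")
  .mode(SaveMode.Append)
  .saveAsTable("tmp.partitiontest1")
// should now list both year=2012 and year=2013
hc.sql("show partitions tmp.partitiontest1").show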