Optimizing Apache Spark SQL queries
Problem description:
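I am facing very long wait times when running some SQL queries on Apache Spark. To keep the query simple, I run my computations sequentially: the output of each query is stored as a temporary table (.registerTempTable("TEMP")) so that it can be used in the next SQL query, and so on. But the queries take far too long, whereas the same work in "pure Python" code takes only a few minutes.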
sqlContext.sql("""
    SELECT PFMT.*,
           DICO_SITES.CodeAPI
    FROM PFMT
    INNER JOIN DICO_SITES
        ON PFMT.assembly_department = DICO_SITES.CodeProg
""").registerTempTable("PFMT_API_CODE")

sqlContext.sql("""
    SELECT GAMMA.*,
           (GAMMA.VOLUME * GAMMA.PRORATA) / 100 AS VOLUME_PER_SUPPLIER
    FROM
        (SELECT PFMT_API_CODE.*,
                SUPPLIERS_PROP.CODE_SITE_FOURNISSEUR,
                SUPPLIERS_PROP.PRORATA
         FROM PFMT_API_CODE
         INNER JOIN SUPPLIERS_PROP
             ON PFMT_API_CODE.reference = SUPPLIERS_PROP.PIE_NUMERO
             AND PFMT_API_CODE.project_code = SUPPLIERS_PROP.FAM_CODE
             AND PFMT_API_CODE.CodeAPI = SUPPLIERS_PROP.SITE_UTILISATION_FINAL) GAMMA
""").registerTempTable("TEMP_ONE")

sqlContext.sql("""
    SELECT TEMP_ONE.*,
           ADCP_DATA.*,
           CASE
               WHEN ADCP_DATA.WEEK <= weekofyear(from_unixtime(unix_timestamp())) + 24 THEN ADCP_DATA.CAPACITY_ST + ADCP_DATA.ADD_CAPACITY_ST
               WHEN ADCP_DATA.WEEK > weekofyear(from_unixtime(unix_timestamp())) + 24 THEN ADCP_DATA.CAPACITY_LT + ADCP_DATA.ADD_CAPACITY_LT
           END AS CAPACITY_REF
    FROM TEMP_ONE
    INNER JOIN ADCP_DATA
        ON TEMP_ONE.reference = ADCP_DATA.PART_NUMBER
        AND TEMP_ONE.CodeAPI = ADCP_DATA.API_CODE
        AND TEMP_ONE.project_code = ADCP_DATA.PROJECT_CODE
        AND TEMP_ONE.CODE_SITE_FOURNISSEUR = ADCP_DATA.SUPPLIER_SITE_CODE
        AND TEMP_ONE.WEEK_NUM = ADCP_DATA.WEEK_NUM
""").registerTempTable('TEMP_BIS')

sqlContext.sql("""
    SELECT TEMP_BIS.CSF_ID,
           TEMP_BIS.CF_ID,
           TEMP_BIS.CAPACITY_REF,
           TEMP_BIS.VOLUME_PER_SUPPLIER,
           CASE
               WHEN TEMP_BIS.CAPACITY_REF >= VOLUME_PER_SUPPLIER THEN 'CAPACITY_OK'
               WHEN TEMP_BIS.CAPACITY_REF < VOLUME_PER_SUPPLIER THEN 'CAPACITY_NOK'
           END AS CAPACITY_CHECK
    FROM TEMP_BIS
""").take(100)
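Could anyone point out the best practices (if there are any) for writing PySpark SQL queries for Spark? Does it make sense that the script runs much faster locally on my computer than on the Hadoop cluster? Thanks in advance.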
Rather than relying on pure SQL syntax, you can use Spark DataFrames and Spark's caching capabilities. – dumitru