Using two columns as a key in Spark with Python
Problem description:
I have a CSV file with four columns and many rows:
Date(MM/DD/YY) Arr_Dep Dom_Int Num_Fl
01/01/15 0:00 Arrival Domestic 357
03/01/15 0:00 Arrival International 269
06/01/15 0:00 Departure Domestic 82
08/01/15 0:00 Departure International 5
05/01/16 0:00 Arrival Domestic 44
06/01/16 0:00 Arrival Domestic 57
07/01/16 0:00 Departure International 51
08/01/16 0:00 Departure International 40
08/01/17 0:00 Arrival Domestic 1996
10/01/17 0:00 Departure International 21
I have to find the average number of flights per month in a given year, depending on whether the flight is an arrival or a departure. So for the above input I expect the output below (averages rounded to the nearest integer, e.g. 2015 departures = (82 + 5) / 2 ≈ 44):
2015, arrival, 313
2015, departure, 44
2016, arrival, 51
2016, departure, 46
2017, arrival, 1996
2017, departure, 21
The problem I'm facing is how to include two columns in my key in the map function, i.e. both the Date and Arr_Dep columns, and then reduce by that key to get the average. I have written the script below so far and am not sure how to continue:
from pyspark import SparkContext
from operator import add
import sys

sc = SparkContext(appName="example")
input_file = sys.argv[1]
lines = sc.textFile(input_file)
# (year, Num_Fl, Arr_Dep) for each comma-separated line;
# the year is the last two digits of the date field
first = lines.map(lambda x: ((x.split(",")[0].split(" ")[0][6:]).encode('ascii', 'ignore'),
                             int(x.split(",")[-1]),
                             x.split(",")[1]))
# Arr_Dep is the third element of the tuple
second = first.filter(lambda x: "Arrival" in x[2] or "Departure" in x[2])
third = second.map(lambda x: (x[0], x[1]))
result = third.reduceByKey("Not sure how to calculate average")
output = result.collect()
for v in sorted(output, key=lambda x: x[0]):
    print '%s, %s' % (v[0], v[1])
I'm not sure about the script above. I'm new to Spark and Python. Any idea how to proceed?
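For reference, here is a minimal sketch of one way to finish the RDD approach in the question: use the (year, Arr_Dep) pair itself as the key, reduce (sum, count) pairs, and divide at the end. Comma-separated input with no header row is an assumption carried over from the script above, and the half-up rounding is only there to reproduce the sample output:

from pyspark import SparkContext
import sys

sc = SparkContext(appName="avg_flights")
lines = sc.textFile(sys.argv[1])

# Build ((year, Arr_Dep), Num_Fl) pairs; comma-separated fields and
# no header row are assumed, as in the script above.
pairs = lines.map(lambda line: line.split(",")) \
             .filter(lambda f: f[1] in ("Arrival", "Departure")) \
             .map(lambda f: ((f[0].split(" ")[0][6:], f[1]), int(f[-1])))

# Reduce to (sum, count) per key, then divide; rounding half-up
# reproduces the expected output above (e.g. 43.5 -> 44).
averages = pairs.mapValues(lambda n: (n, 1)) \
                .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
                .mapValues(lambda s_c: (s_c[0] + s_c[1] // 2) // s_c[1])

for (year, arr_dep), avg_n in sorted(averages.collect()):
    print('%s, %s, %d' % (year, arr_dep, avg_n))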
Answer:
The best approach is to use the SQL (DataFrame) API:
from pyspark.sql.functions import *
df = spark.read.options(inferSchema=True, header=True).csv(input_file)
df\
    .groupBy(year(to_date("Date(MM/DD/YY)", "MM/dd/yy H:mm")).alias("year"), "Arr_Dep")\
    .avg("Num_Fl")
But how do I calculate the average with this? Could you explain in a bit more detail? – Alex
I think he is using the avg function to compute the average! –
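To expand on the comments: avg("Num_Fl") is what computes the mean within each (year, Arr_Dep) group, so nothing else is needed beyond displaying or collecting the result. A minimal runnable sketch, assuming a SparkSession named spark as in the answer and the same column names; the rounding and sort are optional additions to match the expected output:

import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.options(inferSchema=True, header=True).csv(sys.argv[1])

result = (df
    .groupBy(F.year(F.to_date("Date(MM/DD/YY)", "MM/dd/yy H:mm")).alias("year"),
             "Arr_Dep")
    # F.avg computes the per-group mean of Num_Fl; no manual sum/count needed
    .agg(F.round(F.avg("Num_Fl")).alias("avg_flights"))
    .orderBy("year", "Arr_Dep"))

result.show()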