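How to check if an instance is in a dataframe in pyspark and get the occurrence from the dataframe?
Question:
I have an instance extracted from a dataframe with 3 different attributes: Atr1, Atr2 and Atr3. On the other hand, I have a dataframe with 4 attributes: Atr1, Atr2, Atr3 and Atr4, where attributes Atr1, Atr2 and Atr3 are the same as in the instance mentioned above. I have something like this: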
Instance:
[Row(Atr1=u'A', Atr2=u'B', Atr3=24)]
Dataframe:
+------+------+------+------+
| Atr1 | Atr2 | Atr3 | Atr4 |
+------+------+------+------+
| 'C' | 'B' | 21 | 'H' |
+------+------+------+------+
| 'D' | 'B' | 21 | 'J' |
+------+------+------+------+
| 'E' | 'B' | 21 | 'K' |
+------+------+------+------+
| 'A' | 'B' | 24 | 'I' |
+------+------+------+------+
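So, given the situation above, I want to check whether the dataframe contains a row with these values for attributes Atr1, Atr2 and Atr3 and, if it does, get the value of Atr4. In this case, 'I'.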
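Answer
Is this an acceptable answer?
df[(df['Atr1'] == row.Atr1) & (df['Atr2'] == row.Atr2) & (df['Atr3'] == row.Atr3)].select('Atr4').collect()
where row is your Row and df is the dataframe you mentioned. A bare .Atr4 at the end would only return a Column expression, so select('Atr4').collect() is used here to fetch the actual values.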
Answer
Hope this helps!
from pyspark.sql.types import Row
from pyspark.sql.functions import col

# sample data (sc and sqlContext are assumed to be already available, as in the pyspark shell)
row_list = [Row(Atr1=u'A', Atr2=u'B', Atr3=24),
            Row(Atr1=u'E', Atr2=u'F', Atr3=20)]
df = sc.parallelize([('C', 'B', 21, 'H'),
                     ('D', 'B', 21, 'J'),
                     ('E', 'B', 21, 'K'),
                     ('A', 'B', 24, 'I')]).\
    toDF(["Atr1", "Atr2", "Atr3", "Atr4"])

# the right join keeps every row of row_list; Atr4 is null when df has no matching row
search_df = df.join(sqlContext.createDataFrame(row_list), ["Atr1", "Atr2", "Atr3"], "right").\
    withColumn("rowItem_Exist", col('Atr4').isNotNull())
search_df.show()
Output is:
+----+----+----+----+-------------+
|Atr1|Atr2|Atr3|Atr4|rowItem_Exist|
+----+----+----+----+-------------+
| E| F| 20|null| false|
| A| B| 24| I| true|
+----+----+----+----+-------------+
@jartymcfly please don't forget to [mark it as the correct answer](https://stackoverflow.com/help/someone-answers) if it solves your problem :) – Prem