Joining key-value pairs with key-Map pairs

Question:

I have this dataset:

(apple,1) 
(banana,4) 
(orange,3) 
(grape,2) 
(watermelon,2) 

and the other dataset is:

(apple,Map(Bob -> 1)) 
(banana,Map(Chris -> 1)) 
(orange,Map(John -> 1)) 
(grape,Map(Smith -> 1)) 
(watermelon,Map(Phil -> 1)) 

I am aiming to combine the two sets to get:

(apple,1,Map(Bob -> 1)) 
(banana,4,Map(Chris -> 1)) 
(orange,3,Map(John -> 1)) 
(grape,2,Map(Smith -> 1)) 
(watermelon,2,Map(Phil -> 1)) 

My code for the first dataset:

... 
val counts_firstDataset = words.map(word => 
(word.firstWord, 1)).reduceByKey{case (x, y) => x + y} 

The second dataset:

... 
val counts_secondDataset = secondSet.map(x => (x._1, 
x._2.toList.groupBy(identity).mapValues(_.size))) 

I tried the join method, val joined_data = counts_firstDataset.join(counts_secondDataset), but it did not work, because join requires a pair [K, V]. How would I solve this?

@philantrovert RDDs –

Got it. I should have read the question completely. – philantrovert

What data structure are you using to store these datasets? List, Set, etc.? – fcat

The easiest way is to convert both to DataFrames and then join them:

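import spark.implicits._

// First dataset: counts per key, as a two-column DataFrame
val counts_firstDataset = words
    .map(word => (word.firstWord, 1))
    .reduceByKey{case (x, y) => x + y}
    .toDF("type", "value")

// Second dataset: per-key Map of occurrence counts
val counts_secondDataset = secondSet
    .map(x => (x._1, x._2.toList.groupBy(identity).mapValues(_.size)))
    .toDF("type_2", "map")

// Join on the key columns, then drop the duplicated key column
val joined_data = counts_firstDataset
    .join(counts_secondDataset, $"type" === $"type_2")
    .drop("type_2")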

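If the first elements (i.e. the fruit names) of the two collections are in the same order, you can combine the two lists of tuples with zip and then use map to turn each pair into a single tuple, as shown below. Note that for RDDs, zip also requires both sides to have the same number of partitions and the same number of elements in each partition: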

counts_firstDataset.zip(counts_secondDataset) 
    .map(vk => (vk._1._1, vk._1._2, vk._2._2))
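
The zip approach above depends on both sides being in the same order. As a rough illustration of matching by key instead, here is a minimal sketch using plain Scala collections rather than Spark. The object name JoinByKeySketch, the helper joinByKey, and the sample values are hypothetical stand-ins for the data in the question, and the helper assumes keys are unique on the right-hand side. On pair RDDs, join returns (key, (leftValue, rightValue)), so the same flattening into a triple can be done there with a map over the nested tuple.

```scala
// Hypothetical in-memory stand-ins for the two datasets in the question.
object JoinByKeySketch {
  val counts: Seq[(String, Int)] = Seq(
    ("apple", 1), ("banana", 4), ("orange", 3), ("grape", 2), ("watermelon", 2))

  // Deliberately listed in a different order than `counts`.
  val owners: Seq[(String, Map[String, Int])] = Seq(
    ("watermelon", Map("Phil" -> 1)),
    ("grape", Map("Smith" -> 1)),
    ("orange", Map("John" -> 1)),
    ("banana", Map("Chris" -> 1)),
    ("apple", Map("Bob" -> 1)))

  // Inner join by key: keeps only keys present on both sides and
  // flattens each match into (key, leftValue, rightValue).
  // Assumption: keys are unique in `right` (later duplicates would win).
  def joinByKey[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, A, B)] = {
    val lookup: Map[K, B] = right.toMap
    left.collect { case (k, a) if lookup.contains(k) => (k, a, lookup(k)) }
  }

  def main(args: Array[String]): Unit =
    joinByKey(counts, owners).foreach(println)
}
```

The result follows the order of the left-hand collection, and keys missing from either side are dropped, which mirrors inner-join semantics.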