转换火花数据集在斯卡拉
问题描述:
自定义对象我想叫API,它预计Employee对象,如下图所示:转换火花数据集在斯卡拉
public class EmployeeElements {
private Set<Long> eIds;
private Map<Long, List<EmployeeDetails>> employeeDetails;
private Map<Long, List<Address>> address;
}
我想创建格式对象:Map[Long,List[CustomObject]]
了数据集
由于输入我们将有以下数据框。
EmployeeDF
+---------+-------------------+
|emp_Id |hired_Date |
+---------+-------------------+
|1387779 |2016-09-27 19:47:28|
|1387781 |2016-09-27 19:47:28|
|1387780 |2016-09-27 19:47:28|
|1387780 |2016-09-27 19:47:28|
|1387777 |2016-09-27 19:47:28|
|1387778 |2016-09-27 19:47:28|
+---------+-------------------+
EmployeeDetailsDF:
+---------+---------+--------+----------+-------------------+
|emp_Id |firstname|lastname|gendercode|dateofbirth |
+---------+---------+--------+----------+-------------------+
|1387777 |Jon |Snow |F |1985-01-01 00:00:00|
|1387778 |Jon |Snow |M |1985-01-01 00:00:00|
|1387779 |Jon |Snow |F |1985-01-01 00:00:00|
|1387780 |Jon |Snow |F |1985-01-01 00:00:00|
|1387781 |Jon |Snow |F |1985-01-01 00:00:00|
+---------+---------+--------+----------+-------------------+
AddressDf:
+---------+------------------+-----------------+-------------------+
|emp_Id |patient_address_id|Country |joined_Date |
+---------+------------------+-----------------+-------------------+
|1387779 |2435146 |USA |2016-09-27 19:47:28|
|1387781 |2435148 |AUS |2016-09-27 19:47:28|
|1387780 |2435147 |USA |2016-09-27 19:47:28|
|1387780 |2435149 |UK |2016-09-27 19:47:28|
|1387777 |2435144 |AUS |2016-09-27 19:47:28|
|1387778 |2435145 |USA |2016-09-27 19:47:28|
+---------+------------------+-----------------+-------------------+
EmployeeDetailsDF
将获得参加与EmployeeDF
。 同样AddressDf
也将加入EmployeeDF
。
Now out of that join Dataframe I wanted to create `Map[Long,List[CustomObject]]`, for example: couple of rows of joindDataFrame of AddressDf and EmployeeDF
(1387777,List([1387777,2435144,AUS]))
(1387780,List([1387780,2435147,USA], [1387780,2435149,UK]))
Exception:
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.GroupedDataset cannot be cast to scala.collection.immutable.Map
when I am trying to cast groupAddressDs to Map:
//empAddressDs is the joined Dataset of AddressDf and EmployeeDF
val groupAddressDs:GroupedDataset[Long, Address] = empAddressDs.groupBy { x => x.emp_Id }
val addressMap = groupAddressDs.asInstanceOf[Map[String, List[Procedure]]]
但我没有得到同样的任何解决方案,我试图与GroupedDataset
但最终我们无法施展它映射对象它给人类转换异常, 然后用pairRdd
尝试过,但后来我我无法将rdd的行再次转换为我的自定义对象。
我必须处理EmployeeElements
对象中的数百万个数据和数百个自定义对象。
答
Instead of doing asInstanceOf on group data set execute mapGroups function on groupDataset
Foe e.g.
groupAddressDs.mapGroups { case (k, xs) => {
var employeeElements = new EmployeeElements()
employeeElements.setAddress(mapAsJavaMap(Map[Long,List[Address]](k -> xs.toList)).asInstanceOf[Long, List[Address]]])
employeeElements.setEIds(k)
(k, xs.toSeq)
}
}
您是否试图将百万数据编组为一个EmployeeElements对象?你最终需要用EmployeeElements做什么? – josephpconley
@josephpconley:我想将EmployeeElements对象传递给外部API。 – Kalpesh