In Spark, how do I convert a DataFrame with a SparseVector column to an RDD[Vector]?

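Problem description:

Following this example, I computed TF-IDF weights for some documents. Now I want to use RowMatrix to compute the similarity between the documents, but I cannot get the data into the right format. What I have is a DataFrame whose rows have two columns, of types (String, SparseVector). I need to convert it to an RDD[Vector], which I thought would be as simple as: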

features.map(row => row.getAs[SparseVector](1)).rdd()

But I get this error:

<console>:58: error: Unable to find encoder for type stored in a 
Dataset. Primitive types (Int, String, etc) and Product types (case 
classes) are supported by importing spark.implicits._ Support for 
serializing other types will be added in future releases.

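Importing spark.implicits._ makes no difference.

So what is going on here? I am surprised that Spark does not know how to encode its own vector data type.

Answer

Just convert to an RDD before calling map. Mapping over a Dataset requires an Encoder for the result type, whereas mapping over an RDD does not.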

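import org.apache.spark.ml.linalg._

// A one-row DataFrame whose second column holds a sparse vector
val df = Seq((1, Vectors.sparse(1, Array(), Array()))).toDF

// df.rdd is an RDD[Row], so this map needs no Encoder and yields an RDD[Vector]
df.rdd.map(row => row.getAs[Vector](1))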
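
Since the question's goal is a RowMatrix, here is a minimal sketch of the remaining steps. RowMatrix belongs to the older org.apache.spark.mllib package and takes mllib vectors, while a DataFrame built with the ml API holds org.apache.spark.ml.linalg vectors, so the sketch converts them with Vectors.fromML. The column names, sample data and local session settings are assumptions made for illustration, not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Assumed local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("df-to-rdd-vector")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the question's `features` DataFrame,
// with columns of types (String, SparseVector).
val features = Seq(
  ("doc1", MLVectors.sparse(3, Array(0, 2), Array(1.0, 2.0))),
  ("doc2", MLVectors.sparse(3, Array(1), Array(3.0)))
).toDF("id", "features")

// Drop to the RDD API first, so the map does not need a Dataset Encoder.
val mlRows: RDD[MLVector] = features.rdd.map(row => row.getAs[MLVector](1))

// Convert each ml vector to the mllib vector type that RowMatrix expects.
val rows: RDD[OldVector] = mlRows.map(v => OldVectors.fromML(v))

// One matrix row per document, one column per feature.
val matrix = new RowMatrix(rows)
println(s"${matrix.numRows()} x ${matrix.numCols()}")
```

Note that RowMatrix.columnSimilarities() compares columns, not rows, so with one document per row it returns similarities between features; getting document-to-document similarities this way would require the documents to be laid out as columns.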
