Spark - Programmatically creating a schema with different data types

Question:

I have a dataset of 7-8 fields whose types are String, Int and Float.

I tried to create the schema programmatically like this:

val schema = StructType(header.split(",").map(column => StructField(column, StringType, true)))

Then I map each line to a Row with the matching types:

val dataRdd = datafile.filter(x => x!=header).map(x => x.split(",")).map(col => Row(col(0).trim, col(1).toInt, col(2).toFloat, col(3), col(4) ,col(5), col(6), col(7), col(8)))

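However, after creating the DataFrame, calling DF.show() fails with an error on the integer field.

So how do I create such a schema when the dataset contains multiple data types?

Answer

The problem in your code is that you assign StringType to every field, while the Rows you build contain Int and Float values.

If the header contains only the field names, you cannot guess the types from it.

Let's assume the header string looks like this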

val header = "field1:Int,field2:Double,field3:String"

Then the code should be

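import org.apache.spark.sql.types._

// Map the type name after the colon to a Spark SQL data type.
def inferType(field: String) = field.split(":")(1) match {
    case "Int" => IntegerType
    case "Double" => DoubleType
    case "String" => StringType
    case _ => StringType // unknown type names fall back to string
}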

val schema = StructType(header.split(",").map(column => StructField(column, inferType(column), true)))

Printing the schema then gives

root 
|-- field1:Int: integer (nullable = true) 
|-- field2:Double: double (nullable = true) 
|-- field3:String: string (nullable = true)
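Note that the field names in that output still carry the ":Int" style suffix, because the whole header token is used as the column name. A minimal sketch of a variant that splits each token into name and type, assuming the same "name:Type" header format as above (toStructField is a hypothetical helper name, not a Spark API):

```scala
import org.apache.spark.sql.types._

// Assumed header format: comma-separated "name:Type" tokens.
val header = "field1:Int,field2:Double,field3:String"

// Split each token once so the name and the type are handled separately.
def toStructField(token: String): StructField = {
  val parts = token.split(":", 2).map(_.trim)
  val dataType = parts.lift(1) match {
    case Some("Int")    => IntegerType
    case Some("Double") => DoubleType
    case _              => StringType // "String", unknown or missing type names
  }
  StructField(parts(0), dataType, nullable = true)
}

val schema = StructType(header.split(",").map(toStructField))
schema.printTreeString()
```

The resulting columns should be named field1, field2 and field3, without the type suffix.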

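On the other hand, if what you need is a DataFrame built from a text file, I suggest creating the DataFrame directly from the file itself. There is little point in creating it from an RDD.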

val fileReader = spark.read.format("com.databricks.spark.csv") 
    .option("mode", "DROPMALFORMED") 
    .option("header", "true") 
    .option("inferschema", "true") 
    .option("delimiter", ",") 

val df = fileReader.load(PATH_TO_FILE)
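If you would rather not rely on inference, the reader also accepts an explicit schema. A rough sketch, assuming Spark 2.x or later with the built-in csv source; the local master, the temporary sample file and the column names (name, count, score) are made up for illustration:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// In spark-shell a session named `spark` already exists; it is built here
// only so the sketch is self-contained.
val spark = SparkSession.builder()
  .appName("explicit-schema-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical sample file, written to a temp location for the example.
val path = Files.createTempFile("sample", ".csv")
Files.write(path, "name,count,score\nabc,1,2.5\ndef,2,3.5\n".getBytes("UTF-8"))

// Explicit column types instead of inferSchema.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("count", IntegerType, nullable = true),
  StructField("score", FloatType, nullable = true)
))

val df = spark.read
  .option("header", "true")        // skip the header line
  .option("mode", "DROPMALFORMED") // drop lines that do not fit the schema
  .schema(schema)
  .csv(path.toString)

df.printSchema()
```

Supplying the schema up front also avoids the extra pass over the data that inference needs.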

But the header string is not like that; the data looks like 'dfs8768768,65,76.34,234,dfgdg,34.65 dfs8768768,65,76.34,234,dfgdg,34.65' – AJm

Then it is impossible to know the data types from the header, because it does not provide them. – elghoto

This is the exact data, including the header: 'auction,bid,bidtime,bidder,bidderrate,openbid,price,item,daystolive 8213034715,15,12.373,baman,3,12,20,book1,5 8213034725,65,21.33,thmpu,2,64,75,watch1,9 8213034735,85,23.3,lovekush,4,45,90,remote1,10 8213034745,115,44.44,jaipanee,3,111,130,s3phone,4' – AJm

Answer

Define the StructType first:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema1 = StructType(Array(
    StructField("AcutionId", StringType, true), 
    StructField("Bid", IntegerType, false), 
    StructField("BidTime", FloatType, false), 
    StructField("Bidder", StringType, true), 
    StructField("BidderRate", FloatType, false), 
    StructField("OpenBid", FloatType, false), 
    StructField("Price", FloatType, false), 
    StructField("Item", StringType, true), 
    StructField("DaystoLive", IntegerType, false) 
))

Then build each Row by converting every column to the type declared for it in the schema:

val dataRdd = datafile.filter(x => x!=header).map(x => x.split(",")) 
    .map(col => Row(
    col(0).trim, 
    col(1).trim.toInt, 
    col(2).trim.toFloat, 
    col(3).trim, 
    col(4).trim.toFloat, 
    col(5).trim.toFloat, 
    col(6).trim.toFloat, 
    col(7).trim, 
    col(8).trim.toInt) 
)

Then apply the schema to the RDD:

val auctionDF = spark.sqlContext.createDataFrame(dataRdd,schema1)
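One caveat with the mapping above: toInt and toFloat throw a NumberFormatException on the first malformed field, which fails the job once an action such as show() runs. A minimal sketch of a more defensive parser, assuming the nine-column layout of schema1 (parseAuctionLine is a hypothetical helper name, not part of Spark):

```scala
import scala.util.Try

// Converts one CSV line into typed values in the column order of schema1.
// Returns None when the column count is wrong or a number fails to parse.
def parseAuctionLine(line: String): Option[Seq[Any]] = {
  val col = line.split(",", -1).map(_.trim)
  if (col.length != 9) None
  else Try(Seq[Any](
    col(0),
    col(1).toInt,
    col(2).toFloat,
    col(3),
    col(4).toFloat,
    col(5).toFloat,
    col(6).toFloat,
    col(7),
    col(8).toInt
  )).toOption
}

val good = parseAuctionLine("8213034715,15,12.373,baman,3,12,20,book1,5")
val bad  = parseAuctionLine("8213034715,abc,12.373,baman,3,12,20,book1,5")
```

With Spark on the classpath, something like datafile.filter(_ != header).flatMap(line => parseAuctionLine(line)).map(Row.fromSeq) should give an RDD[Row] that can be passed to createDataFrame together with schema1. Malformed lines are dropped silently in this sketch, which may or may not be what you want.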
