CCA Spark and Hadoop Developer Exam (CCA175) Practice Questions Walkthrough, Part 4 (Final Installment: Three Questions)

NO.7 CORRECT TEXT

Problem Scenario 89: You have been given the patient data below in CSV format: patientID, name, dateOfBirth, lastVisitDate.

1001,Ah Teck,1991-12-31,2012-01-20

1002,Kumar,2011-10-29,2012-09-20

1003,Ali,2011-01-30,2012-10-21

Accomplish the following activities.

1. Find all the patients whose lastVisitDate is between the current time and '2012-09-15'.

2. Find all the patients who were born in 2011.

3. Find each patient's age.

4. List patients who last visited more than 60 days ago.

5. Select patients who are 18 years old or younger.

Answer:

See the explanation for the step-by-step solution and configuration.

Explanation:

Solution:

Step 1: Data preparation

hdfs dfs -mkdir sparksql3

hdfs dfs -put patients.csv sparksql3/
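To verify the upload (optional):

hdfs dfs -cat sparksql3/patients.csv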

For easier directory management, my setup here differs from the exam environment (when practicing, use whatever directory you like).


Step 2: Now in the spark-shell (start an interactive session with the spark-shell command)

// SQLContext entry point for working with structured data


val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.


import sqlContext.implicits._

// Import Spark SQL data types and Row.


import org.apache.spark.sql._

// load the data into a new RDD


val patients = sc.textFile("/user/cloudera/sparksql3/patients.csv")

// Return the first element in this RDD


patients.first()

//define the schema using a case class


case class Patient(patientid: Int, name: String, dateOfBirth: String, lastVisitDate: String)

// create an RDD


val patRDD = patients.map(_.split(",")).map(p => Patient(p(0).toInt,p(1),p(2),p(3)))

patRDD.first()

patRDD.count()

// change RDD of Product objects to a DataFrame


val patDF = patRDD.toDF()

// register the DataFrame as a temp table


patDF.registerTempTable("patients")

// Select data from table


val results = sqlContext.sql("SELECT * FROM patients")

// display dataframe in a tabular format


results.show()
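With the three sample rows, the output should look roughly like this (column names come from the Patient case class):

+---------+-------+-----------+-------------+
|patientid|   name|dateOfBirth|lastVisitDate|
+---------+-------+-----------+-------------+
|     1001|Ah Teck| 1991-12-31|   2012-01-20|
|     1002|  Kumar| 2011-10-29|   2012-09-20|
|     1003|    Ali| 2011-01-30|   2012-10-21|
+---------+-------+-----------+-------------+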

//Find all the patients whose lastVisitDate between current time and '2012-09-15'


val results = sqlContext.sql("SELECT * FROM patients WHERE TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate, 'yyyy-MM-dd') AS TIMESTAMP)) BETWEEN '2012-09-15' AND current_timestamp() ORDER BY lastVisitDate")

results.show()
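Since lastVisitDate is already a yyyy-MM-dd string, the same filter can be written more directly (a sketch, relying on Hive's TO_DATE accepting date-only strings). With the sample data, only Kumar (2012-09-20) and Ali (2012-10-21) fall in the range:

val results2 = sqlContext.sql("SELECT * FROM patients WHERE TO_DATE(lastVisitDate) BETWEEN '2012-09-15' AND current_date() ORDER BY lastVisitDate")

results2.show()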

//Find all the patients who were born in 2011


val results = sqlContext.sql("SELECT * FROM patients WHERE YEAR(TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP))) = 2011")

results.show()
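Because dateOfBirth is a plain yyyy-MM-dd string, a simpler equivalent is a prefix test on the year (with the sample data, this returns Kumar and Ali):

val results2 = sqlContext.sql("SELECT * FROM patients WHERE SUBSTR(dateOfBirth, 1, 4) = '2011'")

results2.show()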

//Find each patient's age


val results = sqlContext.sql("SELECT name, dateOfBirth, datediff(current_date(),TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP)))/365 AS age FROM patients")

results.show()
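Dividing by 365 yields a fractional age and ignores leap years. If whole years are wanted, here is a sketch using MONTHS_BETWEEN (assuming a Spark/Hive version that provides it, e.g. Spark 1.5+):

val ages = sqlContext.sql("SELECT name, dateOfBirth, FLOOR(MONTHS_BETWEEN(current_date(), TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP))) / 12) AS age FROM patients")

ages.show()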

//List patients who last visited more than 60 days ago


val results = sqlContext.sql("SELECT name, lastVisitDate FROM patients WHERE datediff(current_date(), TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate, 'yyyy-MM-dd') AS TIMESTAMP))) > 60")

results.show()
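Hive's DATEDIFF also accepts yyyy-MM-dd strings directly, so the UNIX_TIMESTAMP round trip can be skipped (a sketch):

val results2 = sqlContext.sql("SELECT name, lastVisitDate FROM patients WHERE DATEDIFF(current_date(), lastVisitDate) > 60")

results2.show()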

//Select patients 18 years old or younger


val results = sqlContext.sql("SELECT * FROM patients WHERE datediff(current_date(), TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP)))/365 <= 18")

results.show()

Alternatively:

val results = sqlContext.sql("SELECT * FROM patients WHERE CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP) >= DATE_SUB(current_timestamp(), 18*365)")

results.show()
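Note that 18*365 days undercounts 18 years slightly because it ignores leap days. Where ADD_MONTHS is available (Hive 1.1+/Spark 1.5+), an exact variant is:

val results2 = sqlContext.sql("SELECT * FROM patients WHERE TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP)) >= ADD_MONTHS(current_date(), -18*12)")

results2.show()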

The code and results are as follows:

[Screenshot: spark-shell code and query results]

When you start an interactive session with the spark-shell command, it prints the web UI address: http://manager:4040

[Screenshot: spark-shell startup output]

If you are interested, open http://manager:4040 in a browser and take a look.

[Screenshot: Spark web UI]

NO.8 CORRECT TEXT

Problem Scenario 40: You have been given the sample data below in a file called spark15/file1.txt.

3070811,1963,1096,,"US","CA",,1,

3022811,1963,1096,,"US","CA",,1,56

3033811,1963,1096,,"US","CA",,1,23

Below is the code snippet to process this file.

val field= sc.textFile("spark15/file1.txt")

val mapper = field.map(x=> A)

mapper.map(x => x.map(x=> {B})).collect

Please fill in A and B so it generates the final output below.

Array(Array(3070811, 1963, 1096, 0, "US", "CA", 0, 1, 0), Array(3022811, 1963, 1096, 0, "US", "CA", 0, 1, 56), Array(3033811, 1963, 1096, 0, "US", "CA", 0, 1, 23))

Answer:

See the explanation for the step-by-step solution and configuration.

Explanation:

Solution:

A. x.split(",")

B. if (x.isEmpty) 0 else x

Data preparation:

You can use HUE to create /user/cloudera/spark15/file1.txt online,

or use:

  1. vim file1.txt
  2. hdfs dfs -mkdir spark15
  3. hdfs dfs -put file1.txt spark15/

The code and results are as follows:

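Assembled end to end, a runnable sketch looks like this. One caveat about the answer above: Scala's split(",") (like Java's) drops trailing empty fields, so the first row would yield only 8 columns; split(",", -1) keeps the empty last column so that B can turn it into 0:

// load the file and split each line, keeping trailing empty fields
val field = sc.textFile("spark15/file1.txt")
val mapper = field.map(x => x.split(",", -1))

// replace every empty field with 0
mapper.map(x => x.map(x => if (x.isEmpty) 0 else x)).collect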

NO.9 CORRECT TEXT

Problem Scenario 46: You have been given the list below in Scala, recording (name, sex, cost) for each piece of work done.

List( ("Deeapak" , "male", 4000), ("Deepak" , "male", 2000), ("Deepika" , "female",

2000),("Deepak" , "female", 2000), ("Deepak" , "male", 1000) , ("Neeta" , "female", 2000))

Now write a Spark program that loads this list as an RDD and sums the cost for each combination of name and sex (as the key).

Answer:

See the explanation for the step-by-step solution and configuration.

Explanation:

Solution:

Step 1: Create an RDD from this list

val rdd=sc.parallelize(List(("Deeapak","male",4000),("Deepak","male",2000),("Deepika","female",2000),("Deepak","female",2000),("Deepak","male",1000),("Neeta","female",2000)))

Step 2: Convert this RDD into a pair RDD

val byKey = rdd.map({case (name,sex,cost) => (name,sex)->cost})

Step 3: Now group by key

val byKeyGrouped = byKey.groupByKey

Step 4: Now sum the cost for each group

val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
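The same result can be computed in one step with reduceByKey, which combines values on the map side instead of materializing each group:

val result = byKey.reduceByKey(_ + _).map{ case ((name, sex), total) => (name, sex, total) }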

Step 5: Save the results

result.repartition(1).saveAsTextFile("spark12/result.txt")


To view the saved output:
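For example, since repartition(1) produced a single part file (typically named part-00000):

hdfs dfs -cat spark12/result.txt/part-00000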
