Big Data Lab: Word Count with Spark
一. Objectives
- Learn how to start Spark
- Upload a text file to HDFS
- Write a word-count program in the Scala shell
二. Procedure
1. Understand the components of Spark
2. Detailed steps
1) Open a terminal and start Hadoop:
[email protected]:/usr/local/hadoop/sbin$ ./start-all.sh
2) Start Spark:
[email protected]:/usr/local/spark/bin$ ./spark-shell
If output like the following appears, Spark has started successfully:
18/08/29 20:09:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/29 20:09:26 WARN Utils: Your hostname, dblab-VirtualBox resolves to a loopback address: 127.0.1.1, but we couldn't find any external IP address!
18/08/29 20:09:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://127.0.1.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1535544589211).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
3) Open a second terminal, create the file to be counted, and upload it to HDFS:
[email protected]:/usr/local/spark/bin$ vim a
[email protected]:/usr/local/spark/bin$./hadoop fs -mkdir /input
[email protected]:/usr/local/hadoop/bin$ ./hdfs dfs -put a /
[email protected]:/usr/local/hadoop/bin$ cat a
kjd,kjd,ASDF,sjdf,jsadf
klfgldf.fdgjkaj
4) Return to the first terminal and read the file in the Scala shell (note that the URI must use the hdfs:// scheme):
scala> sc.textFile("hdfs://localhost:9000/a").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_).collect
The result is as follows:
res1: Array[(String, Int)] = Array((sjdf,1), (klfgldf.fdgjkaj,1), (kjd,2), (ASDF,1), (jsadf,1))
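The one-liner above chains three RDD transformations: flatMap splits each line into words, map pairs each word with a count of 1, and reduceByKey sums the counts per word. As a rough sketch of the same logic in plain Scala (no Spark required; groupBy plus a sum stands in for reduceByKey, and the sample lines are the contents of file `a`):

```scala
// Plain-Scala sketch of what the Spark word-count pipeline computes.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Same two lines as the uploaded file `a`
    val lines = Seq("kjd,kjd,ASDF,sjdf,jsadf", "klfgldf.fdgjkaj")

    val counts = lines
      .flatMap(_.split(","))   // split each line into words on commas
      .map(w => (w, 1))        // pair each word with a count of 1
      .groupBy(_._1)           // reduceByKey equivalent: group pairs by word...
      .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // ...and sum counts

    println(counts.toSeq.sortBy(-_._2).mkString(", "))
  }
}
```

Running this prints (kjd,2) first, matching the `res1` array: "kjd" appears twice in the first line, and every other token, including the whole second line (which contains no comma), appears once.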