Big Data Lab: Implementing Word Count with Spark

I. Objectives

  1. Learn how to start Spark
  2. Upload a text file to HDFS
  3. Write a word-count program in the Scala shell

II. Procedure

  1. Understand the components of Spark (the driver program, the cluster manager, and the executors)


  2. Detailed steps

    1. Open a terminal and start Hadoop:

[email protected]:/usr/local/hadoop/sbin$ ./start-all.sh

    2. Start Spark:

[email protected]:/usr/local/spark/bin$ ./spark-shell

        If output like the following appears, Spark has started successfully:

18/08/29 20:09:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/29 20:09:26 WARN Utils: Your hostname, dblab-VirtualBox resolves to a loopback address: 127.0.1.1, but we couldn't find any external IP address!
18/08/29 20:09:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://127.0.1.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1535544589211).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

 

    3. Open a second terminal, create the file to be counted, and upload it to HDFS:

 

[email protected]:/usr/local/hadoop/bin$ vim a
[email protected]:/usr/local/hadoop/bin$ ./hadoop fs -mkdir /input
[email protected]:/usr/local/hadoop/bin$ ./hdfs dfs -put a /
[email protected]:/usr/local/hadoop/bin$ cat a
kjd,kjd,ASDF,sjdf,jsadf
klfgldf.fdgjkaj

        Note that the file is uploaded to the HDFS root, so it will be read back as /a.

    4. Return to the first terminal and read the file in the Scala shell:

scala> sc.textFile("hdfs://localhost:9000/a").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_).collect

     The result is as follows:

res1: Array[(String, Int)] = Array((sjdf,1), (klfgldf.fdgjkaj,1), (kjd,2), (ASDF,1), (jsadf,1))
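To see what each operator contributes, the same chain can be replayed on ordinary Scala collections. This is only an illustrative sketch that runs without Spark: the object name LocalWordCount and the hard-coded lines are assumptions for the example, and since plain collections have no reduceByKey, a groupBy followed by a sum stands in for it.

```scala
// A local sketch of the word-count pipeline using plain Scala collections
// instead of an RDD. Each step mirrors the spark-shell one-liner above.
object LocalWordCount {
  // The two lines of the sample file shown above.
  val lines: Seq[String] = Seq("kjd,kjd,ASDF,sjdf,jsadf", "klfgldf.fdgjkaj")

  val counts: Map[String, Int] = lines
    .flatMap(_.split(","))   // split each line into words, like the RDD flatMap
    .map((_, 1))             // pair each word with a count of 1
    .groupBy(_._1)           // group the pairs by word (stand-in for reduceByKey)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the 1s per word

  def main(args: Array[String]): Unit =
    println(counts)
}
```

The grouping step makes visible what reduceByKey does in one shot on Spark: it brings all pairs with the same key together before summing, which on a cluster is the step that causes a shuffle across nodes.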