Setting Up a Development Environment: Spark 2.0.1 + Scala 2.11.8 + Hadoop 2.7 + IDEA
Note: matching the versions of Spark, Scala, and Hadoop to one another is very important!
1. Installing Spark 2.0.1
Download a prebuilt Spark package from https://archive.apache.org/dist/spark/spark-2.0.1/ and choose the package built for Hadoop 2.7 (spark-2.0.1-bin-hadoop2.7.tgz).
Since it is already compiled, simply extract it. Note: the extraction path must not contain spaces, for example extract to drive D: D:\spark-2.0.1-bin-hadoop2.7
After extraction, configure the environment variables and add Spark's bin directory to Path.
Then run spark-shell in cmd.
It will complain that the Hadoop environment is missing; installing the Hadoop environment is covered below.
2. Installing Scala 2.11.8
Download Scala from https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.msi
After downloading, run the installer; it automatically adds itself to the Path environment variable.
Run the scala -version command in cmd:
If the version is printed, the installation succeeded.
3. Setting up the Scala development environment in IDEA
You can refer to: http://dblab.xmu.edu.cn/blog/1327/
First install the Scala plugin for IDEA from the plugin marketplace.
I already have it installed here. Because the download can be slow, you can also fetch the plugin package directly from https://confluence.jetbrains.com/display/SCA/Scala+Plugin+for+IntelliJ+IDEA and install it locally.
After installation, restart IDEA and create a Scala project.
Choose the IDEA-based project type (suitable for beginners), click Next, and select the SDK. If no SDK is listed, click Create and you will see the previously installed Scala 2.11.8.
Then create a new Scala object, because only an object can hold a main method, as in the sketch below.
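A minimal sketch of such an object (the object name HelloScala is just an example):

import scala.Predef.println

object HelloScala {
  // In Scala only an object (not a class) provides a static-like entry point
  def main(args: Array[String]): Unit = {
    println("Scala environment is working")
  }
}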
The Scala environment is now set up; the last step is the Hadoop environment.
4. Setting up Hadoop 2.7.7
Download Hadoop from https://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.7.7/
On Windows you need administrator privileges to open the archive; otherwise extraction fails with an error saying the client does not hold the required privilege.
Then, in the environment variable settings, set HADOOP_HOME to the Hadoop extraction directory, as shown in the figure:
Then add the bin directory under it to the system PATH. Here that is C:\Hadoop\bin; if the HADOOP_HOME variable is already set, you can also use %HADOOP_HOME%\bin to refer to the bin folder.
Once these two system variables are set, open a new cmd window and run the spark-shell command. As shown in the figure, it reports:
java.io.IOException: Could not locate executable D:\hadoop-2.7.7\bin\winutils.exe in the Hadoop binaries.
Following the hint, go to https://github.com/steveloughran/winutils, pick the directory matching the Hadoop version you installed, open its bin directory, and find the winutils.exe file. To download it, click winutils.exe; on the page that opens there is a Download button in the upper right. Click it to download. As shown in the figure:
Copy the downloaded bin directory over Hadoop's bin directory. If it then complains that it cannot run in a 64-bit environment, download the matching 64-bit bin from http://www.pc6.com/softview/SoftView_578664.html and copy that bin directory over Hadoop's bin directory instead.
Create a new classpath environment variable and set it to D:\hadoop-2.7.7\bin\winutils.exe
Also copy bin\hadoop.dll to C:\Windows\System32.
Make sure the Spark installation directory is not marked hidden or read-only.
Because Hadoop does not pick up the JAVA_HOME configured in the system, you need to set the Java path for Hadoop itself: open D:\hadoop-2.7.7\etc\hadoop\hadoop-env.cmd and set
set JAVA_HOME="D:\hadoop-2.7.7\jdk1.8.0_181"
Note that the path in JAVA_HOME must not contain spaces, otherwise it will not be recognized (if your JDK lives under C:\Program Files, a common workaround is to use the 8.3 short name, e.g. C:\PROGRA~1\Java\...).
Modify D:\hadoop-2.7.7\etc\hadoop\core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Modify D:\hadoop-2.7.7\etc\hadoop\hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/hadoop/data/dfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/hadoop/data/dfs/datanode</value>
  </property>
</configuration>
With these settings, a hadoop folder is created next to the Hadoop installation directory (at the same level) to hold the namenode and datanode data.
Once the configuration is done, go to the sbin directory and run start-dfs.cmd
This starts the namenode and datanode, but the datanode may fail to start with:
java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: Cannot run program "D:\hadoop-2.7.7\bin\winutils.exe": CreateProcess error=740, The requested operation requires elevation.
Error 740 means the command needs elevated privileges, so open the cmd window as Administrator and run start-dfs.cmd again.
You can check the Hadoop version with the hadoop version command.
Open http://localhost:50070/dfshealth.html#tab-overview in a browser to view the HDFS overview page:
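If you also want to verify HDFS from code (for example from the Maven project created in the next section, which pulls in the Hadoop client libraries), here is a minimal sketch that lists the HDFS root directory. The object name HdfsCheck is just an example; the address matches the fs.defaultFS configured above.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsCheck {
  def main(args: Array[String]): Unit = {
    // Connect to the local NameNode configured in core-site.xml (fs.defaultFS)
    val fs = FileSystem.get(new URI("hdfs://localhost:9000"), new Configuration())
    // Print every entry under the HDFS root directory
    fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
    fs.close()
  }
}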
5. Creating a Spark project with Maven
Click Create New Project on the start screen to reach the dialog shown in the figure, and create a Maven project as shown.
After the project is created, right-click it and choose Add Framework Support....
Add the jars Spark needs to pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>spark</groupId>
  <artifactId>picc-spark</artifactId>
  <version>1.0-SNAPSHOT</version>

  <name>picc-spark</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <hadoopVersion>2.7.2</hadoopVersion>
    <sparkVersion>2.0.1</sparkVersion>
    <scala.version>2.11</scala.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>

    <!-- Hadoop start -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <!-- Hadoop -->

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
  </dependencies>

  <build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
      <plugins>
        <!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
        <plugin>
          <artifactId>maven-clean-plugin</artifactId>
          <version>3.1.0</version>
        </plugin>
        <!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
        <plugin>
          <artifactId>maven-resources-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.8.0</version>
        </plugin>
        <plugin>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.22.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-jar-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-install-plugin</artifactId>
          <version>2.5.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-deploy-plugin</artifactId>
          <version>2.8.2</version>
        </plugin>
        <!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
        <plugin>
          <artifactId>maven-site-plugin</artifactId>
          <version>3.7.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-project-info-reports-plugin</artifactId>
          <version>3.0.0</version>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>
Write the WordCount program; a sketch follows below.
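A minimal WordCount sketch in Scala, run locally from IDEA. The input path D:/data/words.txt is just an example; any text file, local or on hdfs://localhost:9000, will do.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside the IDE using all available cores
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Example input path; replace with your own file or an HDFS URI
    val lines = sc.textFile("D:/data/words.txt")
    val counts = lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // sum the counts per word

    counts.collect().foreach { case (word, n) => println(s"$word -> $n") }
    sc.stop()
  }
}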
It runs successfully!