Big Data with Spark -- Pitfall Notes (Part 1)
Preface: I plan to use this stretch of time to dig properly into the big data stack. I heard the legend of Google's "three treasures" (the GFS, MapReduce, and Bigtable papers) long ago, but only now have I gotten to work with the technologies derived from them. I am regularly worn out by problems whose source I cannot even locate, yet once something finally runs it is great fun. My skills are ordinary, so I leafed through the blogs of many experts, filled countless pits, and wrote these notes along the way.
1. Configuration and Tools
OS: Ubuntu 18.04 LTS
IDE: IntelliJ IDEA 2018.2.1 Community Edition
Build tool: the Maven bundled with IDEA is used for building and debugging
2. Goals
First, set up a working Scala build environment.
Second, write a WordCount program and debug it.
Third, package it into a jar and submit it to Spark.
3. Configuring the Maven Environment
① Create your own Maven project
You can see this is a brand-new Maven project that does not yet support Scala; in the next step we will modify pom.xml so that it meets the project's needs.
② Edit the pom file
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <spark.version>2.2.0</spark.version>
    <scala.version>2.11</scala.version>
    <hadoop.version>2.7.3</hadoop.version>
</properties>
Pin the versions of the three components first. These version numbers are a real minefield: I copied all kinds of configurations off the internet, and each one had different entries flagged in red, so I am recording here the pom settings that work for this project, and noting the errors below for future reference.
Before that, though, let's switch the Maven repository mirror to Aliyun, which makes downloads and updates much faster.
In Settings -> Maven -> User settings file, select our own configuration file; placing this file in the .m2 folder under the home directory is enough.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<settings>
    <localRepository>/home/zs/.m2/repository</localRepository><!-- change this to your own local Maven repository path -->
    <mirrors>
        <mirror>
            <id>alimaven</id>
            <name>aliyun maven</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <mirrorOf>central</mirrorOf>
        </mirror>
    </mirrors>
    <profiles>
        <profile>
            <id>nexus</id>
            <repositories>
                <repository>
                    <id>nexus</id>
                    <name>local private nexus</name>
                    <url>http://maven.oschina.net/content/groups/public/</url>
                    <releases>
                        <enabled>true</enabled>
                    </releases>
                    <snapshots>
                        <enabled>false</enabled>
                    </snapshots>
                </repository>
            </repositories>
            <pluginRepositories>
                <pluginRepository>
                    <id>nexus</id>
                    <name>local private nexus</name>
                    <url>http://maven.oschina.net/content/groups/public/</url>
                    <releases>
                        <enabled>true</enabled>
                    </releases>
                    <snapshots>
                        <enabled>false</enabled>
                    </snapshots>
                </pluginRepository>
            </pluginRepositories>
        </profile>
    </profiles>
</settings>
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
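If a standalone Maven installation is available on the command line, a quick way to confirm that the Aliyun mirror is actually being picked up is to print the effective settings and look for the alimaven mirror in the output:

mvn help:effective-settings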
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.0</version>
    </dependency>
    <!-- Everything from here down to spark-streaming-kafka was added later -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.11</artifactId>
        <version>1.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.zookeeper</groupId>
        <artifactId>zookeeper</artifactId>
        <version>3.4.10</version>
        <type>pom</type>
    </dependency>
    <dependency>
        <groupId>com.101tec</groupId>
        <artifactId>zkclient</artifactId>
        <version>0.10</version>
    </dependency>
    <dependency>
        <groupId>io.dropwizard.metrics</groupId>
        <artifactId>metrics-core</artifactId>
        <version>3.1.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.2.1</version>
    </dependency>
    <!-- The dependency below kept reporting an error I could not resolve, probably a wrong
         version number; it was replaced by the two dependencies above.
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.39</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
</dependencies>
The above are the dependencies I use. The stride here is fairly large: basically every dependency needed later on has already been pulled in. The next part covers packaging, mainly because of a problem I hit earlier: the Scala classes were not included in the package, only the Java classes, which led to a "cannot find main class" error at runtime. The fix is to add the pom configuration below.
<build>
    <plugins>
        <!-- This plugin handles the Java packaging into a jar-with-dependencies -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.5.5</version>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>com.test.Wordcount</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- This plugin compiles the Scala sources so they end up in the jar -->
        <plugin>
            <groupId>org.scala-tools</groupId>
            <artifactId>maven-scala-plugin</artifactId>
            <version>2.15.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Link to the full pom file: https://download.****.net/download/infent/10693499
Even so, we still cannot right-click to create a Scala object; we also need to register the Scala SDK as a global library. The steps are as follows:
1. Right-click the project name and choose Open Module Settings.
2. In the Project Settings list on the left, click Libraries.
3. Click + -> New Project Library and choose Scala SDK.
4. Add the folder of downloaded Scala jars.
5. Rename the java source folder to scala.
4. Writing the Wordcount Class
Not much to say about this step; WordCount examples are everywhere online.
package com.test

import org.apache.spark.{SparkConf, SparkContext}

object Wordcount {
  def main(args: Array[String]) {
    /* To use Spark, first build a SparkConf object:
     * it holds all the parameters the Spark application needs. */
    val conf = new SparkConf()
      .setMaster("local") // run locally when testing in IDEA; comment this out when packaging with Maven for submission
      .setAppName("testRdd")
    // Then create the SparkContext from the conf.
    val sc = new SparkContext(conf)
    // With default parameters this is equivalent to: val sc = new SparkContext("local", "testRdd")
    val data = sc.textFile("/home/zs/IdeaProjects/sparktest/src/main/resources/testfile")
    // "_" is a placeholder; flatMap splits every line of the input into words
    data.flatMap(_.split(" "))
      .map((_, 1))        // turn each word into a key-value pair: the word is the key, the value is 1
      .reduceByKey(_ + _) // merge the pairs that share the same key
      .collect()          // bring the distributed RDD back to the driver as a local Scala array
      .foreach(println)   // print each (word, count) pair
  }
}
5. Debugging and Packaging
If none of the settings above report errors, open the Maven Projects panel on the right and right-click package to build the jar; an equivalent command-line run is sketched below.
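For reference, the same packaging step can be run from a terminal in the project root, assuming Maven is installed on the command line; with the assembly plugin configured above, it produces a *-jar-with-dependencies.jar under target/:

mvn clean package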
This step had me stuck for a long time, and many of the blog posts I read did not help. Special thanks to the blogger behind the tutorial linked below; if you still run into problems, head over there, although for this project I have already folded those posts' instructions into this write-up.
https://blog.****.net/freecrystal_alex/article/details/78296851
6. Submitting the Jar to Spark
We can take this jar out and submit it to Spark, or simply right-click and run it in the IDE; if the right-click run works, we have succeeded. Here we take the local route; for the other deployment modes, set up a Spark cluster first. How to submit to Spark is covered in the reference posts below, followed by a minimal command sketch.
Reference posts:
https://blog.****.net/h8178/article/details/78323642?locationNum=6&fps=1
https://blog.****.net/englishsname/article/details/72864537
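The links above cover the details; what follows is only a minimal sketch of a local-mode submit, assuming Spark's bin directory is on the PATH. The jar file name here is a placeholder, so substitute whatever jar the Maven build actually produced under target/:

spark-submit \
  --class com.test.Wordcount \
  --master local[2] \
  target/sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar   # placeholder name; use the jar your build produced

Also remember that the code above hardcodes setMaster("local"); as noted in its comment, comment that line out before packaging if you want the --master flag passed to spark-submit to take effect.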