SPARK
Installation and configuration
Local mode (single machine)
Spark download page: http://spark.apache.org/downloads.html. Pick the Spark build that matches your Hadoop version. Download spark-2.4.0-bin-hadoop2.7.tgz to /app/hadoop/, extract it, and set ownership. A JDK environment must already be configured.
[[email protected] hadoop]# tar -vzxf spark-2.4.0-bin-hadoop2.7.tgz
[[email protected] hadoop]# chown -R hadoop:hadoop spark-2.4.0-bin-hadoop2.7
[[email protected] hadoop]# vi /etc/profile
export JAVA_HOME=/usr/lib/java/jdk1.8.0_191
export CLASSPATH=.:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
export PYSPARK_PYTHON=python3
export SPARK_HOME=/app/hadoop/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Python 3.0 or later is required here. If the system's default Python is 2.7, switch to Python 3 or newer, otherwise pyspark will fail at startup. The PYSPARK_PYTHON variable sets which Python interpreter pyspark runs with.
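A quick way to confirm the interpreter is a Python 3.x build is a small check like the following, run with the same python3 that PYSPARK_PYTHON points at (a minimal sketch; the assertion message is only illustrative):

import sys

# pyspark 2.4 as configured above expects a Python 3 interpreter via PYSPARK_PYTHON
print(sys.version)
assert sys.version_info[0] >= 3, "PYSPARK_PYTHON must point at a Python 3 interpreter"

With the interpreter confirmed, the bundled SparkPi example verifies the installation: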
[[email protected] bin]# cd /app/hadoop/spark-2.4.0-bin-hadoop2.7/bin/
[[email protected] bin]# ./run-example SparkPi 2>&1 | grep "Pi is"
Pi is roughly 3.1465357326786636
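The SparkPi example estimates π by Monte Carlo sampling. A minimal PySpark sketch of the same idea (this is not the bundled Scala example itself; the local[2] master, app name, and sample count of 1,000,000 are illustrative choices):

from operator import add
from random import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PiSketch").master("local[2]").getOrCreate()
n = 1000000  # number of random samples

def inside(_):
    # throw a dart at the square [-1, 1] x [-1, 1]; count it if it lands inside the unit circle
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n), 2).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()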
Run pyspark. PySpark is the API that Spark provides for Python developers.
[[email protected] bin]# pyspark
Python 3.6.1 (default, Jan 16 2019, 18:18:10)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/hadoop/spark-2.4.0-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/hadoop/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-02-21 17:30:00 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Python version 3.6.1 (default, Jan 16 2019 18:18:10)
SparkSession available as 'spark'.
You can run computations directly:
1+2+3
6
Exit the shell with exit().
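Beyond shell arithmetic, a short self-contained script gives a feel for the RDD API. A minimal sketch (the app name and the local[*] master are illustrative; run it with spark-submit or with python3 once pyspark is importable):

from pyspark.sql import SparkSession

# local[*] uses all local cores; the app name is arbitrary
spark = SparkSession.builder.appName("LocalRddSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 101))              # distribute the numbers 1..100
print(rdd.sum())                                 # 5050
print(rdd.filter(lambda x: x % 2 == 0).count())  # 50 even numbers
spark.stop()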
Spark standalone cluster configuration
Edit the slaves file in the conf directory:
[[email protected] conf]$ vi slaves
hadoop1
hadoop2
hadoop3
Edit the spark-env.sh file (vim ./conf/spark-env.sh) and add the following configuration at the top:
export SPARK_MASTER_HOST=192.168.189.130
Each node's own IP; on node 2 and node 3, change this to that node's IP:
export SPARK_LOCAL_IP=192.168.189.130
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_DIST_CLASSPATH=$(/app/hadoop/hadoop-2.8.5/bin/hadoop classpath)
With SPARK_DIST_CLASSPATH set, Spark can store data in, and read data from, the Hadoop distributed file system (HDFS). Without it, Spark can only read and write local data and cannot access HDFS.
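As a sanity check that HDFS access works once SPARK_DIST_CLASSPATH is set, something like the following can be run. The namenode URI hdfs://hadoop1:9000 and the file path are only assumptions that match the node names used in this post; point them at whatever actually exists in your HDFS:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HdfsReadSketch").getOrCreate()

# hypothetical path: replace hdfs://hadoop1:9000 with your fs.defaultFS and use a file that exists
lines = spark.sparkContext.textFile("hdfs://hadoop1:9000/user/hadoop/README.txt")
print(lines.count())
spark.stop()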
Copy the Spark directory to the other two nodes (after copying, remember to change SPARK_LOCAL_IP in spark-env.sh):
[[email protected] conf]$ scp -r /app/hadoop/spark-2.4.0-bin-hadoop2.7/ hadoop2:/app/hadoop/
[[email protected] conf]$ scp -r /app/hadoop/spark-2.4.0-bin-hadoop2.7/ hadoop3:/app/hadoop/
Copy the profile to the other two nodes (alternatively, configure the environment variables on them by hand):
[[email protected] conf]$ scp /etc/profile hadoop2:/etc/profile
[[email protected] conf]$ scp /etc/profile hadoop3:/etc/profile
Standalone startup: start the whole cluster with one command (not recommended):
[[email protected] sbin]$ cd /app/hadoop/spark-2.4.0-bin-hadoop2.7/sbin
[[email protected] sbin]$ ./start-all.sh
Or start the master and each worker separately:
[[email protected] sbin]$ ./start-master.sh -h 192.168.189.130
[[email protected] sbin]$ ./start-slave.sh spark://192.168.189.130:7077
[[email protected] sbin]$ ./start-slave.sh spark://192.168.189.130:7077
[[email protected] sbin]$ ./start-slave.sh spark://192.168.189.130:7077
This walkthrough follows the post below as a reference.
Source: ****
Original: https://blog.****.net/lyhkmm/article/details/87881078