
Spark Documentation

Category: Articles • 2022-11-01 21:48:29

Core Concepts

Master-worker model: request processing flow
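The Master only accepts the jobs that clients submit and directs the Workers to complete the tasks. Worker and Executor describe two different aspects of the same node.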

Cluster Manager

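The application responsible for allocating cluster resources. The options are Spark's own built-in standalone cluster manager, Mesos, YARN, and Kubernetes.
SparkContext acquires Executor processes from the cluster manager.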
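Which cluster manager an application talks to is normally selected by the master URL passed at submit time. The sketch below is plain Python, not part of Spark: the helper name `cluster_manager_for` and the host names in the usage lines are made up for illustration. Only the URL prefixes follow the formats Spark documents for master URLs.

```python
def cluster_manager_for(master_url: str) -> str:
    """Return the cluster manager selected by a Spark master URL.

    Hypothetical helper for illustration; Spark does this parsing internally.
    """
    if master_url == "local" or master_url.startswith("local["):
        # local, local[4], local[*]: everything runs in one JVM, no cluster manager
        return "local"
    if master_url.startswith("spark://"):
        return "standalone"
    if master_url.startswith("mesos://"):
        return "mesos"
    if master_url == "yarn":
        # the cluster location comes from the Hadoop configuration, not the URL
        return "yarn"
    if master_url.startswith("k8s://"):
        return "kubernetes"
    raise ValueError(f"unrecognized master URL: {master_url!r}")


# Host names and ports below are placeholders.
for url in ["local[*]", "spark://master-host:7077", "yarn", "k8s://https://k8s-host:6443"]:
    print(url, "->", cluster_manager_for(url))
```

Whichever manager is chosen, the flow described above stays the same: SparkContext asks it for Executors, then works with those Executors directly.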

Worker

A Worker can be thought of as a physical machine, but it is really a logical concept: a worker node.

Executor

From the application's point of view: suppose you are an application, then an Executor is a process of yours that is hosted on some Worker node.

Task & Stage

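Picture a block of tofu cut twice, once across and once lengthwise: Task and Stage look at the same thing along two different cuts.
A Task is the machine-side view. SparkContext sends the program (including the processing logic) to the Executor processes on the Workers. Once an Executor has the data and the logic, it can run a task.
A Stage is the application-side view. Spark data analysis is a series of transformations on a dataset (filtering, aggregation, and so on), and a Stage is one phase of those transformations. It is a logical concept, whereas a Task is a physical one. Strictly speaking, a Stage is not a single transformation: consecutive transformations that need no shuffle (such as filters) run in the same Stage, and a new Stage begins where data has to be shuffled (such as an aggregation by key).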
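The relationship between the two cuts can be sketched with a toy model. This is plain Python, not Spark's scheduler. The names `Op`, `split_into_stages`, and `total_tasks` are invented for the example, and it assumes every stage has the same number of partitions, which real jobs need not. It also places the whole shuffle operation in the new stage, whereas Spark runs the map-side part of a shuffle in the preceding stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    shuffle: bool  # True if the operation must redistribute data across partitions


def split_into_stages(ops):
    """Group consecutive operations; each shuffle operation opens a new stage."""
    stages = [[]]
    for op in ops:
        if op.shuffle and stages[-1]:
            stages.append([])
        stages[-1].append(op.name)
    return stages


def total_tasks(stages, num_partitions):
    """One task per partition per stage (simplifying assumption)."""
    return len(stages) * num_partitions


pipeline = [
    Op("textFile", False),
    Op("filter", False),
    Op("map", False),
    Op("reduceByKey", True),  # aggregation by key needs a shuffle
    Op("map", False),
]

stages = split_into_stages(pipeline)
print(stages)                  # [['textFile', 'filter', 'map'], ['reduceByKey', 'map']]
print(total_tasks(stages, 4))  # 8
```

The stage list is the logical cut (which phases exist); the task count is the physical cut (how many units of work are shipped to Executors).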

Driver

The application's main process and its entry point. The official definition:

The process running the main() function of the application and creating the SparkContext

Job

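A Job is what we usually mean by "I want to get such-and-such done": the work that delivers that result is one job.
Seen physically, a job is split into multiple Tasks; seen logically, it is split into multiple Stages.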

Spark Streaming


Miscellaneous notes

Broadcast variable

Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
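The two kinds of shared variable can be mimicked in plain Python to show the intended access pattern. This is a single-process model, not Spark code: the `Accumulator` class, `run_task`, and the sample data are all invented here. In PySpark the corresponding entry points are `SparkContext.broadcast` and `SparkContext.accumulator`.

```python
from types import MappingProxyType


class Accumulator:
    """Add-only counter: tasks may only add to it; the driver reads the total."""

    def __init__(self, initial=0):
        self._value = initial

    def add(self, amount):
        self._value += amount

    @property
    def value(self):
        return self._value


def run_task(partition, lookup, unknown_codes):
    """One task: translate codes through the shared lookup, count the misses."""
    translated = []
    for code in partition:
        if code in lookup:
            translated.append(lookup[code])
        else:
            unknown_codes.add(1)
    return translated


# "Broadcast": a read-only value that every task can see.
country_names = MappingProxyType({"CN": "China", "US": "United States"})

# Two partitions of input records, processed by two tasks.
partitions = [["CN", "US"], ["XX", "CN"]]
unknown_codes = Accumulator()

results = [run_task(p, country_names, unknown_codes) for p in partitions]
print(results)              # [['China', 'United States'], ['China']]
print(unknown_codes.value)  # 1
```

The read-only mapping stands in for a broadcast value that tasks consult but never modify, while the counter only ever receives additions from tasks and is read once at the end, as the driver would.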


Reference Guide

http://spark.apache.org/docs/latest/cluster-overview.html