[Translation] ZooKeeper: A Distributed Coordination Service for Distributed Applications

ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.

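ZooKeeper is a distributed, open-source coordination service designed for distributed applications. It gives those applications a set of primitives from which to build higher-level services for synchronization, configuration maintenance, grouping, and naming. It is easy to program against, and its data model is the familiar directory-tree structure of a file system. It runs on Java and ships with bindings for Java and C.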

Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.

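Coordination services are notoriously hard to write correctly. They keep running into race conditions and deadlocks. The motivation behind ZooKeeper is to spare each distributed application from having to implement its own tedious coordination service.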

Design Goals

ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The namespace consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers.

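ZooKeeper is simple. It lets distributed processes coordinate with one another through a shared hierarchical namespace. The namespace is made up of znodes, which play the role that files and directories play in a file system. Unlike an ordinary file system, however, ZooKeeper keeps its data in memory rather than on disk, which is how it achieves high throughput and low latency.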

The ZooKeeper implementation puts a premium on high performance, highly available, strictly ordered access. The performance aspects of ZooKeeper mean it can be used in large, distributed systems. The reliability aspects keep it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.

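ZooKeeper's implementation delivers high performance, high availability, and strictly ordered access.
High performance means ZooKeeper can be used in large distributed systems.
High reliability means ZooKeeper does not become a single point of failure.
Strict ordering means sophisticated synchronization primitives can be implemented at the client.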

  • ZooKeeper is replicated
    Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
    The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
    Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heart beats. If the TCP connection to the server breaks, the client will connect to a different server.

  • ZooKeeper is ordered
    ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.

  • ZooKeeper is fast. It is especially fast in “read-dominant” workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.


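  • ZooKeeper is replicated
    Like the distributed applications it coordinates, ZooKeeper itself runs as multiple identical instances on several machines, together called an ensemble.
    All servers that make up the ZooKeeper service must know about one another. They maintain an in-memory state image, plus persistent transaction logs and snapshots. As long as a majority of the servers are available, the ZooKeeper service is available.
    A client connects to a single ZooKeeper server. It keeps a TCP connection open, over which it sends requests, receives responses, receives watch events, and sends heartbeats. If the TCP connection breaks, the client connects to a different server.

  • ZooKeeper is ordered
    ZooKeeper assigns each update a sequence number that reflects its position among all transactions. Later operations can use this order to implement higher-level abstractions such as synchronization primitives.

  • ZooKeeper is fast
    It has a particular speed advantage in read-heavy applications. ZooKeeper applications can run on thousands of machines, and it suits scenarios where reads are more frequent than writes (a read-to-write ratio of about 10:1).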
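The majority rule in the "replicated" item can be made concrete with a little arithmetic. The following is a minimal sketch only: the function names are invented for this illustration and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def is_available(ensemble_size: int, servers_up: int) -> bool:
    """The service is available while a majority of servers is reachable."""
    return servers_up >= quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(n, quorum_size(n), tolerated_failures(n))
```

Under this rule a 4-server ensemble tolerates no more failures than a 3-server one, which is one reason odd ensemble sizes are a common choice.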

Data model and the hierarchical namespace

The namespace provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper’s namespace is identified by a path.

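The namespace ZooKeeper provides looks much like a standard file system. A name is a sequence of path elements separated by slashes (/), and every node in ZooKeeper's namespace is identified by a path.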
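To illustrate the naming scheme, here is a small sketch that splits a slash-separated path into its elements and derives the parent path. The helper names and the example path `/app1/p_1` are assumptions made for this illustration; the validation rules are deliberately simpler than whatever a real server enforces.

```python
from typing import List


def split_path(path: str) -> List[str]:
    """Return the path elements of an absolute, slash-separated path."""
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    if path == "/":
        return []  # the root has no elements
    elements = path[1:].split("/")
    if any(element == "" for element in elements):
        # catches "//" and trailing slashes
        raise ValueError("path contains an empty element")
    return elements


def parent_path(path: str) -> str:
    """Return the path of the parent node."""
    elements = split_path(path)
    if not elements:
        raise ValueError("the root has no parent")
    return "/" + "/".join(elements[:-1])


if __name__ == "__main__":
    print(split_path("/app1/p_1"))
    print(parent_path("/app1/p_1"))
```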

[Figure: ZooKeeper's Hierarchical Namespace]

Nodes and ephemeral nodes


Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file-system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.

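Unlike a traditional file system, every node in the ZooKeeper namespace can have data associated with it as well as children, as if a file system allowed a file to also be a directory. (ZooKeeper is designed to store coordination data such as status information, configuration, and location information, so the data at each node is usually very small, from a few bytes to a few kilobytes.) We use the term znode for a ZooKeeper data node.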

Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode’s data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.

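A znode maintains a stat structure that holds version numbers for data changes and ACL changes, along with timestamps, to support cache validation and coordinated updates. Each time a znode's data changes, its version number increases. Whenever a client retrieves data, it also receives the version of that data.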
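The version number is what makes coordinated (conditional) updates possible: a client reads the data together with its version and later writes only if the version is unchanged. The toy class below sketches that idea in memory. `Znode`, `BadVersionError`, and the method names are hypothetical names chosen for this example and are not the client library's API; treating an expected version of -1 as "match any version" follows the convention of ZooKeeper's setData call.

```python
from typing import Tuple


class BadVersionError(Exception):
    """Raised when the caller's expected version is stale."""


class Znode:
    """Toy znode: a byte payload plus a data version counter."""

    def __init__(self, data: bytes = b"") -> None:
        self.data = data
        self.version = 0

    def get_data(self) -> Tuple[bytes, int]:
        # A read returns the data together with its current version.
        return self.data, self.version

    def set_data(self, data: bytes, expected_version: int = -1) -> int:
        # -1 means "unconditional write" in this sketch.
        if expected_version != -1 and expected_version != self.version:
            raise BadVersionError(
                f"expected {expected_version}, found {self.version}"
            )
        self.data = data
        self.version += 1  # every data change bumps the version
        return self.version


if __name__ == "__main__":
    node = Znode(b"v0")
    _, seen = node.get_data()            # two clients both read version 0
    node.set_data(b"from A", seen)       # first writer wins
    try:
        node.set_data(b"from B", seen)   # second writer holds a stale version
    except BadVersionError as err:
        print("rejected:", err)
```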

The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.

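The data stored at each znode in the namespace is read and written atomically. A read returns all the data of the znode, and a write replaces all of it. Every znode has an access control list (ACL) that restricts who can do what.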

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted.

ZooKeeper also has ephemeral znodes. Such a znode exists only while the session that created it is active. When the session ends, the znode is deleted.
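The lifetime rule for ephemeral znodes can be sketched with a toy in-memory namespace that remembers which session created each ephemeral node. `ToyNamespace`, its methods, the integer session ids, and the `/workers` paths are all assumptions made for this illustration; a real server tracks sessions through heartbeats and timeouts, which this sketch leaves out.

```python
from typing import Dict, List, Optional


class ToyNamespace:
    """Toy namespace with persistent and ephemeral nodes."""

    def __init__(self) -> None:
        self.nodes: Dict[str, bytes] = {}
        # Only ephemeral nodes appear here: path -> owning session id.
        self.owners: Dict[str, int] = {}

    def create(self, path: str, data: bytes = b"",
               session_id: Optional[int] = None) -> None:
        if path in self.nodes:
            raise ValueError(f"{path} already exists")
        self.nodes[path] = data
        if session_id is not None:
            self.owners[path] = session_id  # ephemeral: tied to the session

    def close_session(self, session_id: int) -> List[str]:
        """End a session and delete every ephemeral node it created."""
        expired = [p for p, s in self.owners.items() if s == session_id]
        for path in expired:
            del self.nodes[path]
            del self.owners[path]
        return sorted(expired)


if __name__ == "__main__":
    ns = ToyNamespace()
    ns.create("/workers")                       # persistent
    ns.create("/workers/w1", session_id=1)      # ephemeral, session 1
    ns.create("/workers/w2", session_id=2)      # ephemeral, session 2
    print(ns.close_session(1))
    print(sorted(ns.nodes))
```

Because the node disappears with its session, the set of children under a parent such as `/workers` can serve as a live membership list.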

Conditional updates and watches


ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification.

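ZooKeeper supports watches. A client can set a watch on a znode. When the znode changes, the watch is triggered and removed. When a watch is triggered, the client receives a packet telling it that the znode has changed. If the connection between the client and a ZooKeeper server is broken, the client receives a local notification.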

New in 3.6.0: Clients can also set permanent, recursive watches on a znode that are not removed when triggered and that trigger for changes on the registered znode as well as any children znodes recursively.
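As of version 3.6.0, clients can also set permanent, recursive watches on a znode. Such a watch is not removed when triggered, and it also fires for changes to child znodes, recursively.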
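The default "triggered and removed" behaviour is easy to get wrong in client code, so here is a minimal in-memory sketch of a one-shot watch. `WatchedStore` and its methods are hypothetical names for this example, not the real client API, and the permanent, recursive watches introduced in 3.6.0 are not modelled.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Optional

Watch = Callable[[str], None]


class WatchedStore:
    """Toy key-value store with one-shot watches on paths."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        self.watches: DefaultDict[str, List[Watch]] = defaultdict(list)

    def get_data(self, path: str, watch: Optional[Watch] = None) -> bytes:
        # Registering a watch is piggybacked on a read.
        if watch is not None:
            self.watches[path].append(watch)
        return self.data[path]

    def set_data(self, path: str, value: bytes) -> None:
        self.data[path] = value
        # Pop first so the watches are removed as they fire (one-shot).
        for callback in self.watches.pop(path, []):
            callback(path)  # says only *that* the node changed


if __name__ == "__main__":
    store = WatchedStore()
    store.data["/config"] = b"v1"
    events: List[str] = []
    store.get_data("/config", watch=events.append)
    store.set_data("/config", b"v2")  # fires the watch
    store.set_data("/config", b"v3")  # no watch left, nothing fires
    print(events)
```

The notification carries only the path, so a client that wants the new value has to read again, and it can set a fresh watch during that read.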

Guarantees

ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:

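ZooKeeper is fast and simple.
Because its goal is to serve as the foundation for more complicated services such as synchronization, it provides the following guarantees.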

Sequential Consistency - Updates from a client will be applied in the order that they were sent.

Sequential consistency - Updates from a client are applied in the order in which they were sent.

Atomicity - Updates either succeed or fail. No partial results.

Atomicity - An update either succeeds or fails. There are no partial updates.

Single System Image - A client will see the same view of the service regardless of the server that it connects to. i.e., a client will never see an older view of the system even if the client fails over to a different server with the same session.

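Single system image - A client sees the same view of the service no matter which server it connects to. A client never sees an older view of the system, even if it fails over to a different server.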

Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.

Reliability - Once an update has been applied, it persists until a client overwrites it with a newer update.

Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.

Timeliness - The system guarantees that a client's view is brought up to date within a bounded time.

Simple API


One of the design goals of ZooKeeper is providing a very simple programming interface. As a result, it supports only these operations:

One of ZooKeeper's design goals is to offer a simple programming interface. It therefore supports only the following operations:

create : creates a node at a location in the tree

delete : deletes a node

exists : tests if a node exists at a location

get data : reads the data from a node

set data : writes data to a node

get children : retrieves a list of children of a node

sync : waits for data to be propagated
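To show how small this interface is, the operations above can be mimicked by a toy, single-process tree. This is a sketch under stated assumptions: `ToyTree` and its method names are invented here and are not the real client API. Two rules in it, that a node's parent must already exist and that a node with children cannot be deleted, are choices made for this sketch. `sync` is left out because a single process has nothing to propagate.

```python
from typing import Dict, List


class ToyTree:
    """Toy in-memory tree keyed by absolute path."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {"/": b""}

    @staticmethod
    def _parent(path: str) -> str:
        return path.rsplit("/", 1)[0] or "/"

    def create(self, path: str, data: bytes = b"") -> None:
        if path in self._data:
            raise KeyError(f"{path} already exists")
        if self._parent(path) not in self._data:
            raise KeyError(f"parent of {path} does not exist")
        self._data[path] = data

    def exists(self, path: str) -> bool:
        return path in self._data

    def get_data(self, path: str) -> bytes:
        return self._data[path]

    def set_data(self, path: str, data: bytes) -> None:
        if path not in self._data:
            raise KeyError(f"{path} does not exist")
        self._data[path] = data  # a write replaces all the data

    def get_children(self, path: str) -> List[str]:
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):] for p in self._data
            if p != "/" and p.startswith(prefix)
            and "/" not in p[len(prefix):]  # direct children only
        )

    def delete(self, path: str) -> None:
        if self.get_children(path):
            raise ValueError(f"{path} has children")
        del self._data[path]


if __name__ == "__main__":
    tree = ToyTree()
    tree.create("/app", b"config")
    tree.create("/app/worker1")
    print(tree.get_children("/app"), tree.get_data("/app"))
```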

Implementation


ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.

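The high-level components of the ZooKeeper service are shown in the ZooKeeper Components figure. Apart from the request processor, every server holds its own copy of each component.
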
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.

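The replicated database is an in-memory database containing the entire data tree. For recoverability, updates are logged to disk, and every write is serialized to disk before it is applied to the in-memory database.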

Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.

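Every ZooKeeper server serves clients. A client submits its requests to exactly one server. Read requests are answered from that server's local replica of the database. Requests that change the state of the service, that is, write requests, are processed through an agreement protocol.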

As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.

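As part of the agreement protocol, all write requests from clients are forwarded to a single server, called the leader. The remaining ZooKeeper servers, called followers, receive message proposals from the leader and agree on their delivery. The messaging layer takes care of replacing the leader when it fails and of syncing followers with the new leader.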

ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state.

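ZooKeeper uses a custom atomic messaging protocol. Because the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it computes what the state of the system will be at the moment the write is applied, and turns the request into a transaction that captures that new state.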
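The last step, turning a request into a transaction that captures the new state, can be sketched as follows. This is a simplified illustration and not the server's actual code: `SetDataRequest`, `SetDataTxn`, `ToyLeader`, and `apply_txn` are hypothetical names, the order stamp is a plain counter, and proposals and acknowledgements between leader and followers are left out.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SetDataRequest:
    """What a client asks for: a possibly conditional write."""
    path: str
    data: bytes
    expected_version: int  # -1 means "any version" in this sketch


@dataclass(frozen=True)
class SetDataTxn:
    """What gets replicated: the resulting state, not the condition."""
    zxid: int
    path: str
    data: bytes
    new_version: int


class ToyLeader:
    def __init__(self, versions: Dict[str, int]) -> None:
        self.versions = dict(versions)  # version each node will have
        self.next_zxid = 1              # order stamp for transactions

    def propose(self, req: SetDataRequest) -> SetDataTxn:
        current = self.versions[req.path]
        if req.expected_version not in (-1, current):
            raise ValueError("version mismatch")
        txn = SetDataTxn(self.next_zxid, req.path, req.data, current + 1)
        self.next_zxid += 1
        self.versions[req.path] = txn.new_version
        return txn


def apply_txn(replica: Dict[str, Tuple[bytes, int]], txn: SetDataTxn) -> None:
    # Applying only overwrites state, so re-applying changes nothing.
    replica[txn.path] = (txn.data, txn.new_version)


if __name__ == "__main__":
    leader = ToyLeader({"/cfg": 3})
    txn = leader.propose(SetDataRequest("/cfg", b"a", expected_version=3))
    replica: Dict[str, Tuple[bytes, int]] = {}
    apply_txn(replica, txn)
    apply_txn(replica, txn)  # same result as applying once
    print(txn, replica)
```

Because the transaction records the outcome rather than the condition, every replica that applies the same transactions in the same order reaches the same state in this model.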

Uses

The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher-order operations, such as synchronization primitives, group membership, ownership, etc.

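ZooKeeper's programming interface is deliberately simple. Even so, you can build higher-order operations on top of it, such as synchronization primitives, group membership, and ownership.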

Performance

ZooKeeper is designed to be highly performant. But is it? Results from the ZooKeeper development team at Yahoo! Research indicate that it is. (See ZooKeeper Throughput as the Read-Write Ratio Varies.) Performance is especially high in applications where reads outnumber writes, since writes involve synchronizing the state of all servers. (Reads outnumbering writes is typically the case for a coordination service.)

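ZooKeeper is designed for high performance, and benchmark results from the ZooKeeper development team at Yahoo! bear this out. Performance is especially high in applications where reads far outnumber writes, because a write must synchronize the state of all servers. (A high read-to-write ratio is typical for a coordination service.)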

The ZooKeeper Throughput as the Read-Write Ratio Varies is a throughput graph of ZooKeeper release 3.2 running on servers with dual 2GHz Xeon and two SATA 15K RPM drives. One drive was used as a dedicated ZooKeeper log device. The snapshots were written to the OS drive. Write requests were 1K writes and the reads were 1K reads. “Servers” indicate the size of the ZooKeeper ensemble, the number of servers that make up the service. Approximately 30 other servers were used to simulate the clients. The ZooKeeper ensemble was configured such that leaders do not allow connections from clients.

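The figure ZooKeeper Throughput as the Read-Write Ratio Varies shows results for the ZooKeeper 3.2 release on servers with dual 2GHz Xeon processors and two SATA 15K RPM drives. One drive served as a dedicated ZooKeeper log device, and snapshots were written to the OS drive. Each write request was a 1K write and each read request a 1K read. "Servers" is the size of the ZooKeeper ensemble, that is, the number of servers making up the service. Roughly 30 other machines simulated the clients. The ensemble was configured so that the leader does not accept client connections.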

Note
In version 3.2, read/write performance improved by about 2x compared to the previous 3.1 release.

Benchmarks also indicate that it is reliable. Reliability in the Presence of Errors shows how a deployment responds to various failures. The events marked in the figure are the following:

Failure and recovery of a follower
Failure and recovery of a different follower
Failure of the leader
Failure and recovery of two followers
Failure of another leader


The benchmarks likewise show that ZooKeeper is reliable. The figure Reliability in the Presence of Errors shows how a deployment copes with various failures. The events marked in the figure are the five listed above.

Reliability

To show the behavior of the system over time as failures are injected we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark as before, but this time we kept the write percentage at a constant 30%, which is a conservative ratio of our expected workloads.

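To show how the system behaves as failures are injected, we ran a ZooKeeper service on a cluster of 7 machines. We ran the same benchmark as before, but this time held the write percentage at a constant 30%, which is conservative relative to our expected workloads.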

There are a few important observations from this graph. First, if followers fail and recover quickly, then ZooKeeper is able to sustain a high throughput despite the failure. Second, and maybe more importantly, the leader election algorithm allows the system to recover fast enough to prevent throughput from dropping substantially. In our observations, ZooKeeper takes less than 200ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.

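The graph shows several key points. First, if followers fail and recover quickly, ZooKeeper can still sustain high throughput. Second, and more importantly, the leader election algorithm lets the system recover fast enough to keep throughput from dropping substantially. In our observations, ZooKeeper takes less than 200ms to elect a new leader. Third, as followers recover, ZooKeeper can raise throughput again once they start processing requests.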

The ZooKeeper Project

ZooKeeper has been successfully used in many industrial applications. It is used at Yahoo! as the coordination and failure recovery service for Yahoo! Message Broker, which is a highly scalable publish-subscribe system managing thousands of topics for replication and data delivery. It is used by the Fetching Service for Yahoo! crawler, where it also manages failure recovery. A number of Yahoo! advertising systems also use ZooKeeper to implement reliable services.

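ZooKeeper has been adopted in many industrial applications. At Yahoo! it handles coordination and failure recovery for Yahoo! Message Broker, a highly scalable publish-subscribe system that manages replication and data delivery for thousands of topics. It is also used by the Fetching Service of the Yahoo! crawler, where it likewise manages failure recovery. A number of Yahoo! advertising systems use ZooKeeper to implement reliable services as well.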