Getting to Know Apache Storm



Why use Storm?

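Apache Storm is a free and open source distributed realtime computation system. It makes it simple to reliably process unbounded streams of data, works with many programming languages, and offers a rich set of features.

Apache Storm supports realtime analytics, machine learning, continuous computation, distributed RPC, ETL, and more.

Apache Storm is fast: each node can process one million tuples per second.

Apache Storm integrates with commonly used queueing and database components.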

Project Information

Storm can integrate with any message queue. The official examples include demos for the following queueing systems:
  1. Kestrel
  2. RabbitMQ / AMQP
  3. Kafka
  4. JMS
  5. Amazon Kinesis
Likewise, Storm can integrate with any database, and working with it is no different from everyday use: simply open a connection to your database and read/write like you normally would.

【Tuples】

When programming on Storm, you manipulate and transform streams of tuples, and a tuple is a named list of values. Tuples can contain objects of any type; if you want to use a type Storm doesn't know about, it's very easy to register a serializer for that type.
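
To make "a named list of values" concrete, here is a minimal sketch in plain Python. It is not Storm's API: the class `StreamTuple` and its method `get_value_by_field` are hypothetical names chosen for illustration. It only shows the idea that each value in a tuple can be looked up by its field name.

```python
class StreamTuple:
    """Toy stand-in for a tuple: a list of values addressed by field name.

    Hypothetical class for illustration only; not part of Storm.
    """

    def __init__(self, fields, values):
        if len(fields) != len(values):
            raise ValueError("each field needs exactly one value")
        self.fields = list(fields)
        self.values = list(values)

    def get_value_by_field(self, name):
        # Find the position of the field, then return the value at that position.
        return self.values[self.fields.index(name)]


word_count = StreamTuple(["word", "count"], ["storm", 3])
print(word_count.get_value_by_field("word"))   # storm
print(word_count.get_value_by_field("count"))  # 3
```

The values can be of any type; in a real cluster a custom type would additionally need a registered serializer, which this sketch does not model.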


【Three abstractions】spouts, bolts, and topologies


spouts

A spout is a source of streams in a computation. Typically a spout reads from a queueing broker such as Kestrel, RabbitMQ, or Kafka, but a spout can also generate its own stream or read from somewhere like the Twitter streaming API. Spout implementations already exist for most queueing systems.


bolts

A bolt processes any number of input streams and produces any number of new output streams. Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.


topologies

A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt. A topology is an arbitrarily complex multi-stage stream computation. Topologies run indefinitely when deployed.
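
The relationship between the three abstractions can be sketched with plain Python generators. This is only a toy model under stated assumptions, not Storm code: the function names (`sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`) are made up for illustration, everything runs in a single process, and the stream is finite, whereas a deployed topology runs indefinitely across a cluster.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream (a fixed list here instead of a queue)."""
    for sentence in ["the cow jumped", "the moon"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after each input."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology():
    # Chaining the generators plays the role of the topology's edges:
    # split_bolt subscribes to the spout, count_bolt subscribes to split_bolt.
    return list(count_bolt(split_bolt(sentence_spout())))


results = run_topology()
print(results)
# [('the', 1), ('cow', 1), ('jumped', 1), ('the', 2), ('moon', 1)]
```

Each stage only knows about the stream it reads and the stream it writes, which is what lets the stages be rearranged into more complex multi-stage computations.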


【Development and debugging】Storm has a "local mode" where a Storm cluster is simulated in-process. This is useful for development and testing. The "storm" command line client is used when you are ready to submit a topology for execution on an actual cluster.


【Getting started】The storm-starter project contains example topologies for learning the basics of Storm. Learn more about how to use Storm by reading the tutorial and the documentation.

【Scalable】

Storm topologies are inherently parallel and run across a cluster of machines. Different parts of the topology can be scaled individually by tweaking their parallelism. The "rebalance" command of the "storm" command line client can adjust the parallelism of running topologies on the fly.


Storm's inherent parallelism means it can process very high throughputs of messages with very low latency. Storm was benchmarked at processing one million 100 byte messages per second per node on hardware with the following specs:


【Fault tolerant】

Storm is fault-tolerant: when workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node.


The Storm daemons, Nimbus and the Supervisors, are designed to be stateless and fail-fast. So if they die, they will restart like nothing happened. This means you can kill -9 the Storm daemons without affecting the health of the cluster or your topologies.


Read more about Storm's fault tolerance in the manual.

【Guaranteed data processing】

Storm guarantees every tuple will be fully processed. One of Storm's core mechanisms is the ability to track the lineage of a tuple as it makes its way through the topology in an extremely efficient way. Read more about how this works here.


Storm's basic abstractions provide an at-least-once processing guarantee, the same guarantee you get when using a queueing system. Messages are only replayed when there are failures.
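
What "at-least-once" means in practice can be illustrated with a small simulation. This is a sketch under assumptions, not Storm's tracking mechanism: `run_at_least_once`, `flaky_process`, and the `max_attempts` parameter are hypothetical names, and a failure is modeled as an exception raised after the side effect has already happened.

```python
def run_at_least_once(messages, process, max_attempts=5):
    """Process every message, replaying any message whose processing fails."""
    pending = list(enumerate(messages))  # (message id, payload)
    attempts = {}
    while pending:
        msg_id, payload = pending.pop(0)
        attempts[msg_id] = attempts.get(msg_id, 0) + 1
        try:
            process(payload)  # success is treated as an acknowledgement
        except Exception:
            if attempts[msg_id] >= max_attempts:
                raise
            pending.append((msg_id, payload))  # failure: schedule a replay
    return attempts


seen = []
failed_once = set()


def flaky_process(payload):
    seen.append(payload)  # the side effect happens before the failure
    if payload == "b" and payload not in failed_once:
        failed_once.add(payload)
        raise RuntimeError("simulated worker failure")


attempts = run_at_least_once(["a", "b", "c"], flaky_process)
print(seen)      # ['a', 'b', 'c', 'b']
print(attempts)  # {0: 1, 1: 2, 2: 1}
```

No message is lost, and only the failed message is replayed, but its side effect is observed twice. That duplicate is why the guarantee is "at least once" rather than "exactly once".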


Using Trident, a higher level abstraction over Storm's basic abstractions, you can achieve exactly-once processing semantics.
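
One common way to get exactly-once results on top of replays is to make state updates idempotent by storing a transaction id next to the state. The sketch below illustrates that general idea only; it does not use Trident's API, and the class `TransactionalCounter`, the method `apply_batch`, and the assumption that a replayed batch carries the same transaction id as the original are all illustrative.

```python
class TransactionalCounter:
    """Counter that ignores a batch it has already applied.

    Hypothetical class for illustration; assumes batches arrive in order
    and that a replayed batch reuses the transaction id of the original.
    """

    def __init__(self):
        self.count = 0
        self.last_txid = None

    def apply_batch(self, txid, batch):
        if txid == self.last_txid:
            # Replay of the batch applied last: skip it to avoid double counting.
            return self.count
        self.count += len(batch)
        self.last_txid = txid
        return self.count


counter = TransactionalCounter()
first = counter.apply_batch(1, ["a", "b"])   # new batch, counted
replay = counter.apply_batch(1, ["a", "b"])  # replay of batch 1, skipped
second = counter.apply_batch(2, ["c"])       # new batch, counted
print(first, replay, second)  # 2 2 3
```

The messages may still be delivered more than once; it is the stored state that reflects each batch exactly once.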

【Use with any language】

Storm was designed from the ground up to be usable with any programming language. At the core of Storm is a Thrift definition for defining and submitting topologies. Since Thrift (Apache Thrift) can be used in any language, topologies can be defined and submitted from any language.


Similarly, spouts and bolts can be defined in any language. Non-JVM spouts and bolts communicate with Storm over a JSON-based protocol over stdin/stdout. Adapters that implement this protocol exist for Ruby, Python, JavaScript, and Perl.
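
A rough feel for a JSON-over-stdin/stdout bolt can be had from the sketch below. Treat the details as assumptions to verify against the multi-lang protocol documentation: the message fields (`id`, `tuple`, `command`, `anchors`) and the `end` line that terminates each message follow the protocol as I understand it, the initial handshake is omitted, and the helper names (`read_message`, `send_message`, `handle_sentence`) are hypothetical. In-memory streams stand in for stdin and stdout so the example runs on its own.

```python
import io
import json


def read_message(stream):
    """Read lines up to a line containing only "end" and parse them as JSON."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    return json.loads("\n".join(lines))


def send_message(message, stream):
    """Write one JSON message followed by the "end" terminator line."""
    stream.write(json.dumps(message) + "\nend\n")


def handle_sentence(tuple_msg, out):
    """Split the sentence in the incoming tuple and emit one tuple per word."""
    for word in tuple_msg["tuple"][0].split():
        send_message(
            {"command": "emit", "anchors": [tuple_msg["id"]], "tuple": [word]},
            out,
        )
    # Acknowledge the input tuple once all words have been emitted.
    send_message({"command": "ack", "id": tuple_msg["id"]}, out)


incoming = io.StringIO('{"id": "42", "tuple": ["hello storm"]}\nend\n')
outgoing = io.StringIO()
handle_sentence(read_message(incoming), outgoing)
print(outgoing.getvalue())
```

In a real deployment the existing Python adapter would handle this framing, so a bolt author normally writes only the equivalent of `handle_sentence`.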


storm-starter has an example topology that implements one of the bolts in Python.

【Easy to deploy and operate】

Storm clusters are easy to deploy, requiring a minimum of setup and configuration to get up and running. Storm's out-of-the-box configurations are suitable for production. Read more about how to deploy a Storm cluster here.


Additionally, Storm is easy to operate once deployed. Storm has been designed to be extremely robust – the cluster will just keep on running, month after month.

【Free and open source】

Apache Storm is a free and open source project licensed under the Apache License, Version 2.0.


Storm has a large and growing ecosystem of libraries and tools to use in conjunction with Storm, including:

  1. Spouts: These spouts integrate with queueing systems such as JMS, Kafka, Redis pub/sub, and more.
  2. storm-state: storm-state makes it easy to manage large amounts of in-memory state in your computations in a reliable way by using a distributed filesystem for persistence.
  3. Database integrations: There are helper bolts for integrating with various databases, such as MongoDB, RDBMSs, Cassandra (a distributed NoSQL database), and more.
  4. Other miscellaneous utilities

The Storm documentation has links to notable Storm-related projects hosted outside of Apache.