BDAHS Chapter 1: Big Data Analytics at a 10,000-Foot View
The goal of this book is to familiarize you with tools and techniques using Apache Spark, with a focus on Hadoop deployments and the tools used on the Hadoop platform. Most production implementations of Spark run on Hadoop clusters, and users experience many integration challenges with the wide variety of tools used with Spark and Hadoop. This book addresses the integration challenges faced with Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN) and explains the various tools used with Spark and Hadoop. It also discusses all the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, Spark Streaming, Structured Streaming, MLlib, GraphX, and SparkR) and their integration with analytics components such as Jupyter, Zeppelin, Hive, and HBase, and with dataflow tools such as NiFi. A real-time example of a recommendation system using MLlib will help us understand data science techniques. HDFS and YARN are introduced briefly below; the individual tools are covered in detail in later chapters.
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are significant: HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, which makes it well suited to applications with very large datasets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. It was originally built as infrastructure for the Apache Nutch web search engine project and is part of the Apache Hadoop Core project.
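From Spark, HDFS is accessed simply through hdfs:// paths (files are usually loaded into HDFS beforehand with the hdfs dfs -put command). A minimal PySpark sketch, assuming a hypothetical NameNode address and directory layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-write").getOrCreate()

# Read a raw log file that was previously copied into HDFS.
# The NameNode host/port and the directories are placeholders for your cluster.
logs = spark.read.text("hdfs://namenode:8020/data/raw/web_logs/2016-01-01.log")

# Write it back out as Parquet; HDFS replicates the underlying blocks
# across DataNodes for fault tolerance.
logs.write.mode("overwrite").parquet("hdfs://namenode:8020/data/parquet/web_logs")
```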
YARN
Apache Hadoop YARN (Yet Another Resource Negotiator) is Hadoop's resource manager: a general-purpose resource management system that provides unified resource management and scheduling for the applications running on top of it. Its introduction brought major benefits to clusters in terms of utilization, unified resource management, and data sharing.
The basic idea of YARN is to split the two main responsibilities of the JobTracker (resource management and job scheduling/monitoring) into separate components: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application here is either a traditional MapReduce job or a DAG (directed acyclic graph) of jobs.
The ResourceManager is the heart of the YARN hierarchy. It governs the entire cluster and manages the allocation of applications to the underlying compute resources, carving up resources (compute, memory, bandwidth, and so on) among the NodeManagers (YARN's per-node agents). The ResourceManager works with the ApplicationMasters to allocate resources and with the NodeManagers to launch and monitor their underlying applications. In this architecture, the ApplicationMaster takes over some of the role of the former TaskTracker, and the ResourceManager takes over the role of the JobTracker.
An ApplicationMaster manages each instance of an application running in YARN. It negotiates resources from the ResourceManager and, through the NodeManagers, monitors container execution and resource usage (CPU, memory, and so on). Note that although today's resources are fairly traditional (CPU cores, memory), new resource types based on the task at hand (such as graphics processing units or specialized processing devices) may appear in the future. From YARN's perspective, an ApplicationMaster is user code and therefore a potential security risk; YARN assumes that ApplicationMasters may be buggy or even malicious and treats them as unprivileged code.
A NodeManager manages each node in a YARN cluster. It provides per-node services, from supervising the lifetime management of containers to monitoring resources and tracking node health. Whereas MRv1 managed the execution of Map and Reduce tasks through slots, the NodeManager manages abstract containers, which represent per-node resources available to a particular application. YARN continues to use the HDFS layer, with its primary NameNode for metadata services and DataNodes for replicated storage services spread across the cluster.
To use a YARN cluster, a request first comes from a client containing an application. The ResourceManager negotiates the necessary resources for a container and launches an ApplicationMaster to represent the submitted application. Using a resource-request protocol, the ApplicationMaster negotiates resource containers on each node for the application's use. While the application runs, the ApplicationMaster monitors the containers until completion. When the application finishes, the ApplicationMaster deregisters its containers from the ResourceManager and the execution cycle is complete.
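To relate this to Spark: the executor resources requested by a Spark application are what its ApplicationMaster negotiates with the ResourceManager as containers, which the NodeManagers then launch. A minimal sketch, assuming HADOOP_CONF_DIR points at the cluster configuration; the executor sizes are illustrative only:

```python
from pyspark.sql import SparkSession

# Illustrative resource settings; the ApplicationMaster asks the ResourceManager
# for containers of roughly this shape, and NodeManagers launch and monitor them.
spark = (SparkSession.builder
         .appName("yarn-example")
         .master("yarn")
         .config("spark.executor.instances", "4")
         .config("spark.executor.memory", "2g")
         .config("spark.executor.cores", "2")
         .getOrCreate())

# The YARN application ID assigned by the ResourceManager, e.g. application_..._0001
print(spark.sparkContext.applicationId)
spark.stop()
```

In practice, the same settings are more commonly passed to spark-submit with --master yarn.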
In this chapter, we will approach Big Data analytics from a broad perspective and try to understand what tools and techniques are used on the Apache Hadoop and Apache Spark platforms. Big Data analytics is the process of analyzing Big Data to provide past, current, and future statistics and useful insights (often presented visually through charts and dashboards) that can be used to make better business decisions. Big Data analytics is broadly classified into two major categories, data analytics and data science, which are interconnected disciplines. This chapter will explain the differences between data analytics and data science. Current industry definitions for data analytics and data science vary according to their use cases, but let's try to understand what they accomplish.
Data analytics focuses on the collection and interpretation of data, typically with a focus on past and present statistics. Data science, on the other hand, focuses on the future by performing explorative analytics to provide recommendations based on models identified from past and present data.
Data analytics has two main types of analytics: descriptive analytics and diagnostic analytics. Data science also has two types of analytics: predictive analytics and prescriptive analytics. The following table compares data analytics and data science in terms of perspective, nature of work, output, tools, techniques, and skills:
| | Data analytics | Data science |
|---|---|---|
| Perspective | Looking backward | Looking forward |
| Nature of work | Report and optimize | Explore, discover, investigate, and visualize |
| Output | Reports and dashboards | Data product |
| Typical tools used | Hive, Impala, Spark SQL, and HBase | MLlib and Mahout |
| Typical techniques used | ETL and exploratory analytics | Predictive analytics and sentiment analytics |
| Typical skill set necessary | Data engineering, SQL, and programming | Statistics, machine learning, and programming |
This chapter will cover the following topics:
- Big Data analytics and the role of Hadoop and Spark
- Big Data science and the role of Hadoop and Spark
- Tools and techniques
- Real-life use cases
Big Data analytics and the role of Hadoop and Spark
Conventional data analytics uses Relational Database Management System (RDBMS) databases to create data warehouses and data marts for analytics using business intelligence tools. RDBMS databases use the Schema-on-Write approach; this approach has many downsides.
Traditional data warehouses were designed to Extract, Transform, and Load (ETL) data in order to answer a set of predefined questions, which are directly related to user requirements. Predefined questions are answered using SQL queries. Once the data is transformed and loaded in a consumable format, it becomes easier for users to access it with a variety of tools and applications to generate reports and dashboards. However, creating data in a consumable format requires several steps, which are listed as follows:
- Deciding on the predefined questions.
- Identifying and collecting data from source systems.
- Creating ETL pipelines to load the data into the analytic database in a consumable format.
If new questions arise, systems need to identify and add new data sources and create new ETL pipelines. This involves schema changes in databases, and the implementation effort typically ranges from one to six months. This is a big constraint and forces the data analyst to operate within predefined boundaries only.
Transforming data into a consumable format generally results in losing raw/atomic data that might hold insights or clues to the answers that we are looking for. Processing structured and unstructured data is another challenge in traditional data warehousing systems, and storing and processing large binary objects such as images or videos effectively is always a challenge.
Big Data analytics does not use relational databases; instead, it uses the Schema-on-Read (SOR) approach on the Hadoop platform, typically using Hive and HBase. There are many advantages to this approach.
The Schema-on-Read approach introduces flexibility and reusability to systems. The Schema-on-Read paradigm emphasizes storing the data in a raw, unmodified format and applying a schema to the data as needed, typically while it is being read or processed. This approach allows considerably more flexibility in the amount and type of data that can be stored. Multiple schemas can be applied to the same raw data to ask a variety of questions. If new questions need to be answered, just get the new data, store it in a new directory of HDFS, and start answering the new questions.
This approach also provides massive flexibility over how the data can be consumed, with multiple approaches and tools. For example, the same raw data can be analyzed using SQL analytics or complex Python or R scripts in Spark. Because the data is not stored in the multiple layers that ETL requires, storage cost and data movement cost are reduced. Analytics can be performed on unstructured as well as structured data sources.
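To illustrate Schema-on-Read with Spark, the same raw JSON files can be read with different schemas depending on the question being asked. This is only a sketch; the HDFS path and field names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

raw_path = "hdfs:///data/raw/orders/"   # raw, unmodified JSON files

# Question 1: revenue analysis -- apply a schema with only the fields needed.
revenue_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("order_ts", TimestampType()),
])
spark.read.schema(revenue_schema).json(raw_path).groupBy().sum("amount").show()

# Question 2: a different schema applied to the same raw data for a new question.
channel_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("channel", StringType()),
])
spark.read.schema(channel_schema).json(raw_path).groupBy("channel").count().show()
```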
A typical Big Data analytics project life cycle
The life cycle of Big Data analytics using Big Data platforms such as Hadoop is similar to that of traditional data analytics projects. However, a major paradigm shift is the use of the Schema-on-Read approach for data analytics.
Identifying the problem and outcomes
Identify the business problem and the desired outcome of the project clearly, so that it scopes what data is needed and what analytics can be performed. Some examples of business problems are company sales going down, customers visiting the website but not buying products, customers abandoning shopping carts, and a sudden rise in support call volume. Some examples of project outcomes are improving the buying rate by 10%, decreasing shopping cart abandonment by 50%, and reducing support call volume by 50% by the next quarter while keeping customers happy.
Identifying the necessary data
Identify the quality, quantity, format, and sources of data. Data sources can be data warehouses (OLAP), application databases (OLTP), log files from servers, documents from the Internet, and data generated from sensors and network hubs. Identify all the internal and external data source requirements. Also, identify the data anonymization and re-identification requirements for removing or masking personally identifiable information (PII).
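A common anonymization step is to hash or drop PII columns before the data lands in the analytics store. A minimal PySpark sketch; the input path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pii-masking").getOrCreate()

calls = spark.read.parquet("hdfs:///data/raw/call_records/")

anonymized = (calls
              # One-way hash of the phone number so records can still be joined on it.
              .withColumn("caller_hash", F.sha2(F.col("caller_number").cast("string"), 256))
              # Drop the raw identifiers entirely.
              .drop("caller_number", "caller_name"))

anonymized.write.mode("overwrite").parquet("hdfs:///data/clean/call_records/")
```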
Data collection
Collect data from relational databases using the Sqoop tool and stream data using Flume. Consider using Apache Kafka for reliable intermediate storage. Design and collect data with fault-tolerance scenarios in mind.
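Sqoop and Flume are driven from the command line; as an alternative illustration of the same relational ingestion, Spark's built-in JDBC reader can pull a table into HDFS directly. The connection URL, table, and credentials below are placeholders, and the JDBC driver must be on the Spark classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-ingest").getOrCreate()

# Pull a table from an application (OLTP) database into HDFS as Parquet.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://dbhost:3306/shop")
             .option("dbtable", "customers")
             .option("user", "etl_user")
             .option("password", "secret")
             .load())

customers.write.mode("overwrite").parquet("hdfs:///data/raw/customers/")
```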
Preprocessing data and ETL
Data comes in different formats, and there can be data quality issues. The preprocessing step converts the data into the needed format and cleanses inconsistent, invalid, or corrupt data. The performing-analytics phase is initiated once the data conforms to the needed format. Apache Hive, Apache Pig, and Spark SQL are great tools for preprocessing massive amounts of data. This step may not be needed in some projects if the data is already in a clean format or if analytics are performed directly on the source data with the Schema-on-Read approach.
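A minimal preprocessing sketch with Spark SQL's DataFrame API, assuming hypothetical raw CSV call-center records: drop duplicates and corrupt rows, normalize types, and write a clean Parquet copy:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preprocess").getOrCreate()

raw = (spark.read
       .option("header", "true")
       .csv("hdfs:///data/raw/support_calls/"))

clean = (raw
         .dropDuplicates(["call_id"])                      # remove duplicate records
         .filter(F.col("call_id").isNotNull())             # drop corrupt rows
         .withColumn("wait_seconds", F.col("wait_seconds").cast("int"))
         .withColumn("call_date", F.col("call_start").cast("date")))

clean.write.mode("overwrite").parquet("hdfs:///data/clean/support_calls/")
```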
Performing analytics
Analytics are performed in order to answer business questions. This requires an understanding of the data and of the relationships between data points. The types of analytics performed are descriptive and diagnostic analytics, which present past and current views on the data. They typically answer questions such as what happened and why it happened. In some cases, predictive analytics is performed to answer questions such as what would happen based on a hypothesis.
Apache Hive, Pig, Impala, Drill, Tez, Apache Spark, and HBase are great tools for data analytics in batch-processing mode. Real-time analytics tools such as Impala, Tez, Drill, and Spark SQL can be integrated into traditional business intelligence tools (Tableau, Qlikview, and others) for interactive analytics.
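For example, a descriptive query over the cleaned call data can be expressed directly in Spark SQL; the table and column names are hypothetical and follow the preprocessing sketch above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("descriptive-analytics").getOrCreate()

spark.read.parquet("hdfs:///data/clean/support_calls/").createOrReplaceTempView("support_calls")

# Descriptive/diagnostic view: how did wait times and abandon rates move by month?
spark.sql("""
    SELECT date_format(call_date, 'yyyy-MM')          AS month,
           AVG(wait_seconds)                          AS avg_wait_seconds,
           AVG(CASE WHEN abandoned THEN 1 ELSE 0 END) AS abandon_rate
    FROM support_calls
    GROUP BY date_format(call_date, 'yyyy-MM')
    ORDER BY month
""").show()
```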
Visualizing data
Data visualization is the presentation of the analytics output in a pictorial or graphical format to understand the analysis better and make business decisions based on the data.
Typically, finished data is exported from Hadoop to RDBMS databases using Sqoop for integration into visualization systems, or visualization systems are directly integrated with tools such as Tableau, Qlikview, Excel, and so on. Web-based notebooks such as Jupyter, Zeppelin, and Databricks cloud are also used to visualize data by integrating Hadoop and Spark components.
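In a web-based notebook such as Jupyter or Zeppelin, a small aggregated result is typically pulled to the driver and plotted; a minimal sketch, assuming matplotlib is available in the notebook environment and the path is hypothetical:

```python
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("viz").getOrCreate()

# Aggregate down to a small result before pulling it to the driver.
monthly = (spark.read.parquet("hdfs:///data/clean/support_calls/")
           .groupBy(F.date_format("call_date", "yyyy-MM").alias("month"))
           .agg(F.avg("wait_seconds").alias("avg_wait_seconds"))
           .orderBy("month"))

pdf = monthly.toPandas()     # small enough to fit on the driver
pdf.plot(x="month", y="avg_wait_seconds", kind="bar", legend=False)
plt.ylabel("Average wait (seconds)")
plt.show()
```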
The role of Hadoop and Spark
Hadoop and Spark provide you with great flexibility in Big Data analytics:
- Large-scale data preprocessing; massive datasets can be preprocessed with high performance
- Exploring large and full datasets; the dataset size does not matter
- Accelerating data-driven innovation by providing the Schema-on-Read approach
- A variety of tools and APIs for data exploration
Big Data science and the role of Hadoop and Spark
Data science is all about the following two aspects:
- Extracting deep meaning from the data
- Creating data products
Extracting deep meaning from data means fetching the value using statistical algorithms. A data product is a software system whose core functionality depends on the application of statistical analysis and machine learning to the data. Google AdWords and Facebook's People You May Know are a couple of examples of data products.
A fundamental shift from data analytics to data science
The fundamental shift from data analytics to data science is due to the rising need for better predictions and for creating better data products. Let's consider an example use case that explains the difference between data analytics and data science.
Problem: A large telecoms company has multiple call centers that collect caller information and store it in databases and filesystems. The company has already implemented data analytics on the call center data, which provided the following insights:
- Service availability
- The average speed of answering, average hold time, average wait time, and average call time
- The call abandon rate
- The first call resolution rate (whether a second call follows shortly after) and the cost per call
- Agent occupancy
Now, the telecoms company would like to reduce customer churn, improve the customer experience, improve service quality, and cross-sell and up-sell by understanding its customers in near real time.
Solution: Analyze the customer voice. The customer voice carries deeper insights than any other information. Convert all calls to text using tools such as CMU Sphinx and scale out on the Hadoop platform. Perform text analytics to derive insights from the data; to gain high accuracy in call-to-text conversion, create models (language and acoustic) that are suitable for the company and retrain them on a frequent basis as conditions change. Also, create models for text analytics using machine learning and natural language processing (NLP), combined with the data analytics metrics, to come up with the following metrics (a minimal MLlib text-model sketch follows the list below):
- Top reasons for customer churn
- Customer sentiment analysis
- Customer and problem segmentation
- 360-degree view of the customer
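A minimal sketch of such a text model with Spark MLlib; the training data (call transcripts labeled with a churn or sentiment flag) and the column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("call-text-model").getOrCreate()

# Hypothetical input: one row per call, with the transcript text and a 0/1 label
# (for example, whether the customer churned within 30 days of the call).
calls = spark.read.parquet("hdfs:///data/clean/call_transcripts/")

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="transcript", outputCol="words"),
    StopWordsRemover(inputCol="words", outputCol="filtered"),
    HashingTF(inputCol="filtered", outputCol="tf", numFeatures=1 << 16),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(labelCol="label", featuresCol="features"),
])

train, test = calls.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
model.transform(test).select("transcript", "label", "prediction").show(5)
```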
CMU Sphinx
CMU Sphinx is a family of open source speech recognition tools developed at Carnegie Mellon University. It provides acoustic models, language models, and decoders (such as PocketSphinx and Sphinx4) for converting recorded speech into text, and its models can be adapted or retrained for a specific domain such as call-center audio.
Notice that the business requirements of this use case created a fundamental shift from data analytics to data science, implementing machine learning and NLP algorithms. To implement this solution, new tools and techniques are used and a new role, the data scientist, is needed.
A data scientist has a combination of multiple skill sets: statistics, software programming, and business expertise. Data scientists create data products and extract value from the data. Let's see how data scientists differ from other roles. This will help us understand the roles and tasks performed in data science and data analytics projects.
Data scientists versus software engineers
The difference between the data scientist and software engineer roles is as follows:
- Software engineers develop general-purpose software for applications based on business requirements
- Data scientists don't develop application software, but they develop software to help them solve problems
- Typically, software engineers use programming languages such as Java, C++, and C#
- Data scientists tend to focus more on scripting languages such as Python and R
Data scientists versus data analysts
The difference between the data scientist and data analyst roles is as follows:
- Data analysts perform descriptive and diagnostic analytics using SQL and scripting languages to create reports and dashboards.
- Data scientists perform predictive and prescriptive analytics using statistical techniques and machine learning algorithms to find answers. They typically use tools such as Python, R, SPSS, SAS, MLlib, and GraphX.
Data scientists versus business analysts
The difference between the data scientist and business analyst roles is as follows:
- Both have a business focus, so they may ask similar questions
- Data scientists have the technical skills to find the answers
A typical data science project life cycle
Let's learn how to approach and execute a typical data science project.
The typical data science project life cycle shown in Figure 1.4 is iterative, whereas the data analytics project life cycle shown in Figure 1.3 is not. The defining-problems-and-outcomes and communicating phases are not part of the iterations used to improve the outcomes of the project; however, the overall project life cycle is iterative and needs to be revisited from time to time after the production implementation.
Defining problems and outcomes and the data preprocessing phases are similar to those of the data analytics project explained in Figure 1.3, so let's discuss the new steps required for data science projects.
Hypothesis and modeling
Given the problem, consider all the possible solutions that could match the desired outcome. This typically involves a hypothesis about the root cause of the problem, so questions around the business problem arise, such as why customers are canceling the service, why support calls are increasing significantly, and why customers are abandoning shopping carts.
A hypothesis identifies the appropriate model given a deeper understanding of the data. This involves understanding the attributes of the data and their relationships, and building the environment for modeling by defining datasets for testing, training, and production. Create the appropriate model using machine learning algorithms such as logistic regression, k-means clustering, decision trees, or Naive Bayes.
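As one example, a customer-segmentation hypothesis could be modeled with k-means clustering in MLlib; the feature columns and the input path are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("customer-segments").getOrCreate()

customers = spark.read.parquet("hdfs:///data/clean/customer_features/")

# Assemble the numeric attributes we hypothesize drive churn into a feature vector.
assembler = VectorAssembler(
    inputCols=["monthly_minutes", "support_calls", "tenure_months"],
    outputCol="features")
features = assembler.transform(customers)

kmeans = KMeans(k=5, seed=1, featuresCol="features")
model = kmeans.fit(features)
segments = model.transform(features)   # adds a 'prediction' column with the cluster ID
segments.groupBy("prediction").count().show()
```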
Measuring the effectiveness
Execute the model by running the identified model against the datasets. Measure the effectiveness of the model by checking the results against the desired outcome. Use test data to verify the results and create metrics such as the Mean Squared Error (MSE) to measure effectiveness.
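A minimal sketch of measuring MSE on a held-out test set with MLlib's RegressionEvaluator; the model, feature columns, and label column are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("model-evaluation").getOrCreate()

data = spark.read.parquet("hdfs:///data/clean/training_set/")
data = VectorAssembler(inputCols=["tenure_months", "support_calls"],
                       outputCol="features").transform(data)

train, test = data.randomSplit([0.7, 0.3], seed=7)
model = LinearRegression(featuresCol="features", labelCol="label").fit(train)

evaluator = RegressionEvaluator(metricName="mse", labelCol="label",
                                predictionCol="prediction")
print("MSE on the test set: %.4f" % evaluator.evaluate(model.transform(test)))
```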
Making improvements
The measurements will illustrate how much improvement is required. Consider what you might change. You can ask yourself the following questions:
- Was the hypothesis around the root cause correct?
- Would ingesting additional datasets provide better results?
- Would other solutions provide better results?
Once you've implemented your improvements, test them again and compare them with the previous measurements in order to refine the solution further.
Communicating the results
Communication of the results is an important step in the data science project life cycle. The data scientist tells the story found within the data by correlating the story to business problems. Reports and dashboards are common tools for communicating the results.
The role of Hadoop and Spark
Apache Hadoop provides you with distributed storage and resource management, while Spark provides you with in-memory performance for data science applications. Hadoop and Spark have the following advantages for data science projects:
- A wide range of applications and third-party packages
- A machine learning algorithms library for easy usage
- Spark integrations with deep learning libraries such as H2O and TensorFlow
- Scala, Python, and R for interactive analytics using the shell
- A unification feature: using SQL, machine learning, and streaming together
Tools and techniques
Let's take a look at the different tools and techniques used in Hadoop and Spark for Big Data analytics.
While the Hadoop platform can be used for both storing and processing data, Spark is used for processing only, by reading data into memory. The following is a tabular representation of the tools and techniques used in typical Big Data analytics projects:
| | Tools used | Techniques used |
|---|---|---|
| Data collection | Apache Flume for real-time data collection and aggregation; Apache Sqoop for data import and export from relational data stores and NoSQL databases; Apache Kafka for the publish-subscribe messaging system; general-purpose tools such as FTP/Copy | Real-time data capture; export; import; message publishing; data APIs; screen scraping |
| Data storage and formats | HDFS: primary storage for Hadoop; HBase: NoSQL database; Parquet: columnar format; Avro: serialization system on Hadoop; Sequence File: binary key-value pairs; RC File: first columnar format in Hadoop; ORC File: optimized RC File; XML and JSON: standard data interchange formats; compression formats: Gzip, Snappy, LZO, Bzip2, Deflate, and others; unstructured: text, images, videos, and so on | Data storage; data archival; data compression; data serialization; schema evolution |
| Data transformation and enrichment | MapReduce: Hadoop's processing framework; Spark: compute engine; Hive: data warehouse and querying; Pig: data flow language; Python: functional programming; Crunch, Cascading, Scalding, and Cascalog: special MapReduce tools | Data munging; filtering; joining; ETL; file format conversion; anonymization; re-identification |
| Data analytics | Hive: data warehouse and querying; Pig: data flow language; Tez: alternative to MapReduce; Impala: alternative to MapReduce; Drill: alternative to MapReduce; Apache Storm: real-time compute engine; Spark Core: Spark's core compute engine; Spark Streaming: real-time compute engine; Spark SQL: for SQL analytics; SolR: search platform; Apache Zeppelin: web-based notebook; Jupyter Notebooks; Databricks cloud; Apache NiFi: data flow; Spark-on-HBase connector; programming languages: Java, Scala, and Python | Online Analytical Processing (OLAP); data mining; data visualization; complex event processing; real-time stream processing; full-text search; interactive data analytics |
| Data science | Python: functional programming; R: statistical computing language; Mahout: Hadoop's machine learning library; MLlib: Spark's machine learning library; GraphX and GraphFrames: Spark's graph processing framework and DataFrame adoption for graphs | Predictive analytics; sentiment analytics; text and natural language processing; network analytics; cluster analytics |