End-to-End Machine Learning: An OptimalFlow Automated Machine Learning Tutorial with a Real Formula E Project
In this end-to-end tutorial, we will illustrate how to use OptimalFlow (Documentation | GitHub), an Omni-ensemble automated machine learning toolkit, to predict the number of laps a driver will need to complete in an FIA Formula E race. This is a typical regression problem, and the prediction directly affects the team's racing and energy strategy.
Why use OptimalFlow? You can read an introduction to it in another story: "An Omni-ensemble Automated Machine Learning — OptimalFlow".
Project background:
The number of laps left in the race defines the team's strategy, i.e. deciding whether to drive aggressively and expend more battery, or to drive conservatively and save energy. The team always knows the status of the car's battery.
Available data:
Formula E public data, including historical lap-by-lap race timing data for every driver and weather data (wind, rain, etc.), available on the official website: https://results.fiaformula.com
Project simplified for this article:
We will simplify the problem to a single goal: predicting the total number of laps in a Formula E race. We will also ignore human intervention factors, such as crash possibilities or the safety car's impact on race timing, which could influence the variance of the features.
Step 1: Get the data
Get the public timing and weather data from https://results.fiaformula.com, and keep its hierarchy by saving the files in similarly structured folders. We will extract the folder directories as features such as Series Season (e.g. 2017–2018), Match Location (e.g. Berlin), and Match Type (e.g. FP1).
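As a sketch of that extraction step (the folder layout, depth, and column names here are assumptions for illustration, not the article's exact code), each file's relative path can be split into its directory levels:

```python
from pathlib import Path

import pandas as pd


def extract_path_features(root: str) -> pd.DataFrame:
    """Walk the raw-data folder tree and turn each file's directory
    levels into features (season / location / match type)."""
    records = []
    for f in Path(root).rglob("*.csv"):
        # Assumes a <season>/<location>/<match type>/<file> layout.
        season, location, match_type = f.relative_to(root).parts[:3]
        records.append({
            "file_loc": str(f),        # keep the full path for later joins
            "season": season,          # e.g. "2017-2018"
            "location": location,      # e.g. "Berlin"
            "match_type": match_type,  # e.g. "FP1"
        })
    return pd.DataFrame(records)
```

Keeping the full path in `file_loc` is deliberate: it becomes the raw material for the join key built in Step 2.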
Step 2: Data integration
Before you feed data to OptimalFlow modules, essential data engineering steps are required. There are three types of data: analysis data, weather data, and classification data. For this simplified problem, we can treat the analysis data as the race data and ignore the classification data. We will also include the analysis data from Free Practices, Qualifying Races, Super Pole, and Races.
The next challenge is that the weather data and the race data are stored separately, which makes it hard to find the relationships between them. So we need to set up a connection by creating a 'key' column/feature.
To create the join key, we quickly go through the raw data; obviously, there is no race date information there, only lap-by-lap timing data. But we know the weather data and analysis data are all saved in the same folder directory, which means we can use each file's directory information as the 'key' to merge the weather and analysis/race data.
So we combine all the weather data into one dataset, and append the race data year by year. Meanwhile, we extract and save each dataset's file location information (named 'file_loc') individually for the later merge with the weather data.
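A minimal sketch of the weather-combination step (the folder root and the `*Weather*.csv` file-name pattern are assumptions for illustration):

```python
import glob

import pandas as pd


def combine_weather(root: str = "raw_data") -> pd.DataFrame:
    """Combine every weather CSV under `root` into one dataset,
    keeping each source file's path in a 'file_loc' column for
    the later merge with the contest data."""
    frames = []
    for path in glob.glob(f"{root}/**/*Weather*.csv", recursive=True):
        df = pd.read_csv(path)
        df["file_loc"] = path  # remember where this chunk came from
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

The same pattern (read, tag with `file_loc`, concatenate) applies to the race data, appended year by year.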
After the previous coding, we have yearly contest datasets and an integrated weather dataset.
Next, we find the features common to all the yearly contest datasets, combine the yearly datasets into one integrated dataset, and save it as "contest_data.csv".
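One way to sketch that "common features, then stack" logic (the function name is illustrative, not from the article):

```python
import functools

import pandas as pd


def combine_yearly(datasets: list) -> pd.DataFrame:
    """Keep only the columns shared by every yearly contest dataset,
    then stack the years into one integrated dataset."""
    common = functools.reduce(
        lambda a, b: a & b, (set(df.columns) for df in datasets))
    cols = sorted(common)  # a deterministic column order
    return pd.concat([df[cols] for df in datasets], ignore_index=True)

# combine_yearly([...]).to_csv("contest_data.csv", index=False)
```

Restricting to the intersection of columns avoids a merged dataset full of NaN blocks where a feature only exists in some seasons.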
Then we use the datasets' file directory information to create the 'key' column that connects the integrated weather data and the integrated contest data.
Thus, we get the meta dataset for the further merging step, with "match_key" as the "key" column mentioned previously.
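A sketch of how such a "match_key" can be derived from the saved file locations (one reasonable choice, assuming the directory part of the path uniquely identifies an event):

```python
import pandas as pd


def add_match_key(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a 'match_key' from each row's 'file_loc' by keeping
    the directory part of the path (season/location/match type),
    so weather and contest rows from the same event share one key."""
    out = df.copy()
    out["match_key"] = (
        out["file_loc"]
        .str.replace("\\", "/", regex=False)  # normalize path separators
        .str.rsplit("/", n=1).str[0]          # drop the file name
    )
    return out
```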
Step 3: Exploratory data analysis (EDA)
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
I prefer to use the pandas-profiling library to accelerate this step. But this step only gives us an overall feel for the integrated datasets and the feature relationships, and helps us understand which features may be relevant to our final prediction goal. So we don't need to spend too much time here; OptimalFlow's autoFS (auto feature selection module) and autoPP (auto feature preprocessing module) will also help cover these, in case we miss some insights in the EDA step.
Step 4: Data aggregation
We need to predict the total number of laps of a race, so the lap-by-lap contest data and the moment-by-moment weather readings are not, on their own, clear and powerful predictive factors.
So the better idea is data aggregation. We don't know which aggregating approach will most impact the prediction results, so we can try common calculations, such as mean, median, skewness, STD, etc., and apply them to both the integrated weather data and the contest data. Here's a code example of weather data aggregation:
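The original code block was not preserved here, so the following is a hedged reconstruction of the idea: group the weather readings by the event key and compute several summary statistics per numeric column (column names like 'match_key' follow the earlier steps; the exact statistics are illustrative).

```python
import pandas as pd


def aggregate_weather(weather: pd.DataFrame) -> pd.DataFrame:
    """Aggregate moment-by-moment weather readings to one row per
    event, computing several candidate statistics per column."""
    aggs = ["mean", "median", "std", "skew", "min", "max"]
    numeric = weather.select_dtypes("number").columns
    out = weather.groupby("match_key")[list(numeric)].agg(aggs)
    # Flatten the (column, statistic) MultiIndex: wind_mean, wind_std, ...
    out.columns = [f"{col}_{stat}" for col, stat in out.columns]
    return out.reset_index()
```

The same groupby-and-aggregate pattern applies to the contest data, e.g. lap-time statistics per driver per event.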
Then we merge the contest data with the weather data. There are 74 features in total in the merged dataset. Each row of records covers one driver in a specific Formula E event/race, with the aggregated contest and weather features. The output, or what we usually call the "prediction" in our problem, is the "Total_Lap_Num" column.
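The merge itself is then a single join on the shared key (a sketch: 'match_key' comes from the earlier step, and a left join is one reasonable choice that keeps every contest row even when matching weather records are missing):

```python
import pandas as pd


def merge_contest_weather(contest: pd.DataFrame,
                          weather: pd.DataFrame) -> pd.DataFrame:
    """Join the aggregated contest and weather datasets on the
    event-level 'match_key', keeping all contest rows."""
    return contest.merge(weather, on="match_key", how="left")
```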
In summary:
Data preparation is crucially important for machine learning. As the foundation of further modeling steps, it usually needs the efforts of both data engineers and data scientists. Domain experience and familiarity with the data source are the key factors that shape the strategy for cleaning and integrating the raw data.
In Part 2 of this tutorial, we will use the OptimalFlow library to implement Omni-ensemble automated machine learning.
About me:
I am a healthcare & pharmaceutical data scientist and a big data analytics & AI enthusiast. I developed the OptimalFlow library to help data scientists build optimal models in an easy way and automate machine learning workflows with simple code.
As a big data insights seeker, process optimizer, and AI professional with years of analytics experience, I use machine learning and problem-solving skills in data science to turn data into actionable insights while providing strategic and quantitative products as solutions for optimal outcomes.