Kaggle Learn: Learn Machine Learning 5. Model Validation

5. Model Validation

This article is part of the Kaggle self-study series; click here to return to the table of contents.


This tutorial is part of the Learn Machine Learning series. In this step, you will learn to use model validation to measure the quality of your model. Measuring model quality is the key to iteratively improving your models.

 

What is Model Validation

 

You've built a model. But how good is it?

You'll need to answer this question for almost every model you ever build. In most (though not necessarily all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?

Some people try answering this question by making predictions with their training data. They compare those predictions to the actual target values in the training data. This approach has a critical shortcoming, which you will see in a moment (and which you'll subsequently see how to solve). (Translator's note: in the previous section we predicted on the same data we trained with, which is why the predictions looked 100% accurate.)

Even with this simple approach, you'll need to summarize the model quality into a form that someone can understand. If you have predicted and actual home values for 10000 houses, you will inevitably end up with a mix of good and bad predictions. Looking through such a long list would be pointless.

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Let's break down this metric starting with the last word, error.

The prediction error for each house is:

error = actual − predicted

So, if a house cost $150,000 and you predicted it would cost $100,000, the error is $50,000.

With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as

On average, our predictions are off by about X

(Translator's note: the smaller this average is, the closer our predictions are to the actual values.)
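The MAE recipe above can be sketched in a few lines of plain Python. The prices below are made-up numbers chosen only for illustration; they are not taken from the Melbourne data.

```python
# Hypothetical actual and predicted prices for three houses (made-up numbers).
actual = [150_000, 200_000, 320_000]
predicted = [100_000, 210_000, 300_000]

# error = actual - predicted, one value per house.
errors = [a - p for a, p in zip(actual, predicted)]

# Take the absolute value of each error, then average the absolute errors.
absolute_errors = [abs(e) for e in errors]
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)
print(mae)
```

scikit-learn's mean_absolute_error function computes the same quantity, so in practice there is no need to write this loop by hand.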

We first load the Melbourne data and create X and y. That code isn't shown here, since you've already seen it a couple of times. (Translator's note: here is a quick look at it anyway.)

[Image: screenshot of the data-loading code]

Make sure you understand what this code does.

 


We then create the decision tree model with this code:

[Image: screenshot of the model-building code]

The calculation of mean absolute error in the Melbourne data is:

[Image: screenshot of the MAE calculation]
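As a rough sketch of the two steps above (creating the decision tree model, then computing its MAE on the same data it was trained on), the code might look like the following. The X and y here are a small synthetic stand-in, which is an assumption made so the snippet runs on its own; the variable name predicted_home_prices is also illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Melbourne predictors (X) and prices (y).
# These numbers are invented; they are NOT the real data set.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 500_000 * X[:, 0] + rng.normal(0, 50_000, size=200)

# Define and fit the model on the full sample.
melbourne_model = DecisionTreeRegressor(random_state=0)
melbourne_model.fit(X, y)

# Predict on the very same rows the model was trained on.
predicted_home_prices = melbourne_model.predict(X)
in_sample_mae = mean_absolute_error(y, predicted_home_prices)
print(in_sample_mae)
```

On this synthetic data an unrestricted tree can memorize every training row, so the printed error comes out at or very near zero even though y contains a lot of noise.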

The Problem with "In-Sample" Scores

The measure we just computed can be called an "in-sample" score. We used a single set of houses (called a data sample) for both building the model and for calculating its MAE score. This is bad.

Imagine that, in the large real estate market, door color is unrelated to home price. However, in the sample of data you used to build the model, it may be that all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors. (Translator's note: I think coincidences like this are hard to avoid.)

Since this pattern was originally derived from the training data, the model will appear accurate in the training data.

But this pattern likely won't hold when the model sees new data, and the model would be very inaccurate (and cost us lots of money) when we applied it to our real estate business. (Translator's note: in other words, a model trained this way may not carry over to new data.)

Even a model capturing only happenstance relationships in the data, relationships that will not be repeated in new data, can appear to be very accurate on in-sample accuracy measurements.
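The green-door effect can be reproduced with a small experiment. In this sketch the only predictor is pure noise (a hypothetical "door shade" number) and prices are drawn independently of it, so any pattern the tree finds is happenstance. All names and numbers are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)


def make_houses(n):
    """Return a noise-only predictor and prices drawn independently of it."""
    door_shade = rng.uniform(0, 1, size=(n, 1))
    price = rng.normal(500_000, 100_000, size=n)
    return door_shade, price


train_X, train_y = make_houses(300)
new_X, new_y = make_houses(300)  # plays the role of houses seen later

model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)

# Score on the sample used to build the model, then on unseen houses.
in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))
new_data_mae = mean_absolute_error(new_y, model.predict(new_X))

print(in_sample_mae)
print(new_data_mae)
```

The in-sample score should look excellent, while the score on the new houses should be far worse, because the "pattern" the tree learned exists only in the training sample.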

 

Example

Models' practical value comes from making predictions on new data, so we should measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on data it hasn't seen before. This data is called validation data. (Translator's note: of the data we already have, some is used for training and some for evaluating the result of that training.)
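To make the idea of holding data out concrete, here is a minimal hand-rolled split using NumPy. The arrays are synthetic placeholders, and holding out one quarter of the rows is an arbitrary choice for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder predictors and target; invented for this illustration.
n_rows = 100
X = rng.uniform(0, 1, size=(n_rows, 2))
y = rng.uniform(0, 1, size=n_rows)

# Shuffle the row indices so the held-out rows are a random subset.
shuffled = rng.permutation(n_rows)
n_val = n_rows // 4  # hold out a quarter of the rows for validation
val_idx = shuffled[:n_val]
train_idx = shuffled[n_val:]

# Rows in the validation set never take part in model building.
train_X, val_X = X[train_idx], X[val_idx]
train_y, val_y = y[train_idx], y[val_idx]

print(len(train_X), len(val_X))
```

Every row ends up in exactly one of the two sets, which is the property that makes the validation score an honest estimate.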

 

 

 

The scikit-learn library has a function train_test_split to break up the data into two pieces, so the code to get a validation score looks like this:

 

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# X and y are the predictors and target created earlier.
# Split data into training and validation data, for both predictors and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# Get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

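Possible errors:

1.  "*** is not defined": you forgot to import the package that provides that name. Every name the code uses, such as train_test_split, DecisionTreeRegressor and mean_absolute_error, has to be imported first.

2.  The result differs on every run; it should land somewhere between 39000 and 41000. Does that mean our error is high? (Possibly because the chosen X is not very good.) The split itself is fixed by random_state, so the run-to-run variation most likely comes from the decision tree, which is built with some randomness unless it is given its own random_state.

3.  An error during fit: the decision tree regressor works on numeric values only. The X you chose may contain string columns, or some rows may have missing values (those rows need to be filtered out).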
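For the third error in the list above, one possible cleanup with pandas is sketched below. The miniature table, its column names and its values are all invented for illustration; dropping rows and non-numeric columns is only the simplest option, not necessarily the best one for the real data.

```python
import pandas as pd

# Hypothetical miniature housing table; names and values are invented.
data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [150.0, None, 400.0, 320.0],  # one missing value
    "Suburb": ["A", "B", "C", "D"],           # a string column
    "Price": [800_000, 950_000, 1_200_000, 1_000_000],
})

# Drop every row that contains a missing value.
data = data.dropna(axis=0)

# The target is the price; the predictors are the remaining numeric columns.
y = data["Price"]
X = data.drop(columns=["Price"]).select_dtypes(include="number")

print(list(X.columns))
print(len(X))
```

After this step X holds only numeric columns with no missing values, which is what the fit call needs.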

 

Your Turn

1.   Use the train_test_split command to split up your data.

2.   Fit the model with the training data.

3.   Make predictions with the validation predictors.

4.   Calculate the mean absolute error between your predictions and the actual target values for the validation data. (Translator's note: an error measure like this is always non-negative. Here it is the absolute distance; other metrics square the error instead.)

 

Continue

Now that you can measure model performance, you are ready to run some experiments comparing different models. It's an especially fun part of machine learning. Once you've done the steps above in your notebook, click here to continue.

 
