Black Box Machine Learning Study Notes

This post covers the first lecture of Bloomberg's machine learning course and is the Day 15 material of the 100-day machine learning challenge.

The index post for the 100-day machine learning challenge is linked here

Contents

1 ML

1.1 What is ML

1.2 Types of ML problems

2 Elements of the ML Pipeline

3 Evaluating a Prediction Function: Loss Functions

4 Other Sources of Test ≠ Deployment

4.1 Leakage

4.2 Sample Bias

4.3 Nonstationarity

5 Model Complexity and Overfitting


1 ML

1.1 What is ML

Rule-based approaches are not ML:

(lecture slide)

The ML approach to solving a problem is:

(lecture slide)

1.2 Types of ML problems

Classification (hard / soft probabilistic)

Multiclass classification (hard / soft probabilistic)

Regression

In statistical learning there are two kinds of models: probabilistic models and non-probabilistic models.

Probabilistic model: takes the form P(y|x); given an input x, the trained model outputs a probability for each possible value of y.

Non-probabilistic model: takes the form of a decision function, i.e., a mapping from input x to output y, producing a single deterministic output.

Soft classification: uses a probabilistic model that outputs a probability for each class; the final prediction is the class with the highest probability, e.g., classification by combining multiple SVMs.

Hard classification: uses a non-probabilistic model; the prediction is simply the output of the decision function.

Reference article: https://blog.csdn.net/eternity1118_/article/details/51525702
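To make the hard/soft distinction concrete, here is a minimal sketch in Python (my own illustration, not from the lecture), assuming scikit-learn's LogisticRegression: predict returns a hard class label, while predict_proba returns the soft per-class probabilities P(y|x).

    # Hard vs. soft classification: a minimal illustrative sketch.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Toy binary classification data.
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    clf = LogisticRegression().fit(X, y)

    # Hard classification: the decision function maps an input directly to a class label.
    print(clf.predict(X[:3]))        # e.g. [0 1 0]

    # Soft classification: the probabilistic model outputs P(y|x) for every class;
    # the final label is the class with the highest probability.
    print(clf.predict_proba(X[:3]))  # each row sums to 1, one column per class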

2 Elements of the ML Pipeline

Feature Extraction:

(lecture slide)

The lecture gives an example: deciding whether a string is an email address. Feature extraction for this could look like the slide below; a rough code sketch follows it.

(lecture slide)
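The slide itself is not reproduced here; below is a minimal Python sketch of the kind of hand-crafted features one might extract for this task. The feature names and the helper extract_features are my own hypothetical choices, not taken from the slide.

    # Hypothetical hand-crafted features for "is this string an email address?".
    def extract_features(s: str) -> dict:
        return {
            "length": len(s),
            "contains_at": int("@" in s),
            "has_dot_after_at": int("@" in s and "." in s.split("@")[-1]),
            "num_spaces": s.count(" "),
            "ends_with_common_tld": int(s.endswith((".com", ".org", ".edu"))),
        }

    print(extract_features("jane.doe@example.com"))  # "email-like" feature values
    print(extract_features("not an email"))          # no @, contains spaces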

A more systematic approach is to use one-hot encoding:

(lecture slide)

One-hot Encoding:

one-hot encoding: a set of binary features that always has exactly one nonzero value.
categorical variable: a variable that takes one of several discrete possible values:
        NYC Boroughs: “Brooklyn”, “Bronx”, “Queens”, “Manhattan”, “Staten Island”
Categorical variables can be encoded numerically using one-hot encoding.
        In statistics, called a dummy variable encoding
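As a concrete sketch (my own illustration using the borough example above), one-hot encoding turns the single categorical variable into five binary features, exactly one of which is nonzero:

    # One-hot encoding of the NYC borough categorical variable (plain-Python sketch;
    # pandas.get_dummies or scikit-learn's OneHotEncoder do the same thing in practice).
    BOROUGHS = ["Brooklyn", "Bronx", "Queens", "Manhattan", "Staten Island"]

    def one_hot(value, categories=BOROUGHS):
        # Assumes `value` is one of the categories: exactly one entry is 1, all others 0.
        return [int(value == c) for c in categories]

    print(one_hot("Queens"))  # [0, 0, 1, 0, 0]
    print(one_hot("Bronx"))   # [0, 1, 0, 0, 0]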

Key concepts in this part:

(lecture slide)

3 Evaluating a Prediction Function: Loss Functions

(lecture slide)

Prediction function needs to do well on new inputs.

The prediction function needs to perform well on new inputs; this is what motivates the test set.

(lecture slide)

We need to strike a trade-off between how much data goes into the training set and how much goes into the test set.

(lecture slide)

After adding validation data:

Validation set is like test set, but used to choose best among many prediction functions.
Test set is just used to evaluate the final chosen prediction function.
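A minimal sketch of this three-way split in Python (my own illustration; the 60/20/20 proportions are an assumption, not something prescribed by the lecture):

    # Train / validation / test split with scikit-learn's train_test_split.
    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.random.rand(1000, 10), np.random.rand(1000)

    # Hold out 20% as the test set; it is only touched once, at the very end.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Split the rest into training (60% of total) and validation (20% of total).
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    # Validation set: choose the best among many candidate prediction functions.
    # Test set: evaluate the single, finally chosen prediction function.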

k-fold cross-validation (useful when data is limited):

(lecture slide)

While studying cross-validation I had a question: k-fold cross-validation gives an average performance estimate, but it also produces k models (with k different sets of parameters), so how do we pick the best one?

The answer I found via Google (link) is as follows:

In short, the k-fold CV results are used to choose the model (e.g., whether to use linear regression or a neural network), not the parameters. Once the model is chosen, the final parameters are those obtained by training that model on all of the data.

Let's get some terminology straight, generally when we say 'a model' we refer to a particular method for describing how some input data relates to what we are trying to predict. We don't generally refer to particular instances of that method as different models. So you might say 'I have a linear regression model' but you wouldn't call two different sets of the trained coefficients different models. At least not in the context of model selection.

So, when you do K-fold cross validation, you are testing how well your model is able to get trained by some data and then predict data it hasn't seen. We use cross validation for this because if you train using all the data you have, you have none left for testing. You could do this once, say by using 80% of the data to train and 20% to test, but what if the 20% you happened to pick to test happens to contain a bunch of points that are particularly easy (or particularly hard) to predict? We will not have come up with the best estimate possible of the model's ability to learn and predict.

We want to use all of the data. So to continue the above example of an 80/20 split, we would do 5-fold cross validation by training the model 5 times on 80% of the data and testing on 20%. We ensure that each data point ends up in the 20% test set exactly once. We've therefore used every data point we have to contribute to an understanding of how well our model performs the task of learning from some data and predicting some new data.

But the purpose of cross-validation is not to come up with our final model. We don't use these 5 instances of our trained model to do any real prediction. For that we want to use all the data we have to come up with the best model possible. The purpose of cross-validation is model checking, not model building.

Now, say we have two models, say a linear regression model and a neural network. How can we say which model is better? We can do K-fold cross-validation and see which one proves better at predicting the test set points. But once we have used cross-validation to select the better performing model, we train that model (whether it be the linear regression or the neural network) on all the data. We don't use the actual model instances we trained during cross-validation for our final predictive model.

Note that there is a technique called bootstrap aggregation (usually shortened to 'bagging') that does in a way use model instances produced in a way similar to cross-validation to build up an ensemble model, but that is an advanced technique beyond the scope of your question here.
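A minimal Python sketch of the workflow described in that answer: use k-fold cross-validation to compare model families, then retrain the winner on all of the data. The two candidate models and the default scoring metric are my own illustrative choices.

    # K-fold CV for model *selection*, followed by a final refit on all the data.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LinearRegression
    from sklearn.neural_network import MLPRegressor

    X, y = np.random.rand(500, 8), np.random.rand(500)

    candidates = {
        "linear regression": LinearRegression(),
        "neural network": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    }

    # 5-fold CV estimates how well each model family learns from data and predicts unseen data.
    scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in candidates.items()}
    best_name = max(scores, key=scores.get)
    print(scores, "->", best_name)

    # The k fitted instances from CV are discarded; the chosen model is retrained on ALL the data.
    final_model = candidates[best_name].fit(X, y)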

4 Other Sources of Test ≠ Deployment

4.1 Leakage

Definition: Information about labels sneaks into features.

Some examples:

identifying cat photos by using the title on the page

including sales commission as a feature when ranking sales leads

using star rating as feature when predicting sentiment of Yelp review
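A minimal sketch (my own illustration, loosely inspired by the star-rating example above) of why leakage is dangerous: a feature that effectively encodes the label inflates offline accuracy, even though that signal will not be available, or not be meaningful, at deployment time.

    # Leakage sketch: a label-derived feature makes offline accuracy look great.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                          # legitimate features
    y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # true labels
    leaked = y + rng.normal(scale=0.1, size=1000)           # feature that is essentially the label

    X_leaky = np.column_stack([X, leaked])

    for name, data in [("without leak", X), ("with leak", X_leaky)]:
        X_tr, X_te, y_tr, y_te = train_test_split(data, y, test_size=0.3, random_state=0)
        acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
        print(name, round(acc, 3))   # the leaky model looks nearly perfect offline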

4.2 Sample Bias

Definition: Test inputs and deployment inputs have different distributions.

Some examples:

create a model to predict US voting patterns, but phone survey only dials landlines

building a stock forecasting model, but training using a random selection of companies that exist today

US census slightly undercounts certain subpopulations in a way that’s somewhat predictable based on demographic and geographic features.

4.3 Nonstationarity

(lecture slide)

5 Model Complexity and Overfitting

(lecture slide)