Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

该系列仅在原课程基础上部分知识点添加个人学习笔记，或相关推导补充等。如有错误，还请批评指教。在学习了 Andrew Ng 课程的基础上，为了更方便的查阅复习，将其整理成文字。因本人一直在学习英语，所以该系列以英文为主，同时也建议读者以英文为主，中文辅助，以便后期进阶时，为学习相关领域的学术论文做铺垫。- ZJ

Coursera 课程 |deeplearning.ai |网易云课堂

转载请注明作者和出处：ZJ 微信公众号-「SelfImprovementLab」

知乎：https://zhuanlan.zhihu.com/c_147249273

****：http://blog.****.net/junjun_zhao/article/details/79149780

1.7 When to change dev_test sets and metrics 什么时候该改变开发_测试集和指标？

(字幕来源：网易云课堂)

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

You’ve seen how sets of dev set and evaluation metricis like placing a target somewhere for your team to aim at.But sometimes partway through a project you might realize you put your target in the wrong place.In that case you should move your target.Let’s take a look at an example.Let’s say you build a cat classifier to try to find lots of pictures of cats to show to your cat loving usersand the metric that you decided to use is classification error.So algorithms A and B have, respectively,3 percent error and 5 percent error,so it seems like Algorithm A is doing better.But let’s say you try out these algorithms, you look at these algorithms and Algorithm A, for some reason, is letting through a lot of the pornographic images.So if you ship Algorithm A the users would see more cat images because you’ll see 3 percent error and identify cats,but it also shows the users some pornographic images which is totally unacceptable both for your company,as well as for your users.In contrast, Algorithm B has 5 percent error so this classifies fewer images but it doesn’t have pornographic images.So from your company’s point of view,as well as from a user acceptance point of view,Algorithm B is actually a much better algorithm because it’s not letting through any pornographic images.

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

你已经学过如何设置开发集和评估指标，就像是把目标定在某个位置让你的团队瞄准，但有时候在项目进行途中，你可能意识到目标的位置放错了，这种情况下你应该移动你的目标，我们来看一个例子，假设你在构建一个猫分类器试图找到很多猫的照片，向你的爱猫人士用户展示，你决定使用的指标是分类错误率，所以算法 A 和 B 分别有， 3％错误率和 5％错误率，所以算法 A 似乎做得更好，但我们实际试一下这些算法你观察一下这些算法，算法A 由于某些原因把很多色情图像分类成猫了，如果你部署算法 A 那么用户就会，看到更多猫图因为它识别猫的错误率只有 3%，但它同时也会给用户推送，一些色情图像这是你的公司完全不能接受的，你的用户也完全不能接受，相比之下算法 B 有 5％的错误率这样分类器就得到较少的图像，但它不会推送色情图像，所以从你们公司的角度来看，以及从用户接受的角度来看，算法 B 实际上是一个更好的算法，因为它不让任何色情图像通过。

So, what has happened in this example is that Algorithm A is doing better on evaluation metric.It’s getting 3 percent error but it is actually a worse algorithm.In this case, the evaluation metric plus the dev set it prefers Algorithm A because they’re saying, look,Algorithm A has lower error which is the metric you’re using but you and your users prefer Algorithm B because it’s not letting through pornographic images.So when this happens,when your evaluation metric is no longer correctly rank ordering preferences between algorithms,in this case is mispredicting that Algorithm A is a better algorithm,then that’s a sign that you should change your evaluation metric or perhaps your development set or test set.In this case the misclassification error metric that you’re using can be written as follows: this one over m,a number of examples in your development set,of sum from i equals 1 to m dev,number of examples in this development set of indicator of whether or not the prediction on example i in your development set is not equal to the actual label i,where they use this notation to denote their predicted value.Right. So these are zero.And this indicates a function notation,counts up the number of examples on which this thing inside it’s true.So this formula just counts up the number of misclassified examples.The problem with this evaluation metric is that they treat pornographic and non-pornographic images equally but you really want your classifier to not mislabel pornographic images,like maybe you recognize a pornographic image as cat image and therefore show it to unsuspecting user,therefore very unhappy with unexpectedly seeing porn.

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

那么在这个例子中发生的事情就是算法 A，在评估指标上做得更好，它的错误率达到 3% 但实际上是个更糟糕的算法，在这种情况下评估指标加上开发集，它们都倾向于选择算法 A 因为它们会说看，算法A的错误率较低这是你们自己定下来的指标评估出来的，但你和你的用户更倾向于使用算法 B，因为它不会将色情图像分类为猫，所以当这种情况发生时，当你的评估指标，无法正确衡量算法之间的优劣排序时，在这种情况下原来的指标错误地预测算法 A 是更好的算法，这就发出了信号你应该改变评估指标了，或者要改变开发集或测试集，在这种情况下你用的分类错误率指标，可以写成这样 1 除以 m，m是你的开发集例子数，从 i=1 到 m dev 求和，开发集中的例子数还有指示器I，开发集例子y(i)的预计，不符合实际标签i时的指示器，他们用这些符号来表示预测值，对所以这些都是 0，这符号表示一个函数，统计出里面这个表达式为真的例子数，所以这个公式就统计了分类错误的例子，这个评估指标的问题在于，它对色情图片和非色情图片一视同仁，但你其实真的希望你的分类器不会错误标记色情图像，比如说把一张色情图片分类为猫，然后推送给不知情的用户，他们看到色情图片会非常不满。

One way to change this evaluation metric would be if you add the weight term here,we call this w(i) where w(i) is going to be equal to 1 if x(i) is non-porn and maybe 10 or maybe even large number like a 100 if x(i) is porn.So this way you’re giving a much larger weightto examples that are pornographic so that the error term goes up much moreif the algorithm makes a mistake on classifying a pornographic image as a cat image.In this example you giving 10 times bigger weights to classify pornographic images correctly.

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

其中一个修改评估指标的方法是这里加个权重项，我们将这个称为 w(i) 其中如果图片 x(i) 不是色情图片 w(i)=1，如果x(i)是色情图片 w(i) 可能就是10 甚至100，这样你赋予了色情图片更大的权重，让算法将色情图分类为猫图时，错误率这个项快速变大，这个例子里，你把色情图片分类成帽这一错误的惩罚权重加大 10 倍。

If you want this normalization constant,technically this becomes sum over i of w(i),so then this error would still be between zero and one.The details of this weighting aren’t important and actually to implement this weighting,you need to actually go through your dev and test sets,so label the pornographic images in your dev and test sets so you can implement this weighting function.But the high level of take away is,if you find that evaluation metric is not giving the correct rank order preference for what is actually better algorithm,then there’s a time to think about defining a new evaluation metric.And this is just one possible way that you could define an evaluation metric.The goal of the evaluation metric is accurately tell you,given two classifiers, which one is better for your application.For the purpose of this video,don’t worry too much about the details of how we define a new error metric,the point is that if you’re not satisfied with your old error metricthen don’t keep coasting with an error metric you’re unsatisfied with,instead try to define a new one that you think better captures your preferences in terms of what’s actually a better algorithm.

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

如果你希望得到归一化常数，在技术上就是 w(i)对所有 i 求和，这样错误率仍然在 0 和 1 之间，加权的细节并不重要 实际上要使用这种加权，你必须自己过一遍开发集和测试集，在开发集和测试集里，自己把色情图片标记出来这样你才能使用这个加权函数，但粗略的结论是，如果你的评估指标，无法正确评估好算法的排名，那么就需要花时间定义一个新的评估指标，这是定义评估指标的其中一种可能方式，评估指标的意义在于准确告诉你，已知两个分类器哪一个更适合你的应用，就这个视频的内容而言，我们不需要太注重新错误率指标是怎么定义的，关键在于如果你对旧的错误率指标不满意，那就不要一直沿用你不满意的错误率指标，而应该尝试定义一个新的指标, 能够更加符合你的偏好定义出实际更适合的算法。

One thing you might notice is that so far we’ve only talked about how to define a metric to evaluate classifiers.That is, we’ve defined an evaluation metric that helps us better rank order classifiers when they are performing at varying levels in terms of streaming of porn.And this is actually an example of an orthogonalization whereI think you should take a machine learning problem and break it into distinct steps.One step is to figure out how to define a metric that captures what you want to do,and I would worry separately about how to actually do well on this metric.So think of the machine learning task as two distinct steps.

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

你可能注意到了到目前为止我们只讨论了，如何定义一个指标去评估分类器，也就是说我们定义了一个评估指标帮助我们，更好的把分类器排序，能够区分出它们在识别色情图片的不同水平，这实际上是一个正交化的例子，我想你处理机器学习问题时应该把它切分成独立的步骤，一步是弄清楚如何定义一个指标来衡量你想做的事情的表现，然后我们可以分开考虑如何改善系统在这个指标上的表现，你们要把机器学习任务看成两个独立的步骤。

To use the target analogy,the first step is to place the target.So define where you want to aim and then as a completely separate step,this is one knob you can tune which is how do you place the target as a completely separate problem.Think of it as a separate knob to tune in terms of how to do well at this algorithm,how to aim accurately or how to shoot at the target.Defining the metric is step one and you do something else for step two.In terms of shooting at the target,maybe your learning algorithm is optimizing some cost function that looks like this,where you are minimizing sum of losses on your training set.One thing you could do is to also modify this in order to incorporate these weights and maybe end up changing this normalization constant as well.So it just 1 over a sum of w(i).Again, the details of how you define J aren’t important,but the point was with the philosophy of orthogonalization think of placing the target as one stepand aiming and shooting at a target as a distinct step which you do separately.In other words I encourage you to think of,defining the metric as one step and only after you define a metric,figure out how to do well on that metric which might be changing the cost function J that your neural network is optimizing.

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

用目标这个比喻，第一步就是设定目标，所以要定义你要瞄准的目标这是完全独立的一步，这是你可以调节的一个旋钮，如何设立目标是一个完全独立的问题，把它看成是一个单独的旋钮可以调试算法表现的旋钮，如何精确瞄准如何命中目标，定义指标是第一步然后第二步要做别的事情，在逼近目标的时候，也许你的学习算法针对某个长这样的成本函数优化，你要最小化训练集上的损失，你可以做的其中一件事是修改这个，为了引入这些权重，也许最后需要修改这个归一化常数，所以这是 1 除以对 w(i) 的求和，再次如何定义 J 并不重要，关键在于正交化的思路，把设立目标定为第一步，然后瞄准和射击目标是独立的第二步，换种说法我鼓励你们，将定义指标看成一步然后在定义了指标之后，你才能想如何优化系统来提高这个指标评分，比如改变你神经网络要优化的成本函数 J。

Before going on, let’s look at just one more example.Let’s say that your two cat classifiers A and B have, respectively,3 percent error and 5 percent error as evaluated on your dev set.Or maybe even on your test set which are images downloaded off the internet,so high quality well framed images.But maybe when you deploy your algorithm product,you find that algorithm B actually looks like it’s performing better,even though it’s doing better on your dev set.And you find that you’ve been training off very nice high quality images downloaded off the Internet but when you deploy those on the mobile app,users are uploading all sorts of pictures, they’re much less framed,you haven’t only covered the cat, the cats have funny facial expressions,maybe images are much blurrier,and when you test out your algorithms you find that Algorithm B is actually doing better.So this would be another example of your metric and dev test sets falling down.The problem is that you’re evaluating onthe dev and test sets a very nice, high resolution,well-framed images but what your users really care about is you have them doing well on images they are uploading,which are maybe less professional shots and blurrier and less well framed.

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

在继续之前我们再讲一个例子，假设你的两个猫分类器 A 和 B 分别有，用开发集评估得到 3% 的错误率和 5% 的错误率，或者甚至用在网上下载的图片构成的测试集上，这些是高质量取景框很专业的图像，但也许你在部署算法产品时，你发现算法 B 看起来表现更好，即使它在开发集上表现不错，你发现你一直在用，从网上下载的高质量图片训练，但当你部署到手机应用时，算法作用到用户上传的图片时那些图片取景不专业，没有把猫完整拍下来或者猫的表情很古怪，也许图像很模糊，当你实际测试算法时你发现算法 B 表现其实更好，这是另一个指标和开发集测试集出问题的例子，问题在于你做评估用的是，很漂亮的高分辨率的开发集和测试集，图片取景很专业但你的用户，真正关心的是他们上传的图片能不能被正确识别，那些图片可能是没那么专业的照片有点模糊取景很业余。

So the guideline is,if doing well on your metric and your current dev sets or dev and test sets’ distribution,if that does not correspond to doing well on the application you actually care about,then change your metric and your dev test set.In other words, if we discover that your dev test set has these very high quality images but evaluating on this dev test set is not predictive of how well your app actually performs,because your app needs to deal with lower quality images,then that’s a good time to change your dev test setso that your data better reflects the type of data you actually need to do well on.But the overall guideline is if your current metric and data you are evaluating on doesn’t correspond to doing well on what you actually care about,then change your metrics and/or your dev/test setto better capture what you need your algorithm to actually do well on.Having an evaluation metric and the dev set allows you tomuch more quickly make decisions about, Is Algorithm A or Algorithm B better.It really speeds up how quickly you and your team can iterate.

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

所以方针是，如果你在指标上表现很好，在当前开发集或者开发集和测试集分布中表现很好，但你的实际应用程序你真正关注的地方表现不好，那么就需要修改指标或者你的开发测试集，换句话说如果你发现你的开发测试集都是这些高质量图像，但在开发测试集上做的评估，无法预测你的应用实际的表现，因为你的应用处理的是低质量图像，那么就应该改变你的开发测试集，让你的数据更能反映你实际需要处理好的数据，但总体方针就是如果你当前的指标和当前用来评估的数据，和你真正关心必须做好的事情关系不大，那就应该更改你的指标或者你的开发测试集，让它们能更够好地反映你的算法需要处理好的数据，有一个评估指标和开发集让你可以，更快做出决策判断算法 A 还是算法 B 更优，这真的可以加速你和你的团队迭代的速度。

So my recommendation is,even if you can’t define the perfect evaluation metric and dev set,just set something up quickly and use that to drive the speed of your team iterating.And if later down the line you find out that it wasn’t a good one,you have better idea, change it at that time, it’s perfectly okay.But what I recommend against for the most teams is to run for too long without any evaluation metric and dev set because that can slow down the efficiency of what your team can iterate and improve your algorithm.So that is it on when to change your evaluation metric and/or dev and test sets.I hope that these guidelines help you set up your whole team to havea well-defined target that you can iterate efficiently towards improving performance on.

所以我的建议是，即使你无法定义出一个很完美的评估指标和开发集，你直接快速设立出来然后使用它们来驱动你们团队的迭代速度，如果在这之后 你发现选的不好，你有更好的想法那么完全可以马上改，对于大多数团队我建议最好不要，在没有评估指标和开发集时跑太久，因为那样可能会减慢，你的团队迭代和改善算法的速度，本视频讲的是什么时候需要改变你的评估指标和开发测试集，我希望这些方针能让你的整个团队设立一个，明确的目标一个你们可以高效迭代改善性能的目标。

重点总结：

改变开发、测试集和评估指标

在针对某一问题我们设置开发集和评估指标后，这就像把目标定在某个位置，后面的过程就聚焦在该位置上。但有时候在这个项目的过程中，可能会发现目标的位置设置错了，所以要移动改变我们的目标。

example1

假设有两个猫的图片的分类器：

评估指标：分类错误率
算法A：3%错误率
算法B：5%错误率

这样来看，算法 A 的表现更好。但是在实际的测试中，算法A可能因为某些原因，将很多色情图片分类成了猫。所以当我们在线上部署的时候，算法A会给爱猫人士推送更多更准确的猫的图片（因为其误差率只有 3%），但同时也会给用户推送一些色情图片，这是不能忍受的。所以，虽然算法A的错误率很低，但是它却不是一个好的算法。

这个时候我们就需要改变开发集、测试集或者评估指标。

假设开始我们的评估指标如下：

Error=1mdev∑i=1mdevI{y(i)pred≠y(i)}

其中：

w(i)={1如果x(i)不是色情图片10或100如果x(i)是色情图片

这样通过设置权重，当算法将色情图片分类为猫时，误差项会快速变大。

总结来说就是：如果评估指标无法正确评估算法的排名，则需要重新定义一个新的评估指标。

example2

同样针对 example1 中的两个不同的猫图片的分类器 A 和B。

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

但实际情况是对，我们一直使用的是网上下载的高质量的图片进行训练；而当部署到手机上时，由于图片的清晰度及拍照水平的原因，当实际测试算法时，会发现算法B的表现其实更好。

如果在训练开发测试的过程中得到的模型效果比较好，但是在实际应用中自己所真正关心的问题效果却不好的时候，就需要改变开发、测试集或者评估指标。

Guideline：

快速定义正确的评估指标来更好的给分类器的好坏进行排序；
优化评估指标，快速迭代。

参考文献：

[1]. 大树先生.吴恩达Coursera深度学习课程 DeepLearning.ai 提炼笔记（3-1）– 机器学习策略（1）

PS: 欢迎扫码关注公众号：「SelfImprovementLab」！专注「深度学习」，「机器学习」，「人工智能」。以及「早起」，「阅读」，「运动」，「英语」「其他」不定期建群打卡互助活动。

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

Coursera | Andrew Ng (03-week1-1.7)—什么时候该改变开发_测试集和指标

重点总结：

改变开发、测试集和评估指标

相关推荐