Coursera | Andrew Ng (03-week2-2.1): Carrying out error analysis
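This series adds personal study notes and supplementary derivations to selected points of the original course. Corrections are welcome. After studying Andrew Ng's course, I organized it into text to make later review easier. Because I am also learning English, the series is mainly in English, and I suggest that readers likewise treat the English as primary and the Chinese as a supplement, as preparation for reading academic papers in the field later on. - ZJ
Please credit the author and source when reposting: ZJ, WeChat official account "SelfImprovementLab"
Zhihu: https://zhuanlan.zhihu.com/c_147249273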
****:http://blog.****.net/junjun_zhao/article/details/79167392
2.1 Carrying out error analysis
(Subtitle source: NetEase Cloud Classroom)
Hello, and welcome back. If you're trying to get a learning algorithm to do a task that humans can do, and if your learning algorithm is not yet at the performance of a human, then manually examining mistakes that your algorithm is making can give you insights into what to do next. This process is called error analysis.
Let's start with an example. Let's say you're working on your cat classifier, and you've achieved 90% accuracy, or equivalently 10% error, on your dev set. And let's say this is much worse than you're hoping to do. Maybe one of your teammates looks at some of the examples that the algorithm is misclassifying, and notices that it is miscategorizing some dogs as cats. And if you look at these two dogs, maybe they look a little bit like a cat, at least at first glance. So maybe your teammate comes to you with a proposal for how to make the algorithm do better, specifically on dogs. You can imagine building a focused effort, maybe to collect more dog pictures, or maybe to design features specific to dogs, in order to make your cat classifier do better on dogs, so it stops misrecognizing these dogs as cats.
So the question is, should you go ahead and start a project focused on the dog problem? There could be several months of work you could do in order to make your algorithm make fewer mistakes on dog pictures. So is that worth your effort? Well, rather than spending a few months doing this, only to risk finding out at the end that it wasn't that helpful, here's an error analysis procedure that can let you very quickly tell whether or not this could be worth your effort.
Here's what I recommend you do. First, get about, say, 100 mislabeled dev set examples, then examine them manually. Just count them up one at a time, to see how many of these mislabeled examples in your dev set are actually pictures of dogs.
Now, suppose that it turns out that 5% of your 100 mislabeled dev set examples are pictures of dogs. So, that is, if 5 out of 100 of these mislabeled dev set examples are dogs, what this means is that, of a typical set of 100 examples you're getting wrong, even if you completely solve the dog problem, you will only get 5 out of 100 more correct. Or in other words, if only 5% of your errors are dog pictures, then the best you could easily hope to do, if you spend a lot of time on the dog problem, is that your error might go down from 10% error to 9.5% error. So this is a 5% relative decrease in error, from 10% down to 9.5%.
And so you might reasonably decide that this is not the best use of your time. Or maybe it is, but at least this gives you a ceiling, an upper bound on how much you could improve performance by working on the dog problem. In machine learning, sometimes we call this the ceiling on performance, which just means: in the best case, how well could working on the dog problem help you?
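The ceiling arithmetic above can be written out as a short worked equation. The symbols are assumptions introduced here for illustration, not notation from the lecture: $e$ is the current dev set error, $p$ is the fraction of the examined errors that are dog pictures, and $e_{\text{best}}$ is the error if every dog mistake were fixed. It also assumes the 100 examined errors are representative of all dev set errors.

```latex
e_{\text{best}} = e\,(1 - p) = 0.10 \times (1 - 0.05) = 0.095,
\qquad
\frac{e - e_{\text{best}}}{e} = p = 5\%.
```

The second expression is the relative decrease in error: it equals the share of errors in the category, which is why counting that share is enough to bound the benefit.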
But now, suppose something else happens. Suppose that when we look at your 100 mislabeled dev set examples, you find that 50 of them are actually dog images. So 50% of them are dog pictures. Now you could be much more optimistic about spending time on the dog problem. In this case, if you actually solve the dog problem, your error would go down from this 10% to potentially 5% error. And you might decide that halving your error could be worth a lot of effort focused on reducing the problem of mislabeled dogs.
I know that in machine learning, sometimes we speak disparagingly of hand engineering things, or using too much manual insight. But if you're building applied systems, then this simple counting procedure, error analysis, can save you a lot of time in terms of deciding what's the most important, or what's the most promising, direction to focus on.
In fact, if you're looking at 100 mislabeled dev set examples, maybe this is a 5 to 10 minute effort to manually go through 100 examples and manually count up how many of them are dogs. And depending on the outcome, whether it's more like 5%, or 50%, or something else, this, in just 5 to 10 minutes, gives you an estimate of how worthwhile this direction is. And it could help you make a much better decision on whether or not to spend the next few months focused on trying to find solutions to the problem of mislabeled dogs.
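The two scenarios (5 dogs versus 50 dogs out of 100 examined errors) can be compared with a few lines of Python. This is only a sketch of the counting argument: the function name `best_case_error` and its parameters are hypothetical, and it assumes the examined errors are a representative sample of all dev set errors.

```python
def best_case_error(current_error, category_count, examined_count):
    """Best-case error rate if every mistake in one category were fixed.

    current_error:  overall dev set error rate, e.g. 0.10 for 10%
    category_count: examined errors that belong to the category (e.g. dogs)
    examined_count: total number of errors examined by hand
    """
    if examined_count <= 0:
        raise ValueError("examined_count must be positive")
    if not 0 <= category_count <= examined_count:
        raise ValueError("category_count must be between 0 and examined_count")
    fraction = category_count / examined_count
    return current_error * (1 - fraction)


# Scenario 1: only 5 of the 100 examined errors are dogs.
few_dogs = best_case_error(0.10, 5, 100)
# Scenario 2: 50 of the 100 examined errors are dogs.
many_dogs = best_case_error(0.10, 50, 100)

print(f"5 dogs in 100 errors:  best-case error {few_dogs:.3f}")
print(f"50 dogs in 100 errors: best-case error {many_dogs:.3f}")
```

The result is a ceiling, not a forecast: real work on the dog problem will usually fix only part of the category.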
In this slide, we'll describe using error analysis to evaluate whether or not a single idea, dogs in this case, is worth working on. Sometimes you can also evaluate multiple ideas in parallel when doing error analysis. For example, let's say you have several ideas for improving your cat detector. Maybe you can improve performance on dogs? Or maybe you notice that sometimes what are called great cats, such as lions, panthers, cheetahs, and so on, are being recognized as small cats, or house cats. So you could maybe find a way to work on that. Or maybe you find that some of your images are blurry, and it would be nice if you could design something that just works better on blurry images. And maybe you have some ideas on how to do that.
So if I were carrying out error analysis to evaluate these three ideas, what I would do is create a table like this. And I usually do this in a spreadsheet, but using an ordinary text file will also be okay. On the left side, the rows go through the set of images you plan to look at manually. So this maybe goes from 1 to 100, if you look at 100 pictures. And the columns of this table, of the spreadsheet, will correspond to the ideas you're evaluating: the dog problem, the problem of great cats, and blurry images. And I usually also leave space in the spreadsheet to write comments.
So remember, during error analysis, you're just looking at dev set examples that your algorithm has misrecognized. If you find that the first misrecognized image is a picture of a dog, then I'd put a check mark there. And to help myself remember these images, sometimes I'll make a note in the comments. So maybe that was a pit bull picture. If the second picture was blurry, then make a note there. If the third one was a lion, on a rainy day, in the zoo, that was misrecognized, then that's both a great cat and a blurry image. Make a note in the comment section, rainy day at zoo, and that it was the rain that made it blurry, and so on.
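The table described above can also be kept as a small data structure instead of a spreadsheet. The sketch below is one possible layout, not the lecture's own tooling: `ErrorRow`, `render_table`, and the column names are hypothetical, and the three rows are the three example images mentioned in the text.

```python
from dataclasses import dataclass, field

# Columns of the table: one per idea being evaluated.
CATEGORIES = ["dog", "great cat", "blurry"]


@dataclass
class ErrorRow:
    """One misrecognized dev set image, examined by hand."""
    image_id: int
    tags: set = field(default_factory=set)  # categories that get a check mark
    comment: str = ""                       # free-form note to remember the image


rows = [
    ErrorRow(1, {"dog"}, "pit bull"),
    ErrorRow(2, {"blurry"}),
    ErrorRow(3, {"great cat", "blurry"}, "rainy day at zoo"),
]


def render_table(rows, categories):
    """Render the rows as plain text, with 'x' standing in for a check mark."""
    header = ["image"] + categories + ["comments"]
    lines = [" | ".join(header)]
    for row in rows:
        marks = ["x" if c in row.tags else "" for c in categories]
        lines.append(" | ".join([str(row.image_id)] + marks + [row.comment]))
    return "\n".join(lines)


table_text = render_table(rows, CATEGORIES)
print(table_text)
```

Storing the check marks as a set per image lets one image count toward several categories at once, as with the lion photographed in the rain.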
Then finally, having gone through some set of images, I would count up what percentage of these errors were attributed to the dog, great cat, or blurry categories. So maybe 8% of these images you examine turn out to be dogs, and maybe 43% great cats, and 61% were blurry. So this just means going down each column, and counting up what percentage of images have a check mark in that column.
As you're part way through this process, sometimes you notice other categories of mistakes. So, for example, you might find that Instagram-style filters, those fancy image filters, are also messing up your classifier. In that case, it's actually okay, part way through the process, to add another column like that, for the multi-colored filters, the Instagram filters, and the Snapchat filters. And then go through and count up those as well, and figure out what percentage comes from that new error category.
The conclusion of this process gives you an estimate of how worthwhile it might be to work on each of these different categories of errors. For example, clearly in this example, a lot of the mistakes were made on blurry images, and quite a lot were made on great cat images. And so the outcome of this analysis is not that you must work on blurry images. This doesn't give you a rigid mathematical formula that tells you what to do, but it gives you a sense of the best options to pursue. It also tells you, for example, that no matter how much better you do on dog images, or on Instagram images, you at most improve performance by maybe 8%, or 12%, in these examples. Whereas if you can do better on great cat images, or blurry images, the potential improvement, the ceiling in terms of how much you could improve performance, is much higher.
So depending on how many ideas you have for improving performance on great cats or on blurry images, maybe you could pick one of the two, or if you have enough personnel on your team, maybe you can have two different teams: have one work on improving errors on great cats, and a different team work on improving errors on blurry images.
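The column-by-column counting can be sketched as follows. The data here is a made-up toy sample of 10 examined images, not the lecture's numbers, and the names `category_percentages` and `ranked` are hypothetical. Because categories are free-form tags, a category noticed part way through (such as "instagram filter") needs no special handling.

```python
from collections import Counter


def category_percentages(tagged_errors):
    """Percentage of examined images carrying each tag.

    tagged_errors: list of sets, one set of category names per examined image.
    Percentages can sum to more than 100 because an image may have several tags.
    """
    total = len(tagged_errors)
    if total == 0:
        return {}
    counts = Counter(tag for tags in tagged_errors for tag in tags)
    return {tag: 100.0 * n / total for tag, n in counts.items()}


# Hypothetical toy data: 10 examined images.
tagged = [
    {"dog"},
    {"blurry"},
    {"great cat", "blurry"},
    {"great cat"},
    {"blurry", "instagram filter"},  # new category added part way through
    {"blurry"},
    {"great cat", "blurry"},
    {"great cat"},
    {"blurry"},
    set(),                           # fits none of the categories so far
]

percentages = category_percentages(tagged)
# Largest share first; ties broken alphabetically for a stable order.
ranked = sorted(percentages.items(), key=lambda item: (-item[1], item[0]))

for tag, pct in ranked:
    print(f"{tag:<17}{pct:5.1f}%")
```

The ranking is only a guide to where the ceiling is highest; how many workable ideas you have for each category still matters when choosing what to do.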
But this quick counting procedure, which you can often do in at most a small number of hours, can really help you make much better prioritization decisions, and understand how promising different approaches are to work on.
So to summarize, to carry out error analysis, you should find a set of mislabeled examples in your dev (development) set, and look at the mislabeled examples for false positives and false negatives. And just count up the number of errors that fall into various different categories. During this process, you might be inspired to generate new categories of errors, like we saw. If you're looking through the examples and you say, gee, there are a lot of Instagram filters, or Snapchat filters, and they're also messing up my classifier, you can create new categories during that process. But by counting up the fraction of examples that are mislabeled in different ways, often this will help you prioritize, or give you inspiration for new directions to go in.
Now as you're doing error analysis, sometimes you notice that some of the examples in your dev set are mislabeled. So what do you do about that? Let's discuss that in the next video.
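The first step of the summary, gathering the misrecognized dev set examples and separating false positives from false negatives, could look roughly like this. It is a sketch under assumptions: labels are encoded as 1 for "cat" and 0 for "not cat", and `collect_errors`, its parameters, and the toy data are all hypothetical.

```python
import random


def collect_errors(labels, predictions, sample_size=100, seed=0):
    """Return (sample, false_positives, false_negatives) as lists of indices.

    labels, predictions: sequences of 0/1 values (1 = cat, an assumed encoding).
    sample: up to sample_size error indices, chosen at random for manual review.
    """
    pairs = list(enumerate(zip(labels, predictions)))
    # Predicted cat, but the image is not a cat.
    false_positives = [i for i, (y, p) in pairs if y == 0 and p == 1]
    # Predicted not-cat, but the image is a cat.
    false_negatives = [i for i, (y, p) in pairs if y == 1 and p == 0]
    errors = false_positives + false_negatives
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    sample = sorted(rng.sample(errors, min(sample_size, len(errors))))
    return sample, false_positives, false_negatives


# Hypothetical toy dev set of six images.
labels = [1, 0, 1, 0, 1, 0]
predictions = [1, 1, 0, 0, 1, 1]

sample, fps, fns = collect_errors(labels, predictions)
print("to review:", sample)
print("false positives:", fps)
print("false negatives:", fns)
```

Sampling at random keeps the examined errors representative, so the percentages counted from them can stand in for the whole dev set.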
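Key points:
1. Error analysis
Suppose we train a model, such as a cat classifier, and end up with 90% accuracy, that is, a 10% error rate. To improve accuracy further, we need to adjust some part of the model.
If we start that work without any analysis, months of effort may do nothing for accuracy. A good error analysis procedure is therefore important.
Collecting error examples:
From the dev set, take about 100 misclassified examples and count how many of them are dogs.
- Suppose 5 of the 100 examples are dogs. Even if we solve the dog problem completely, at most 5% of the errors disappear, so the error rate falls from 10% to 9.5% at best. This bound is sometimes called the ceiling on performance. In this case such a time-consuming effort is probably not worthwhile.
- Suppose instead that 50 of the 100 examples are dogs. Then working on the dog problem is a promising direction: it could cut the error rate from 10% to 5%, raising accuracy to as much as 95%.
Parallel analysis:
- fix dog pictures that are classified as cats;
- fix misclassified great cats, such as lions and panthers;
- improve performance on blurry images.
To evaluate these ideas in parallel, build a table. Treat each misclassified example as one row and record the reasons it was misclassified.
Finally, compute the percentage of each error type. This gives a rough estimate of whether each error type is worth working on.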
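Personal understanding:
0. Error analysis means finding the errors and then analyzing them.
1. I want my algorithm to carry out tasks that humans can carry out.
2. But the algorithm performs poorly and does not reach human-level performance.
3. So where is the problem? Analyze the misclassified examples, see what went wrong, and address it accordingly.
4. How do we carry out error analysis? Build a spreadsheet by hand; overall it saves time.
5. List the examples, the error categories, comments, and so on. Go through the examples one by one, tally and categorize them, and finally compute each category's share to see which problems account for the most errors.
6. Try not to spend time on categories with a small share. Categories with a large share are the ones to solve in order to improve the algorithm's performance. How do we solve them?
7. Apply a fix targeted at each specific problem.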