Machine Learning Yearning, Chapter 6

Your dev and test sets should come from the same distribution

Suppose your cat app's image data is segmented into four regions based on your largest markets: (i) US, (ii) China, (iii) India, and (iv) Other. To come up with a dev set and a test set, you might randomly assign two of these segments to the dev set and the other two to the test set, say US and India in the dev set, and China and Other in the test set. Is that acceptable? No: it is a serious mistake.

Once you define the dev and test sets, your team will focus on improving dev set performance. The dev set should therefore reflect the task you most want to improve on: doing well on all four geographies, not only two.
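One way to get dev and test sets that both cover all four geographies is to split within each region rather than assigning whole regions to one set. The following is a minimal sketch, not a prescribed procedure: the function name `split_dev_test`, the 50/50 split, and the file names are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def split_dev_test(examples, dev_fraction=0.5, seed=0):
    """Split (region, item) pairs so dev and test share the same region mix."""
    rng = random.Random(seed)

    # Group the examples by region.
    by_region = defaultdict(list)
    for region, item in examples:
        by_region[region].append((region, item))

    dev, test = [], []
    for region in sorted(by_region):
        group = by_region[region]
        rng.shuffle(group)  # random assignment within the region
        cut = int(len(group) * dev_fraction)
        dev.extend(group[:cut])
        test.extend(group[cut:])
    return dev, test


# Hypothetical data: 100 image file names per region.
regions = ["US", "China", "India", "Other"]
examples = [(r, f"{r}_{i:03d}.jpg") for r in regions for i in range(100)]

dev_set, test_set = split_dev_test(examples)
```

Because every region is divided in the same proportion, both sets are drawn from the same mix of geographies, and progress on the dev set is measured against all four markets.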

There is a second problem with having different dev and test set distributions: your team may build something that works well on the dev set, only to find that it does poorly on the test set. I've seen this result in much frustration and wasted effort. Avoid letting this happen to you.

As an example, suppose your team develops a system that works well on the dev set but not on the test set. If your dev and test sets come from the same distribution, then you have a very clear diagnosis of what went wrong: you have overfit the dev set. The obvious cure is to get more dev set data.
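The effect of overfitting the dev set, and why more dev data helps, can be illustrated with a toy simulation. This is only a sketch under strong assumptions: every "model" is a coin flipper with a true accuracy of 0.5, and the function name `best_on_dev` and all sizes are hypothetical. Picking the best of many such models on a small dev set yields an inflated dev score that does not carry over to a test set drawn from the same distribution.

```python
import random


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def best_on_dev(n_models, n_dev, n_test, seed=0):
    """Select the model with the best dev accuracy.

    Returns (dev_accuracy, test_accuracy) for the selected model.
    Every model guesses at random, so its true accuracy is 0.5.
    """
    rng = random.Random(seed)
    dev_labels = [rng.getrandbits(1) for _ in range(n_dev)]
    test_labels = [rng.getrandbits(1) for _ in range(n_test)]

    best = None
    for _ in range(n_models):
        dev_preds = [rng.getrandbits(1) for _ in range(n_dev)]
        test_preds = [rng.getrandbits(1) for _ in range(n_test)]
        result = (accuracy(dev_preds, dev_labels),
                  accuracy(test_preds, test_labels))
        if best is None or result[0] > best[0]:
            best = result
    return best


small_dev = best_on_dev(n_models=200, n_dev=50, n_test=2000)
large_dev = best_on_dev(n_models=200, n_dev=2000, n_test=2000)

print("small dev set: dev=%.3f test=%.3f" % small_dev)
print("large dev set: dev=%.3f test=%.3f" % large_dev)
```

With the larger dev set, the gap between the selected model's dev and test accuracy shrinks, which matches the diagnosis above: when both sets share a distribution, a dev/test gap points to dev set overfitting, and more dev data reduces it.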

But if the dev and test sets come from different distributions, then the diagnosis is less clear. Several things could have gone wrong:

  1. You have overfit the dev set.
  2. The test set is harder than the dev set. Your algorithm might be doing as well as could be expected, and no further significant improvement is possible.
  3. The test set is not necessarily harder, but just different, from the dev set. What works well on the dev set simply does not work well on the test set. In this case, a lot of your work to improve dev set performance might be wasted effort.

Working on machine learning applications is hard enough. Mismatched dev and test sets introduce additional uncertainty about whether improving on the dev set distribution also improves test set performance. They also make it harder to figure out what is and isn't working, and thus harder to prioritize what to work on.

If you are working on a third-party benchmark problem, its creator might have specified dev and test sets that come from different distributions. On such benchmarks, luck rather than skill will have a greater impact on your performance than it would if the dev and test sets came from the same distribution. Developing learning algorithms that are trained on one distribution and generalize well to another is an important research problem. But if your goal is to make progress on a specific machine learning application rather than to make research progress, I recommend choosing dev and test sets that are drawn from the same distribution. This will make your team more efficient.