Course3 - machine learning strategy 2

1 - carrying out error analysis

If the learning algorithm has not yet reached human-level performance, then manually examining the mistakes the algorithm is making can give us insight into what to do next. This process is called error analysis.

Let’s say we are working on a cat classifier and have achieved 90% accuracy on the dev set. Looking at some examples the algorithm is misclassifying, we notice that it miscategorizes some dogs as cats. Now the question is: should we go ahead and start a project focused on the dog problem in order to make fewer mistakes on dog images? Is that worth our effort? Error analysis can very quickly tell us whether or not it is worth the effort.

Get about 100 misclassified dev set examples and examine them manually to see how many are actually dogs. Suppose it turns out that 5% of the 100 misclassified dev set examples are dogs. That means that if we completely solved the dog problem, the error might, hopefully, go down from 10% to 9.5%. Error analysis gives us a ceiling on how much we could improve performance by working on the dog problem. Now suppose something else happens: if we find that 50 of the 100 misclassified images are actually dogs, then spending time on the dog problem could be a promising choice; in the best case, the error would go down from 10% to 5%.

So this 5 to 10 minutes of effort spent manually counting how many of the images are dogs gives us, depending on the outcome, an estimate of how worthwhile this direction is.
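A quick sketch of that ceiling arithmetic, using the hypothetical counts from the example above:

```python
dev_error  = 0.10    # current dev set error (90% accuracy)
n_examined = 100     # misclassified dev examples inspected by hand
n_dogs     = 5       # of those, how many were actually dogs

error_ceiling = dev_error * (1 - n_dogs / n_examined)
print(f"best-case dev error if the dog problem were fully fixed: {error_ceiling:.1%}")
# 5/100 -> 9.5%; with 50/100 the ceiling would instead be 5.0%
```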

Evaluate multiple ideas in parallel:

  • dogs being recognized as cats
  • great cats (lions, panthers, etc.) being misrecognized
  • blurry images

Create a table like the following and go through the dev set examples the algorithm has misrecognized. Finally, having gone through all 100 misclassified images, count up what percentage of errors each column accounts for.


(Table: one row per misclassified dev set example, with columns for dog, great cat, blurry, and comments, plus a row of column totals in %.)

This table gives us a sense of the best options to pursue. For example, it can tell us that no matter how much better we do on the dog images, we can at most improve performance by 8%, whereas if we do better on the blurry and great cat images, the potential improvement is much higher.
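A minimal sketch of how such a tally might be kept during manual review; the tag sets below are hypothetical, and one example can fall into more than one column:

```python
from collections import Counter

# One entry per examined misclassified dev example, filled in by hand.
tags_per_example = [
    {"dog"}, {"blurry"}, {"great cat", "blurry"}, {"blurry"}, {"great cat"},
    # ... continue for all 100 examined examples
]

counts = Counter(tag for tags in tags_per_example for tag in tags)
n = len(tags_per_example)
for category, count in counts.most_common():
    print(f"{category:>9}: {count}/{n} = {count / n:.0%} of examined errors")
```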

To summarize, counting up the fraction of examples that are misclassified in different ways often helps us prioritize which directions to pursue.

2 - cleaning up incorrectly labeled data

Sometimes the dataset contains examples whose labels are wrong; we call these incorrectly labeled examples.

It turns out that deep learning algorithms are quite robust to random errors (i.e., incorrectly labeled examples) in the training set, so it is probably okay to leave random errors, though not systematic errors, as they are and not spend too much time fixing them, so long as the total dataset is big enough. A systematic error would be, for example, a labeler consistently labeling white dogs as cats.

How about incorrectly labeled examples in the dev set and test set? During error analysis, add an extra column so that we can also count the number of examples where the label was incorrect, and compute the fraction of errors due to incorrect labels.



In this setting, there are three numbers to look at when deciding whether it is worth going in and reducing the number of mislabeled examples:

  • overall dev set error
  • percentage of errors due to incorrect labels
  • percentage of errors due to other causes

For example, suppose the three numbers are 10%, 0.6% (10% × 6%), and 9.4% (10% − 0.6%), respectively. Incorrect labels account for only 6% of the overall errors, a relatively small fraction, so spending time fixing the incorrectly labeled examples is probably not the most important thing to do.

Take another example: the three numbers are 2%, 0.6%, and 1.4%. Now a high fraction, 30%, of the mistakes on the dev set are due to incorrect labels, so it may be much more worthwhile to fix the incorrect labels in the dev set. The main purpose of the dev set is to help select between classifiers A and B: if one has 2.1% error and the other 1.9% error on the dev set, we can no longer trust the dev set to tell us correctly whether the latter is actually better than the former, because 30% of the mistakes are due to incorrect labels. This is a good reason to fix the incorrectly labeled examples in the dev set.
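A tiny sketch of the arithmetic behind these two scenarios:

```python
def label_error_breakdown(overall_dev_error, fraction_from_incorrect_labels):
    """Split the overall dev error into 'incorrect label' and 'other causes' parts."""
    from_labels = overall_dev_error * fraction_from_incorrect_labels
    from_other  = overall_dev_error - from_labels
    return round(from_labels, 4), round(from_other, 4)

print(label_error_breakdown(0.10, 0.06))   # (0.006, 0.094) -> 0.6% and 9.4%
print(label_error_breakdown(0.02, 0.30))   # (0.006, 0.014) -> 0.6% and 1.4%
```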

If we decide to go in and fix up incorrectly labeled dev set examples, a couple of guidelines:

  • apply the same process to both the dev and test sets to make sure they continue to come from the same distribution.
  • the training data and the dev/test data may end up coming from slightly different distributions, and that is okay.

3 - building your first system quickly, then iterate

If we are thinking of building a new speech recognition system, there are many directions we could go in and many different techniques we could try, for example to make the transcript read more fluently. There are many things we could do to improve the speech recognition system, such as:

  • noisy background
  • accented speech
  • far from microphone
  • young children’s speech
  • stuttering

For almost any machine learning application, there could be 50 directions we could go in, and each of them is reasonable and would make the system better; the challenge is how to pick which ones to focus on.

If you are starting to build a brand new machine learning application, the best advice is to build your first system quickly and then iterate:

  • set up a dev/test set and a metric; this is really deciding where to place the target
  • build an initial system quickly
  • use bias/variance analysis and error analysis to prioritize the next steps; for example, if error analysis causes us to realize that a lot of the errors come from speakers being very far from the microphone, that gives us a good reason to focus on techniques to address this problem.

Much of the value of the initial system is that having a trained system lets us analyze bias and variance to prioritize what to do next, and lets us do error analysis to figure out, among all the directions we could go in, which ones are actually the most promising.

4 - training and testing on different distributions

Deep learning algorithms have a huge hunger for training data; they often work best when we can get enough labeled training data into the training set. So more and more people are now training on data that comes from a different distribution than the dev and test sets. There are some best practices for dealing with the case where the training distribution differs from the dev/test distribution.



Let’s say we are building a mobile app where users upload pictures taken with their cell phones, and we want to recognize whether each picture is a cat or not. We now have two sources of data. One is the distribution we really care about: images from mobile users. The other is images crawled from web pages. Maybe we have 10,000 pictures from users and 200,000 cat pictures downloaded off the Internet. What we really care about is that the final system does well on the mobile app distribution of images.

We don’t want to use just the 10,000 pictures, because that ends up giving us a relatively small training set. Using the 200,000 images seems helpful, but those images are not from the distribution we want.



  • option 1:

    One thing we can do is put both datasets together, giving 210,000 images, and randomly shuffle them into a train, dev, and test set with 205,000, 2,500, and 2,500 images, respectively. The advantage is that all the train, dev, and test data come from the same distribution. The huge disadvantage is that in a dev set of 2,500 examples, most come from web pages; on average only about 119 (2,500 × 10,000 / 210,000) will come from mobile app uploads, so we are now aiming at a target that is not really what we want.

  • option 2:

    The better practice is to have the training set include all 200,000 web images plus 5,000 of the user-uploaded images, and make the dev and test sets entirely mobile app images (2,500 each). The advantage of splitting the data this way is that we are now aiming the target where we want it to be: we are building a machine learning system that does really well on the mobile app distribution of images. The disadvantage is that the training distribution is now different from the dev/test distribution; a sketch of this split follows below.
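A minimal sketch of option 2's split, using hypothetical file names to stand in for the actual images:

```python
import random

random.seed(0)
web_images    = [f"web_{i}.jpg" for i in range(200_000)]     # crawled from the web
mobile_images = [f"mobile_{i}.jpg" for i in range(10_000)]   # the distribution we care about

random.shuffle(mobile_images)
dev_set   = mobile_images[:2_500]                # dev and test come only from
test_set  = mobile_images[2_500:5_000]           # the mobile-app distribution
train_set = web_images + mobile_images[5_000:]   # 200,000 web + 5,000 mobile

print(len(train_set), len(dev_set), len(test_set))   # 205000 2500 2500
# Under option 1, a randomly shuffled 2,500-example dev set would contain only
# about 2_500 * 10_000 / 210_000 ≈ 119 mobile images on average.
```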

Speech activated rearview mirror example:

How can we get data to train a speech recognition system for a speech-activated rearview mirror? For the training set, the data we have may include purchased data and data from smart speaker control or voice keyboards; the total size is big, maybe 500,000 examples. For the dev/test set, we may have a much smaller dataset that actually came from the speech-activated rearview mirror. So the training set distribution and the dev/test set distribution will be very different.

We have now seen a couple of examples where allowing the training set to come from a different distribution than the dev/test set lets us use much more training data, which can make the learning system perform better.

5 - bias and variance with mismatched data distributions

Estimating bias and variance really helps us prioritize what to work on next. But the way we analyze bias and variance changes when the training set comes from a different distribution than the dev/test set.

Keeping the cat classification example, assume humans get approximately 0% error, so Bayes error is nearly 0%.

Training error = 1%

Dev error = 10%

If the dev set came from the same distribution as the training set, we would say we have a large variance problem: the algorithm is just not generalizing well from the training set to the dev set. But in the setting where the training set and dev set come from different distributions, we can no longer safely draw this conclusion.

The problem with this bias/variance analysis is that when we go from the training error to the dev error, two things change at once. One, the algorithm has seen the data in the training set but not the data in the dev set. Two, the distribution of the dev set data is different.

It is difficult to know how much of this 9% increase in error is because the algorithm did not see the data in the dev set (the variance part), and how much is because the dev set data are simply different, for example much harder or much easier.

In order to tease apart these two effects, we define a new subset of data which we call the training-dev set.

training-dev set: same distribution as training set, but not used for training.
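A minimal sketch of carving out a training-dev set, assuming a hypothetical pool of 205,000 training-distribution examples and a 10,000-example carve-out (the exact sizes are not prescribed):

```python
import numpy as np

# Indices stand in for the actual training-distribution examples.
rng = np.random.default_rng(0)
indices = rng.permutation(205_000)

training_dev_idx = indices[:10_000]   # same distribution as training, never trained on
training_idx     = indices[10_000:]   # the data the model is actually fitted to
```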



So just as the dev and test sets have the same distribution, the training set and training-dev set also have the same distribution. To carry out the analysis, we look at the error on the training set, on the training-dev set, and on the dev set.

case 1:

  • Training error: 1%
  • Training-dev error: 9%
  • Dev error: 10%

The jump from the training error to the training-dev error tells us how much of a variance problem we have. Here we know that even though the neural network does well on the training set, it is not generalizing well to the training-dev set, which comes from the same distribution but has not been seen before.

case 2:

  • Training error: 1%
  • Training-dev error: 1.5%
  • Dev error: 10%

Now we actually have a pretty low variance problem: going from the training set the network has seen to the training-dev set it has not seen, the error increases only a little, but it really jumps when we go to the dev set, so this is a data mismatch problem. The learning algorithm was not trained on either the training-dev data or the dev data, yet it works great on the training-dev set and poorly on the dev set; the only difference between the two is that they come from different distributions.

case 3:

  • Human error: 0%
  • Training error: 10%
  • Training-dev error: 11%
  • Dev error: 12%

This is an avoidable bias problem, because the algorithm is doing much worse than human level.

case 4:

  • Human error: 0%
  • Training error: 10%
  • Training-dev error: 11%
  • Dev error: 20%

This is an avoidable bias problem as well as a data mismatch problem.

general principles:


  • gap between human-level error and training error: avoidable bias
  • gap between training error and training-dev error: variance
  • gap between training-dev error and dev error: data mismatch
  • gap between dev error and test error: degree of overfitting to the dev set

Depending on the differences between these errors, we can get a sense of how big the avoidable bias, the variance, and the data mismatch problems are. Notice that the gap between the dev error and the test error denotes the degree of overtuning/overfitting to the dev set; since the dev and test sets come from the same distribution, a large gap suggests we may need a bigger dev set.
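A minimal sketch of reading off the four gaps; the error numbers are made up (in the spirit of case 2, plus a hypothetical test error):

```python
errors = {
    "human (proxy for Bayes)": 0.000,
    "training":                0.010,
    "training-dev":            0.015,
    "dev":                     0.100,
    "test":                    0.105,
}

print("avoidable bias     :", errors["training"] - errors["human (proxy for Bayes)"])
print("variance           :", errors["training-dev"] - errors["training"])
print("data mismatch      :", errors["dev"] - errors["training-dev"])
print("overfitting to dev :", errors["test"] - errors["dev"])
```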

It is also possible for the dev/test errors to come out lower than the training error, for example in a speech recognition task where the training data turn out to be much harder than the dev and test data.



how do we address the data mismatch?

Training on data that comes from a different distribution than the dev and test sets lets us use more data and can really help the performance of the learning algorithm. But instead of just bias and variance problems, we now have a new potential problem: data mismatch. There are no very systematic ways to address data mismatch, but there are some things we can try.

6 - addressing data mismatch

If the training set comes from a different distribution than the dev and test sets, and error analysis shows that we have a data mismatch problem, what can we do? Let’s look at some things we can try.

  • Carry out manual error analysis to try to understand the differences between the training set and the dev set

If we are building a speech-activated rearview mirror application, we might listen to examples in the dev set and figure out how the dev set differs from the training set; we might find that a lot of dev set examples are very noisy, with a lot of car noise, or that the system often misrecognizes street numbers because many navigational queries contain them. What we do then is try to find ways to make the training data more similar to the dev set, for example by simulating noisy in-car data or by collecting more data of people speaking out numbers and adding it to the training set.

One of the techniques we can use is artificial data synthesis.

For example, we could take one hour of car noise and repeat it 10,000 times to overlay it onto 10,000 hours of speech recorded against a quiet background; however, there is a risk that the learning algorithm will overfit to that single hour of car noise.
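A minimal sketch of this kind of synthesis, using random arrays to stand in for real waveforms; sampling a fresh noise segment for each clip is one way to reduce the overfitting risk mentioned above:

```python
import numpy as np

def synthesize_noisy_clip(clean_clip, noise_bank, rng, noise_gain=0.3):
    """Overlay a randomly chosen, randomly offset slice of car noise on clean speech."""
    noise = noise_bank[rng.integers(len(noise_bank))]
    start = rng.integers(0, len(noise) - len(clean_clip))
    segment = noise[start:start + len(clean_clip)]
    return clean_clip + noise_gain * segment

# Toy data: pretend these arrays are audio waveforms sampled at 16 kHz.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16_000)                            # 1 s of "speech"
noise_bank = [rng.standard_normal(160_000) for _ in range(3)]  # several noise recordings
noisy = synthesize_noisy_clip(clean, noise_bank, rng)
```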



To summarize, if we think we have a data mismatch problem, the best practice is to do error analysis, looking at the training set and the dev set, to gain insight into how the two data distributions differ, and then to find ways to get more training data that looks more like the dev set. One way is artificial data synthesis, and this technique does boost the performance of many machine learning systems. But when using artificial data synthesis, be cautious and bear in mind that we might accidentally be simulating data from only a tiny subset of the space of all possible examples.

7 - Transfer learning

One of the most powerful ideas in deep learning is that sometimes we can take knowledge a neural network has learned from one task and apply it to a separate task. For example, you could have a neural network learn to recognize cats and then use that knowledge, or part of it, to help you do a better job reading x-ray scans. This is called transfer learning.

Let’s say we have trained a neural network on image recognition, and we want to adapt, or transfer, it to a different task such as radiology diagnosis. What we can do is remove the last output layer of the neural network, delete the weights feeding into it, create a set of randomly initialized weights just for a new output layer, and have that layer output the radiology diagnosis.



To be concrete, in the first phase of training we train on the image recognition task and the network learns to make image recognition predictions. Having trained the neural network, what we do next to implement transfer learning is take the new dataset (X, Y) = (radiology images, diagnoses), randomly initialize W[L], b[L], and retrain the neural network on the new radiology dataset.



how do we retrain the neural network?

  • if you have a small radiology dataset, you might want to retrain only the weights of the last layer, just W[L], b[L], and keep the rest of the parameters fixed;
  • if you have enough data, you can retrain all the layers.

So the rule of thumb is: if you have a small dataset, just retrain the last layer, or maybe the last one or two layers; if you have a lot of data, you can retrain all the parameters in the network. If we retrain all the parameters, the initial phase of training on image recognition is called pre-training, and updating all the weights afterwards by training on the radiology data is called fine-tuning.
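A minimal sketch of this rule of thumb in PyTorch, assuming a torchvision ResNet as the pre-trained network (the course does not prescribe a particular architecture or framework):

```python
import torch.nn as nn
from torchvision import models

# Start from a network pre-trained on a large image recognition dataset.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Swap the output layer: new, randomly initialized weights for the radiology task.
model.fc = nn.Linear(model.fc.in_features, 2)   # e.g. diagnosis yes/no

# Small radiology dataset: freeze everything except the new last layer.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

# With plenty of radiology data, leave all parameters trainable instead and
# fine-tune the whole network (the pre-trained weights act as initialization).
```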

What we have done this way is take knowledge learned from image recognition and transfer it to radiology diagnosis. The reason it can be helpful is that a lot of low-level features, such as detecting edges, curves, and so on, learned from the large image recognition dataset may help the algorithm do better at radiology diagnosis.

When does transfer learning make sense? Transfer learning makes sense when you have a lot of data for the problem you are transferring from and usually relatively little data for the problem you are transferring to. If you have 100,000 examples for the image recognition task, that is a lot of data for learning low-level features, or useful features in the earlier layers of the neural network, but for the radiology task maybe you have just 100 examples. So a lot of the knowledge learned from image recognition can be transferred and can really help you get going with radiology diagnosis even if you don't have much radiology data.

To summarize:

If you are trying to learn from task A and transfer some of the knowledge to task B, transfer learning makes sense when tasks A and B have the same input, for example both take images as input or both take audio clips as input. It tends to make sense when you have a lot more data for task A than for task B. Finally, transfer learning tends to make more sense if you suspect that low-level features from task A could be helpful for learning task B.

  • task A and B have the same input
  • have a lot more data for task A than for task B
  • low level features from task A could be helpful for learning task B

So transfer learning has been most useful when we are trying to do well on some task B for which we have relatively little data. In that case, you might find a related but different task A where you can get a lot of data and learn a lot of low-level features, so that you can then do well on task B.

8 - multi-task learning

In multi-task learning, we start off simultaneously, trying to have one neural network do several things at the same time, and hopefully each of these tasks helps all of the other tasks.

Let’s say we are building an autonomous vehicle, a self-driving car. The self-driving car would need to detect several different things, such as pedestrians, other cars, stop signs, and traffic lights.



So what we should do is train a neural network to predict the value of y: the network takes x as input and outputs a 4-dimensional vector. The first node of the output layer predicts whether there is a pedestrian, the second predicts whether there is a car, and so on. To train this neural network, we need to define its cost function:

$$\text{cost} = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{4}\mathcal{L}\left(\hat{y}_j^{(i)},\, y_j^{(i)}\right)$$

where L is the usual logistic loss:

$$\mathcal{L}\left(\hat{y}_j^{(i)},\, y_j^{(i)}\right) = -\,y_j^{(i)}\log \hat{y}_j^{(i)} - \left(1 - y_j^{(i)}\right)\log\left(1 - \hat{y}_j^{(i)}\right)$$

The main difference between this and softmax regression is that softmax assigns a single label to each example, whereas here one example can have multiple labels.

If we train a neural network to minimize this cost function, we are carrying out multi-task learning, because we are building a single network that solves four problems. One other thing we could have done is train four separate neural networks, one for each problem. But if some of the earlier features in the neural network can be shared between these different types of objects, then training one network to do four things often results in better performance than training four completely separate neural networks. That is the power of multi-task learning. Multi-task learning also works even if some images are labeled for only some of the objects, i.e., there are question marks in the label vector. With a dataset like this, we can still train the learning algorithm to do all four tasks at the same time: in practice, we only sum over the values of j with a 0 or 1 label, and simply omit the terms with question marks from the summation.
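A minimal sketch of such a masked loss in PyTorch, using -1 as a stand-in for the question marks (that encoding is an assumption; the lecture only says to omit those terms):

```python
import torch
import torch.nn.functional as F

def multitask_loss(logits, labels):
    """Sum of per-label logistic losses, skipping entries labeled as unknown (-1).

    logits, labels: shape (batch, 4) for pedestrian / car / stop sign / traffic light.
    A label of -1 plays the role of the question mark: that object was never
    annotated for that image, so its term is omitted from the sum.
    """
    mask = (labels >= 0).float()                      # 1 where a real 0/1 label exists
    per_label = F.binary_cross_entropy_with_logits(
        logits, labels.clamp(min=0).float(), reduction="none")
    return (per_label * mask).sum() / mask.sum().clamp(min=1)

# Toy batch of two images; the second image has two unlabeled entries.
logits = torch.randn(2, 4)
labels = torch.tensor([[1, 0, 1, 0],
                       [1, -1, -1, 0]])
print(multitask_loss(logits, labels))
```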

When does multi-task learning make sense? It usually makes sense when three things hold true:

  • training on a set of tasks that could benefit from having shared lower-level features.
  • usually, the amount of data you have for each task is quite similar.
    • recall that in transfer learning we learn from some task A and transfer it to task B. In multi-task learning, suppose we have 100 tasks with 1,000 examples each and we focus on the performance of the 100th task: doing the 100th task in isolation gives just 1,000 examples to train on, but training on the 99 other tasks contributes 99,000 additional examples, which can be a big boost of knowledge to augment the data for the 100th task.
  • you can train a big enough neural network to do well on all the tasks; the only time multi-task learning hurts performance, compared to training separate neural networks, is when the network is not big enough.

To summarize:
In practice, multi-task learning is used much less often than transfer learning. For transfer learning, when you have a problem you want to solve with a small amount of data, you can usually find a related problem with a lot of data, learn something from it, and transfer that to the new problem. Multi-task learning is rarer: you need a large set of tasks you want to do well on and to be able to train them all at the same time. Multi-task learning enables you to train one neural network to do many tasks, which can give better performance than doing those tasks in isolation. Again, one reason it is used less often than transfer learning is that it is simply difficult to set up or find that many different tasks you would want to train with a single neural network.

9 - what is end-to-end deep learning

Briefly, some data processing systems require multiple stages of processing. What end-to-end deep learning does is take all those stages and replace them with a single neural network.

Take speech recognition as an example, where the goal is to take an input x, such as an audio clip, and map it to an output y, the transcript of the audio clip. Traditionally, speech recognition required many stages of processing (e.g., extracting low-level features, finding phonemes, forming words, then producing the transcript). In contrast to this multi-stage pipeline, end-to-end deep learning trains one huge neural network that takes the audio clip as input and directly outputs the transcript.



One of the challenges of end-to-end deep learning is that you might need a lot of data before it works well. When we have only a small dataset, the more traditional pipeline approach often works better.

Face recognition:

To build a face recognition turnstile, what is actually done is not to feed the raw image into a neural network and try to figure out the person’s identity in one shot. Instead, the best approach to date is a multi-step one: first run a piece of software to detect where the person’s face is; having detected the face, zoom in on that part of the image and crop it so the face is centered; then feed this new picture into a neural network that estimates the person’s identity.
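A minimal sketch of the two-step pipeline; all three callables are hypothetical placeholders for separately built components, not real library calls:

```python
def recognize_person(image, face_detector, crop_and_center, identity_model):
    """Two-step pipeline: locate the face first, then identify the cropped face."""
    box = face_detector(image)           # step 1: where is the face?
    if box is None:
        return None                      # no face found in the image
    face = crop_and_center(image, box)   # zoom in so the face is centered
    return identity_model(face)          # step 2: whose face is it?
```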



Instead of trying to learn everything in one step, breaking the problem down into two simpler steps, first figuring out where the face is and second figuring out who this person actually is, results in better performance. Why does the two-step approach work better?

  • each of the two problems is much simpler
  • there is a lot of data for each of the two sub-tasks

Because we don’t have enough data to solve the end-to-end learning problem but do have enough data to solve each sub-problem, in practice breaking the task into two sub-problems results in better performance than a pure end-to-end deep learning approach.

For some tasks, such as machine translation, end-to-end learning performs better; for other problems it does not work as well.



10 - whether to use end-to-end deep learning

Say you are building a learning system and trying to decide whether or not to use an end-to-end approach. Let’s take a look at the pros and cons of end-to-end learning.

pros and cons of end-to-end deep learning:

End-to-end learning refers to learning a direct mapping from one end of the system to the other end of the system.

  • pros:
    • let the data speak: whatever the most appropriate function mapping from x to y is, if you have enough data and train a big enough neural network, hopefully the network will figure it out. A pure machine learning approach may be better able to capture whatever statistics are in the data, rather than being forced to reflect human preconceptions.
    • less hand-designing of components needed.
  • cons:
    • may need a large amount of data.
    • excludes potentially useful hand-designed components: if we don’t have a lot of data, the learning algorithm cannot gain much insight from the data, so hand-designed components can be a way for us to inject knowledge into the algorithm, which is not always a bad thing.

A learning algorithm has two main sources of knowledge: data and hand-designed components. When you have tons of data, it is less important to hand-design things; but when you don’t have much data, a carefully hand-designed system can allow humans to inject a lot of knowledge about the problem into the algorithm, and that can be very helpful.

So when you are building a new machine learning system and trying to decide whether or not to use end-to-end deep learning, the key question is: do you have sufficient data to learn a function of the complexity needed to map from x to y?

Intuitively, if you are trying to learn a function from x to y such as looking at an x-ray image and recognizing the positions of the bones in it, that seems like a relatively simple problem, so maybe you don’t need too much data for this task. Or, given a picture of a person, finding the face in the image doesn’t seem like a hard problem, so maybe you don’t need too much data either. In contrast, the function needed to look at a hand and map directly to the age of the child seems much more complex, so you would probably need a lot more data to apply a pure end-to-end deep learning approach.

