COURSE 3 Structuring Machine Learning Projects

Week 1

Why ML Strategy

Ideas

  • Collect more data
  • Collect a more diverse training set
  • Train algorithm longer with gradient descent
  • Try Adam instead of gradient descent
  • Try bigger / smaller network
  • Try dropout
  • Add L2 regularization
  • Network architecture
    • activation functions
    • # of hidden units

Orthogonalization

Orthogonalization or orthogonality is a system design property that assures that modifying an instruction or a component of an algorithm will not create or propagate side effects to other components of the system. This makes it easier to verify each component independently of the others, which reduces testing and development time.

Orthogonalization in supervised learning

When a supervised learning system is designed, these are the 4 assumptions that need to be true and orthogonal.

  1. Fit the training set well on the cost function

    • If it doesn’t fit well, using a bigger neural network or switching to a better optimization algorithm might help.
  2. Fit the development set well on the cost function

    • If it doesn’t fit well, regularization or using a bigger training set might help.
  3. Fit the test set well on the cost function

    • If it doesn’t fit well, using a bigger development set might help.
  4. Performs well in the real world

    • If it doesn’t perform well, the development/test set is not set up correctly or the cost function is not evaluating the right thing.

Evaluation Metric

To choose a classifier, a well-defined development set and an evaluation metric speed up the iteration process. There are different metrics to evaluate the performance of a classifier; they are called evaluation metrics. It is important to note that these evaluation metrics are computed on the training set, the development set or the test set.

Single Number Evaluation Metric

Predicted \ Actual |       1        |       0
                 1 | True Positive  | False Positive
                 0 | False Negative | True Negative

  • Precision

    Precision (%) = True Positives / (True Positives + False Positives) × 100

  • Recall

    Recall (%) = True Positives / (True Positives + False Negatives) × 100

The problem with using precision/recall as the evaluation metric is that when one classifier has better precision and another has better recall, you cannot easily tell which classifier is better.

  • F1-score, a harmonic mean, combines precision and recall into a single number (sketched below)

    F1-score (%) = 2 / (1/Precision + 1/Recall)
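
A minimal sketch of these three metrics in Python, assuming binary labels encoded as 0/1 (the function and variable names are illustrative, not from the course):

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 (as fractions) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 0, 1, 1], [1, 0, 0, 1]))  # (1.0, 0.666..., 0.8)
```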

Satisficing and Optimizing Metrics

The general rule is

  • 1 optimizing metric (you want to do as well as possible)
  • N - 1 satisficing metrics (they just need to meet a threshold; see the sketch after this list)
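
For example, with accuracy as the optimizing metric and running time as a satisficing metric, classifier selection could look like the sketch below (the 100 ms threshold and the classifier numbers are made up for illustration):

```python
# Keep only classifiers that meet the satisficing constraint (runtime <= 100 ms),
# then pick the one with the best optimizing metric (accuracy).
classifiers = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},  # best accuracy but too slow
]

feasible = [c for c in classifiers if c["runtime_ms"] <= 100]  # satisficing metric
best = max(feasible, key=lambda c: c["accuracy"])              # optimizing metric
print(best["name"])  # -> B
```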

Train / Dev / Test Distributions

Setting up the training, development and test sets has a huge impact on productivity. It is important to choose the development and test sets from the same distribution, and they must be drawn randomly from all the data.

The guideline is

Choose a development set and test set (from the same distribution) that reflect the data you expect to get in the future and consider important to do well on.

Size of Dev and Test Sets

Old Way of Splitting Data

  • training set (70%) + test set (30%)
  • training set (60%) + dev set (20%) + test set (20%)

Modern Era - Big Data

  • training set (98%) + dev set (1%) + test set (1%)

The guideline is

  • Set up the size of the test set to give high confidence in the overall performance of the system
  • The test set helps evaluate the performance of the final classifier, and could be less than 30% of the whole data set
  • The dev set has to be big enough to evaluate different ideas (a split sketch follows)
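
A minimal sketch of a random 98% / 1% / 1% split, assuming the data is one shuffled pool so the dev and test sets come from the same distribution (numpy is used only for the permutation):

```python
import numpy as np

def split_dataset(examples, seed=0):
    """Randomly split examples into ~98% train, ~1% dev, ~1% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(examples))           # shuffle so dev/test share one distribution
    n_dev = n_test = max(1, len(examples) // 100)  # roughly 1% each
    dev = [examples[i] for i in idx[:n_dev]]
    test = [examples[i] for i in idx[n_dev:n_dev + n_test]]
    train = [examples[i] for i in idx[n_dev + n_test:]]
    return train, dev, test
```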

When to Change Dev / Test Sets and Metrics

The guideline is

  1. Correctly define an evaluation metric that better rank-orders classifiers (see the weighted-error sketch below)
  2. Optimize the evaluation metric
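
As an illustration of step 1 (a hypothetical sketch, not the course's exact example): if some misclassifications are far more costly than others, a weighted error can rank-order classifiers better than plain error does.

```python
def weighted_error(y_true, y_pred, weights):
    """Weighted misclassification error: mistakes on high-weight examples count more."""
    mistakes = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return mistakes / sum(weights)

# e.g. weight 10 on examples whose misclassification is unacceptable, 1 elsewhere
print(weighted_error([1, 0, 1], [1, 1, 0], [1, 10, 1]))  # -> 11/12
```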

Why Human-level Performance


Machine learning progresses slowly when it surpasses human-level performance.

One of the reasons is that human-level performance can be close to the Bayes optimal error, especially for natural perception problems. (The Bayes optimal error is defined as the best possible error; in other words, no function mapping from x to y can surpass that level of accuracy.)

Also, while the performance of machine learning is worse than the performance of humans, you can improve it with different tools. They are harder to use once it surpasses human-level performance. These tools are

  • Get labeled data from humans
  • Gain insight from manual error analysis: why did a person get this right?
  • Better analysis of bias / variance

Avoidable bias

You want to improve the performance on the training set, but you can’t do better than the Bayes error; otherwise the model is overfitting the training set. By knowing the Bayes error, it is easier to focus on whether bias reduction or variance reduction tactics will improve the performance of the model.

Scenario A

There is a 7% gap between the performance on the training set and the human-level error. It means that the algorithm isn’t fitting the training set well, since the target is around 1%. To resolve the issue, we use bias reduction techniques such as training a bigger neural network or running the training longer.

Scenario B

The training set performance is good, since there is only a 0.5% difference with the human-level error. The difference between the training error and the human-level error is called avoidable bias. The focus here is to reduce the variance, since the difference between the training error and the development error is 2%. To resolve the issue, we use variance reduction techniques such as regularization or using a bigger training set.

Understanding Human-level Performance

human error (proxy for Bayes error) <—(avoidable bias)—> training error <—(variance)—> dev error

Scenario A

Table of classification errors (%): human error 1, 2 and 3 (proxies for Bayes error), training error, development error.

The choice of human-level performance doesn’t have a significant impact here. The avoidable bias is between 4% and 4.5% and the variance is 1%. Therefore, the focus should be on bias reduction techniques.

Scenario B

Table of classification errors (%): human error 1, 2 and 3 (proxies for Bayes error), training error, development error.

The choice of human-level performance doesn’t have a significant impact here. The avoidable bias is between 0% and 0.5% and the variance is 4%. Therefore, the focus should be on variance reduction techniques.

Scenario C

Table of classification errors (%): human error (proxy for Bayes error), training error, development error.

The estimate for the Bayes error is taken to be 0.5%, since without more information you can’t assume it is lower than human-level performance. Here, the avoidable bias is 0.2% and the variance is 0.1%. Therefore, the focus should be on bias reduction techniques.

Summary of Bias/Variance with Human-level Performance

  • Human-level error – proxy for Bayes error
  • If the difference between the human-level error and the training error is bigger than the difference between the training error and the development error, the focus should be on bias reduction techniques
  • If the difference between the training error and the development error is bigger than the difference between the human-level error and the training error, the focus should be on variance reduction techniques (see the helper sketch below)
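
These two rules can be captured in a small helper (a sketch; it simply compares the two gaps, using human-level error as the proxy for Bayes error):

```python
def diagnose(human_error, training_error, dev_error):
    """Suggest whether bias or variance reduction is the higher-priority focus."""
    avoidable_bias = training_error - human_error
    variance = dev_error - training_error
    if avoidable_bias > variance:
        return "bias reduction (bigger network, train longer, ...)"
    return "variance reduction (more data, regularization, ...)"

print(diagnose(0.5, 5.0, 6.0))  # avoidable bias 4.5 > variance 1.0 -> bias reduction
```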

Surpassing Human-level Performance

Scenario A

Table of classification errors (%): team of humans, one human, training error, development error.

The Bayes error proxy is 0.5% (the team of humans, surpassing a single human's 1.0%); therefore the avoidable bias is 0.1% and the variance is 0.2%.

Scenario B

Table of classification errors (%): team of humans, one human, training error, development error.

There is not enough information to know whether bias reduction or variance reduction has to be done on the algorithm. It doesn’t mean that the model cannot be improved; it means that the conventional ways of deciding between bias reduction and variance reduction do not work in this case.

Problems Where ML Significantly Surpasses Human-level Performance (especially with structured data)

  • Online advertising
  • Product recommendations
  • Logistics (predicting transit time)
  • Loan approvals

Improving Your Model Performance

Two Fundamental Assumptions of Supervised Learning

  • You can fit the training set pretty well (avoidable bias is low)
    • train bigger model
    • train longer, better optimization algorithms
    • neural network architecture / hyperparameter search
  • The training set performance generalizes pretty well to the dev/test set (variance is low)
    • more data
    • regularization
    • neural network architecture / hyperparameter search

Week 2

Carrying Out Error Analysis

Error analysis on the dev set to decide whether a particular error category (e.g., another class being misclassified) is worth working on:

  • get ~100 mislabeled dev set examples
  • count up how many belong to each error category (see the counting sketch below)
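
A sketch of the counting step, assuming each examined example was hand-tagged with one error category while reviewing (the category names are illustrative):

```python
from collections import Counter

# Tags assigned while manually reviewing ~100 mislabeled dev set examples.
error_tags = ["dog", "blurry", "dog", "great_cat", "blurry", "dog"]

counts = Counter(error_tags)
for category, count in counts.most_common():
    print(f"{category}: {count / len(error_tags):.0%} of examined errors")
```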

Cleaning Up Incorrectly Labeled Data

DL algorithms are quite robust to random errors in the training set

Correcting Incorrect Dev/Test Set Examples

  • apply same process to your dev and test sets to make sure they continue to come from the same distribution
  • consider examining examples your algorithm got right as well as ones it got wrong
  • train and dev/test data may now come from slightly different distributions

Build Your First System Quickly, Then Iterate

The guideline is

  1. set up dev/test set and metric
    • set up a target
  2. build initial system quickly
    • training set: fit the parameters
    • development set: tune the parameters
    • test set: assess the performance
  3. use bias/variance analysis & error analysis to prioritize next steps

Training and Testing on Different Distributions

Example

There are two sources of data used to develop the mobile app.

  1. small: 10 000 pictures uploaded from the mobile application (not professionally taken)
  2. large: 200 000 pictures downloaded from the web

The guideline used is that you have to choose a development set and test set that reflect the data you expect to get in the future and consider important to do well on.


The advantage of this way of splitting up is that the target is well defined.

The disadvantage is that the training distribution is different from the development and test set distributions.

Bias and Variance with Mismatched Data Distributions

When the training set distribution differs from the development and test set distributions, we have a data mismatch problem.

The training-development set has the same distribution as the training set, but it is not used for training the neural network.

Bayes error <—(avoidable bias)—> training set error <—(variance)—> training-development set error <—(data mismatch)—> development set error <—(degree of overfitting to the development set)—> test set error
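
A sketch that reads each gap in this chain off the measured errors (the error values below are placeholders, not from the course):

```python
errors = {                       # classification error in %
    "human (Bayes proxy)": 0.5,
    "training": 1.0,
    "training-dev": 1.5,
    "dev": 9.0,
    "test": 9.5,
}

avoidable_bias = errors["training"] - errors["human (Bayes proxy)"]
variance = errors["training-dev"] - errors["training"]
data_mismatch = errors["dev"] - errors["training-dev"]
dev_overfitting = errors["test"] - errors["dev"]
print(avoidable_bias, variance, data_mismatch, dev_overfitting)  # 0.5 0.5 7.5 0.5
```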

Addressing Data Mismatch

The guideline is

  • Perform manual error analysis to understand the error differences between the training and development/test sets. Error analysis should never be done on the test set, to avoid overfitting to it
  • Make the training data more similar to the development and test sets, or collect more data similar to them (normal data + noise = synthesized data; see the noise-mixing sketch below)
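
A minimal numpy sketch of the "normal data + noise = synthesized data" idea for audio (the arrays, the scaling factor and the assumption that the noise clip is longer than the clean clip are all illustrative):

```python
import numpy as np

def synthesize(clean_audio, background_noise, noise_scale=0.1, seed=0):
    """Mix clean audio with a random slice of background noise of the same length."""
    rng = np.random.default_rng(seed)
    start = rng.integers(0, len(background_noise) - len(clean_audio))
    noise_clip = background_noise[start:start + len(clean_audio)]
    return clean_audio + noise_scale * noise_clip
```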

Transfer Learning

Transfer learning refers to reusing the knowledge a neural network learned on one task for another application.

When to Use Transfer Learning

  • Task A and B have the same input x
  • A lot more data for Task A than Task B
  • Low level features from Task A could be helpful for Task B

The guideline is

  • delete the last layer of the neural network (the output layer)
  • delete the weights feeding into the last output layer of the neural network
  • create a new set of randomly initialized weights for the last layer only
  • retrain on the new data set (x, y) (see the sketch below)
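
A PyTorch-style sketch of these steps, assuming a pretrained torchvision model whose output layer is `model.fc` (the class count and the decision to freeze the earlier layers are assumptions; with enough task-B data you would fine-tune them instead):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # network pre-trained on task A

for p in model.parameters():      # freeze the pretrained layers so only the new
    p.requires_grad = False       # output layer is learned on the small task-B data set

num_classes_task_b = 10           # assumption: task B has 10 classes
model.fc = nn.Linear(model.fc.in_features, num_classes_task_b)  # new random weights

# Then train on the new (x, y) data set; only model.fc's parameters get updated.
```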


Multi-task Learning

Multi-task learning refers to having one neural network do several tasks simultaneously.

When to Use Multi-task Learning

  • Training on a set of tasks that could benefit from having shared lower-level features
  • Usually: Amount of data you have for each task is quite similar
  • Can train a big enough neural network to do well on all tasks

Neural Network Architecture


Loss Function

Cost = (1/m) Σ_{i=1..m} Σ_{j=1..n_y} −[ y_j⁽ⁱ⁾ log ŷ_j⁽ⁱ⁾ + (1 − y_j⁽ⁱ⁾) log(1 − ŷ_j⁽ⁱ⁾) ]
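
A numpy sketch of this cost for multi-task binary labels, assuming ŷ already holds sigmoid outputs (a small epsilon guards the logs):

```python
import numpy as np

def multitask_cost(y_hat, y, eps=1e-12):
    """y_hat, y: arrays of shape (m, n_y). Sum the logistic loss over the n_y
    tasks of each example, then average over the m examples."""
    per_label = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return per_label.sum(axis=1).mean()
```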

What Is End-to-End Deep Learning

End-to-end deep learning is the simplification of a processing or learning pipeline into one neural network. End-to-end deep learning cannot be used for every problem, since it needs a lot of labeled data.

Speech Recognition Models

  • The traditional way - small data set

    audio—>extract features—>phonemes—>words—>transcript

  • The hybrid way - medium data set

    audio—>phonemes—>words—>transcript

  • The End-to-End deep learning way - large data set

    audio—>transcript

Whether to Use End-to-End Learning

Pros

  • Lets the data speak
  • Less hand-designing of components needed

Cons

  • May need large amount of data
  • Excludes potentially useful hand-designed components