Understanding Deep Learning in One Day (一天搞懂深度学习)
Lecture I: Introduction of Deep Learning
Three Steps for Deep Learning
- define a set of functions (Neural Network)
- goodness of function
- pick the best function
Softmax layer as the output layer.
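A minimal NumPy sketch of what a softmax output layer computes (the example logits are made up for illustration):

```python
import numpy as np

def softmax(z):
    """Turn a vector of raw network outputs (logits) into probabilities."""
    z = z - np.max(z)            # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([3.0, 1.0, -2.0])
probs = softmax(logits)
print(probs, probs.sum())        # outputs are positive and sum to 1
```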
FAQ: How many layers? How many neurons for each layer?
Trial and error + intuition
Gradient Descent
- Pick an initial value for $w$
  - Random (good enough)
  - RBM pre-training
- Compute $\frac{\partial L}{\partial w}$ and update $w \leftarrow w - \eta \frac{\partial L}{\partial w}$, where $\eta$ is called the "learning rate"
- Repeat until $\frac{\partial L}{\partial w}$ is approximately zero
But gradient descent never guarantees reaching the global minimum.
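A minimal sketch of this loop on a made-up one-dimensional loss $L(w) = (w-3)^2$ (the loss is only for illustration, not from the lecture):

```python
import numpy as np

def grad_L(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, i.e. dL/dw = 2(w - 3)."""
    return 2.0 * (w - 3.0)

eta = 0.1                      # learning rate
w = np.random.randn()          # pick an initial value at random (good enough)
for step in range(1000):       # repeat until dL/dw is approximately zero
    g = grad_L(w)
    if abs(g) < 1e-6:
        break
    w = w - eta * g            # w <- w - eta * dL/dw
print(w)                       # close to w = 3 (the toy loss happens to be convex)
```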
Modularization
- Deep → Modularization
- Each basic classifier can have sufficient training examples
- The basic classifiers are shared by the following classifiers as modules
- The modularization is automatically learned from data
Lecture II: Tips for Training DNN
Do not always blame overfitting
- Reason for overfitting: training data and testing data can be different
- Panacea for overfitting: have more training data or create more training data
Different approaches for different problems
Choosing a proper loss
- Square error (mse): $\sum_i (y_i - \hat{y}_i)^2$
- Cross entropy (categorical_crossentropy): $-\sum_i \hat{y}_i \ln y_i$
- When using a softmax output layer, choose cross entropy
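A hedged Keras sketch of this choice (the layer sizes and 784-dimensional input are placeholders, not from the lecture):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),   # softmax output layer
])
# With a softmax output layer, pick cross entropy rather than square error:
model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# loss='mse' would also run, but cross entropy gives a larger gradient when the
# prediction is far from the target, so training gets moving more easily.
```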
Mini-batch
- Mini-batch is faster (see the sketch after this list)
  1. Randomly initialize network parameters
  2. Pick the 1st batch, update parameters once
  3. Pick the 2nd batch, update parameters once
  4. …
  5. Until all mini-batches have been picked (one epoch finished)
  6. Repeat the above process (steps 2–5)
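A minimal sketch of the mini-batch loop above; `update_once` is a placeholder for one gradient update on a batch and is not from the lecture:

```python
import numpy as np

def update_once(params, x_batch, y_batch):
    # Placeholder: compute gradients on this batch and update the parameters.
    return params

def train(params, X, Y, batch_size=32, epochs=10):
    n = len(X)
    for epoch in range(epochs):                    # repeat the whole process
        order = np.random.permutation(n)           # reshuffle before each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # pick the next mini-batch
            params = update_once(params, X[idx], Y[idx])  # update once per batch
        # all mini-batches have been picked -> one epoch finished
    return params
```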
New activation function
- Vanishing gradient problem
  - RBM pre-training
- Rectified Linear Unit (ReLU)
  - Fast to compute
  - Biological reason
  - Equivalent to infinitely many sigmoids with different biases
  - A thinner linear network
  - A special case of Maxout
  - Handles the vanishing gradient problem
- ReLU variants
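A minimal NumPy sketch of ReLU and one common variant (Leaky ReLU, shown here only as an example of a variant; the input values are made up):

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: pass positive inputs through, zero out the rest."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """A common ReLU variant: keep a small slope alpha for negative inputs."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))         # [0.  0.  0.  1.5]
print(leaky_relu(z))   # [-0.02  -0.005  0.  1.5]
```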
Adaptive Learning Rate
- Popular & simple idea: reduce the learning rate by some factor every few epochs, e.g. $\eta^t = \frac{\eta}{\sqrt{t+1}}$
- Adagrad
  - Original: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$
  - Adagrad: $w \leftarrow w - \eta_w \frac{\partial L}{\partial w}$, where $\eta_w = \frac{\eta}{\sqrt{\sum_{i=0}^{t} (g^i)^2}}$ and $g^i$ is the gradient at step $i$
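A minimal sketch of the Adagrad update for a single weight; the gradient values are made up for illustration:

```python
import numpy as np

eta = 0.1                # base learning rate
w = 0.0                  # a single weight
sum_sq_grad = 0.0        # running sum of squared past gradients
for g in [1.0, 0.8, 0.5, 0.3]:            # g^i = dL/dw observed at step i
    sum_sq_grad += g ** 2
    eta_w = eta / np.sqrt(sum_sq_grad)    # per-parameter learning rate
    w = w - eta_w * g                     # w <- w - eta_w * dL/dw
print(w)
```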
Momentum
- Movement = negative of $\frac{\partial L}{\partial w}$ + momentum (the previous movement)
- Adam = RMSProp (advanced Adagrad) + Momentum
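A minimal sketch of the momentum update for a single weight; the coefficient 0.9 and the gradient values are made up:

```python
eta = 0.1          # learning rate
lam = 0.9          # momentum coefficient (how much of the last movement to keep)
w = 0.0
movement = 0.0
for g in [1.0, 0.8, -0.2, 0.5]:            # dL/dw at successive steps
    movement = lam * movement - eta * g    # negative gradient + momentum
    w = w + movement
print(w)
```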
Early Stopping
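A hedged Keras sketch of early stopping (the parameter values are illustrative): stop training when the loss on held-out validation data stops improving.

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',          # watch performance on held-out validation data
    patience=3,                  # tolerate a few epochs without improvement
    restore_best_weights=True,   # roll back to the best epoch seen so far
)
# Assuming a compiled `model` and training data exist:
# model.fit(x_train, y_train, epochs=100, validation_split=0.1,
#           callbacks=[early_stop])
```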
Weight Decay
- Original: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$
- Weight decay: $w \leftarrow 0.99\,w - \eta \frac{\partial L}{\partial w}$
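A minimal sketch of the weight-decay update for a single weight; the gradient values are made up:

```python
eta = 0.1
w = 1.0
for g in [0.5, 0.3, 0.2]:        # dL/dw at successive steps
    w = 0.99 * w - eta * g       # shrink w slightly, then take the gradient step
print(w)
```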
Dropout
- Training:
  - Each neuron has a p% chance to drop out
  - The structure of the network is changed
  - Use the new (thinner) network for training
- Testing:
  - No dropout; if the dropout rate at training is p%, multiply all the weights by (1 − p)%
- Dropout is a kind of ensemble
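A minimal NumPy sketch of dropout for one layer (shapes and weights are made up): neurons are dropped at random during training, and at test time the weights are scaled by (1 − p) instead.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # dropout rate used at training time
W = rng.standard_normal((3, 4))          # weights of one layer (illustrative)
x = rng.standard_normal(3)               # input to the layer

# Training: each output neuron is dropped with probability p, so a different,
# thinner network is trained at every update.
mask = rng.random(4) >= p
train_out = (x @ W) * mask

# Testing: no dropout; instead all the weights are multiplied by (1 - p).
test_out = x @ (W * (1.0 - p))
print(train_out)
print(test_out)
```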
Lecture III: Variants of Neural Networks
Convolutional Neural Network (CNN)
- The convolution layer is not fully connected
- The convolution layer shares weights across positions
- Learning: gradient descent
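A hedged Keras sketch of a small CNN (the layer sizes and 28×28 input are placeholders, not from the lecture): the convolution layer is sparsely connected and shares its filter weights, and the whole network is still trained by gradient descent.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu',   # shared filter weights
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```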
Recurrent Neural Network (RNN)
Long Short-term Memory (LSTM)
- Gated Recurrent Unit (GRU): simpler than LSTM
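A hedged Keras sketch of a recurrent model (the vocabulary size, embedding size, and layer sizes are placeholders): swap the LSTM layer for GRU to get the simpler gated variant.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.LSTM(64),                 # or tf.keras.layers.GRU(64)
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```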
Lecture IV: Next Wave
Supervised Learning
Ultra Deep Network
Worry about training first!
These ultra deep networks have a special structure.
An ultra deep network is an ensemble of many networks with different depths.
Ensemble: 6 layers, 4 layers, or 2 layers
- FractalNet
- Residual Network
- Highway Network
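A minimal NumPy sketch of the residual idea behind these networks (the layer sizes and weights are made up, not the actual ResNet or Highway formulation): each block adds its input back to a learned transformation, so gradients can skip layers and the deep network behaves like an ensemble of shallower paths.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """Output = input + F(input): the skip connection lets gradients bypass layers."""
    return x + W2 @ relu(W1 @ x)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = rng.standard_normal((8, 8)) * 0.1
print(residual_block(x, W1, W2))
```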
Attention Model
- Attention-based Model
- Attention-based Model v2
Reinforcement Learning
- Agent learns to take actions to maximize expected reward.
- Difficulties of Reinforcement Learning:
  - It may be better to sacrifice immediate reward to gain more long-term reward
  - Agent's actions affect the subsequent data it receives
Unsupervised Learning
- Image: Realizing what the World Looks Like
- Text: Understanding the Meaning of Words
- Audio: Learning human language without supervision