Understanding Deep Learning in One Day (一天搞懂深度学习)
Lecture I: Introduction of Deep Learning
Three Steps for Deep Learning
- define a set of functions (Neural Network)
- goodness of function
- pick the best function
Softmax layer as the output layer.
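A minimal NumPy sketch of what a softmax output layer computes (the example logits are made up for illustration):

```python
import numpy as np

def softmax(z):
    """Turn a vector of raw network outputs (logits) into probabilities."""
    z = z - np.max(z)            # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([3.0, 1.0, -2.0])
probs = softmax(logits)
print(probs, probs.sum())        # outputs are positive and sum to 1
```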
FAQ: How many layers? How many neurons for each layer?
Trial and error + intuition
Gradient Descent
- Pick an initial value for $w$
  - Random (good enough)
  - RBM pre-training
- Compute $\frac{\partial L}{\partial w}$ and update $w \leftarrow w - \eta \frac{\partial L}{\partial w}$, where $\eta$ is called the "learning rate"
- Repeat until $\frac{\partial L}{\partial w}$ is approximately zero
But gradient descent never guarantees reaching the global minimum.
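A minimal sketch of this loop on a made-up one-dimensional loss $L(w) = (w-3)^2$ (the loss is only for illustration, not from the lecture):

```python
import numpy as np

def grad_L(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, i.e. dL/dw = 2(w - 3)."""
    return 2.0 * (w - 3.0)

eta = 0.1                      # learning rate
w = np.random.randn()          # pick an initial value at random (good enough)
for step in range(1000):       # repeat until dL/dw is approximately zero
    g = grad_L(w)
    if abs(g) < 1e-6:
        break
    w = w - eta * g            # w <- w - eta * dL/dw
print(w)                       # close to w = 3 (the toy loss happens to be convex)
```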
Modularization
- Deep → Modularization
- Each basic classifier can have sufficient training examples
- The basic classifiers are shared by the following classifiers as modules
- The modularization is automatically learned from data
Lecture II: Tips for Training DNN
Do not always blame overfitting
- Reason for overfitting: training data and testing data can be different
- Panacea for overfitting: have more training data or create more training data
Different approaches for different problems
Choosing a proper loss
- Square error (mse): $\sum_i (y_i - \hat{y}_i)^2$
- Cross entropy (categorical_crossentropy): $-\sum_i \hat{y}_i \ln y_i$
- When using a softmax output layer, choose cross entropy
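A hedged Keras sketch of this choice (the layer sizes and 784-dimensional input are placeholders, not from the lecture):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),   # softmax output layer
])
# With a softmax output layer, pick cross entropy rather than square error:
model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# loss='mse' would also run, but cross entropy gives a larger gradient when the
# prediction is far from the target, so training gets moving more easily.
```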
Mini-batch
- Mini-batch is faster (see the sketch after this list)
  1. Randomly initialize network parameters
  2. Pick the 1st batch, update parameters once
  3. Pick the 2nd batch, update parameters once
  4. …
  5. Until all mini-batches have been picked (one epoch finished)
  6. Repeat the above process (steps 2–5)
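A minimal sketch of the mini-batch loop above; `update_once` is a placeholder for one gradient update on a batch and is not from the lecture:

```python
import numpy as np

def update_once(params, x_batch, y_batch):
    # Placeholder: compute gradients on this batch and update the parameters.
    return params

def train(params, X, Y, batch_size=32, epochs=10):
    n = len(X)
    for epoch in range(epochs):                    # repeat the whole process
        order = np.random.permutation(n)           # reshuffle before each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # pick the next mini-batch
            params = update_once(params, X[idx], Y[idx])  # update once per batch
        # all mini-batches have been picked -> one epoch finished
    return params
```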
New activation function
- Vanishing gradient problem
  - RBM pre-training
- Rectified Linear Unit (ReLU)
  - Fast to compute
  - Biological reason
  - Equivalent to infinitely many sigmoids with different biases
  - A thinner linear network
  - A special case of Maxout
  - Handles the vanishing gradient problem
- ReLU variants
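A minimal NumPy sketch of ReLU and one common variant (Leaky ReLU, shown here only as an example of a variant; the input values are made up):

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: pass positive inputs through, zero out the rest."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """A common ReLU variant: keep a small slope alpha for negative inputs."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))         # [0.  0.  0.  1.5]
print(leaky_relu(z))   # [-0.02  -0.005  0.  1.5]
```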
Adaptive Learning Rate
- Popular & simple idea: reduce the learning rate by some factor every few epochs, e.g. $\eta^t = \frac{\eta}{\sqrt{t+1}}$
- Adagrad
  - Original: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$
  - Adagrad: $w \leftarrow w - \eta_w \frac{\partial L}{\partial w}$, where $\eta_w = \frac{\eta}{\sqrt{\sum_{i=0}^{t} (g^i)^2}}$ and $g^i$ is the gradient at step $i$
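A minimal sketch of the Adagrad update for a single weight; the gradient values are made up for illustration:

```python
import numpy as np

eta = 0.1                # base learning rate
w = 0.0                  # a single weight
sum_sq_grad = 0.0        # running sum of squared past gradients
for g in [1.0, 0.8, 0.5, 0.3]:            # g^i = dL/dw observed at step i
    sum_sq_grad += g ** 2
    eta_w = eta / np.sqrt(sum_sq_grad)    # per-parameter learning rate
    w = w - eta_w * g                     # w <- w - eta_w * dL/dw
print(w)
```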
Momentum
- Movement = negative of $\frac{\partial L}{\partial w}$ + momentum (the previous movement)
- Adam = RMSProp (advanced Adagrad) + Momentum
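A minimal sketch of the momentum update for a single weight; the coefficient 0.9 and the gradient values are made up:

```python
eta = 0.1          # learning rate
lam = 0.9          # momentum coefficient (how much of the last movement to keep)
w = 0.0
movement = 0.0
for g in [1.0, 0.8, -0.2, 0.5]:            # dL/dw at successive steps
    movement = lam * movement - eta * g    # negative gradient + momentum
    w = w + movement
print(w)
```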
Early Stopping
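A hedged Keras sketch of early stopping (the parameter values are illustrative): stop training when the loss on held-out validation data stops improving.

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',          # watch performance on held-out validation data
    patience=3,                  # tolerate a few epochs without improvement
    restore_best_weights=True,   # roll back to the best epoch seen so far
)
# Assuming a compiled `model` and training data exist:
# model.fit(x_train, y_train, epochs=100, validation_split=0.1,
#           callbacks=[early_stop])
```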
Weight Decay
- Original: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$
- Weight decay: $w \leftarrow 0.99\,w - \eta \frac{\partial L}{\partial w}$
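A minimal sketch of the weight-decay update for a single weight; the gradient values are made up:

```python
eta = 0.1
w = 1.0
for g in [0.5, 0.3, 0.2]:        # dL/dw at successive steps
    w = 0.99 * w - eta * g       # shrink w slightly, then take the gradient step
print(w)
```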
Dropout
- Training:
  - Each neuron has a p% chance to drop out
  - The structure of the network is changed
  - Use the new (thinner) network for training
- Testing:
  - No dropout; if the dropout rate at training is p%, multiply all the weights by (1 − p)%
- Dropout is a kind of ensemble
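A minimal NumPy sketch of dropout for one layer (shapes and weights are made up): neurons are dropped at random during training, and at test time the weights are scaled by (1 − p) instead.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # dropout rate used at training time
W = rng.standard_normal((3, 4))          # weights of one layer (illustrative)
x = rng.standard_normal(3)               # input to the layer

# Training: each output neuron is dropped with probability p, so a different,
# thinner network is trained at every update.
mask = rng.random(4) >= p
train_out = (x @ W) * mask

# Testing: no dropout; instead all the weights are multiplied by (1 - p).
test_out = x @ (W * (1.0 - p))
print(train_out)
print(test_out)
```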
Lecture III: Variants of Neural Networks
Convolutional Neural Network (CNN)
- The convolution layer is not fully connected
- The convolution layer shares weights across positions
- Learning: gradient descent
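A hedged Keras sketch of a small CNN (the layer sizes and 28×28 input are placeholders, not from the lecture): the convolution layer is sparsely connected and shares its filter weights, and the whole network is still trained by gradient descent.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu',   # shared filter weights
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```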
Recurrent Neural Network (RNN)
Long Short-term Memory (LSTM)
- Gated Recurrent Unit (GRU): simpler than LSTM
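A hedged Keras sketch of a recurrent model (the vocabulary size, embedding size, and layer sizes are placeholders): swap the LSTM layer for GRU to get the simpler gated variant.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.LSTM(64),                 # or tf.keras.layers.GRU(64)
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```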
Lecture IV: Next Wave
Supervised Learning
Ultra Deep Network
Worry about training first!
These ultra deep networks have a special structure.
An ultra deep network is an ensemble of many networks with different depths.
Ensemble: 6 layers, 4 layers, or 2 layers
- FractalNet
- Residual Network
- Highway Network
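A minimal NumPy sketch of the residual idea behind these networks (the layer sizes and weights are made up, not the actual ResNet or Highway formulation): each block adds its input back to a learned transformation, so gradients can skip layers and the deep network behaves like an ensemble of shallower paths.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """Output = input + F(input): the skip connection lets gradients bypass layers."""
    return x + W2 @ relu(W1 @ x)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = rng.standard_normal((8, 8)) * 0.1
print(residual_block(x, W1, W2))
```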
Attention Model
- Attention-based Model
- Attention-based Model v2
Reinforcement Learning
- Agent learns to take actions to maximize expected reward.
- Difficulties of Reinforcement Learning:
  - It may be better to sacrifice immediate reward to gain more long-term reward
  - Agent's actions affect the subsequent data it receives
Unsupervised Learning
- Image: Realizing what the World Looks Like
- Text: Understanding the Meaning of Words
- Audio: Learning human language without supervision