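I already have two years of ML experience, so I am using this course series mainly to fill in gaps. These notes record the finer details that I did not know before.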

Overview of This Lecture

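This lecture is presented by 杨舒涵.
The first part uses a biological phenomenon, Hebbian theory, to motivate the related machine learning techniques.
Next come reports on several papers, starting with knowledge distillation.
Then comes Memory Aware Synapses: Learning what (not) to forget (MAS), the first unsupervised LLL method.
Next is Learning without Forgetting (LwF), the first paper to apply knowledge distillation to LLL.
Large Scale Incremental Learning (BiC) addresses how to balance new and old tasks when the number of classes is large.
Few-Shot Class-Incremental Learning (FSCIL) also targets the new/old class imbalance, but in the few-shot setting. It replaces knowledge distillation with neural gas (NG) and, taking topology and the neurophysiological idea of Hebbian learning as its starting point, proposes the TOpology-Preserving knowledge InCrementer (TOPIC) framework.
Finally, the conclusion, LLL Nowadays & Future(?), is discussed. For now, LLL is still used mainly in academia.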

Details

Outline

Hebbian Theory
- Long-term Potentiation (LTP)
- Hebbian Theory
- Competitive Hebbian Learning
Knowledge Distillation
Memory Aware Synapses: Learning what (not) to forget (MAS)
Learning without Forgetting (LwF)
Large Scale Incremental Learning (BiC)
Few-Shot Class-Incremental Learning (FSCIL)
LLL Nowadays & Future(?)

Hebbian Theory

Long-term Potentiation (LTP)

[Slide: 李宏毅 2020 ML/DL, P106 More about Life Long Learning]
As the slide shows, Long-term Potentiation is a biological concept: a form of plasticity in how neurons respond.

Hebbian Theory

Hebbian theory therefore suggests an idea for the weights of a neural network: a connection should grow stronger when the units at both of its ends are active together.
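To make the weight intuition concrete, here is a minimal sketch of a plain Hebbian update, where each weight grows in proportion to the product of the activities it connects. The function name `hebbian_update`, the learning rate `lr`, and the toy activations are my own illustrative choices, not something taken from the lecture.

```python
def hebbian_update(weights, pre, post, lr=0.1):
    """Return new weights after one plain Hebbian step: w_ij += lr * post_i * pre_j."""
    return [
        [w_ij + lr * post_i * pre_j for w_ij, pre_j in zip(row, pre)]
        for row, post_i in zip(weights, post)
    ]


# Toy setup (assumed): two input units, one output unit.
# Only input 0 is active at the same time as the output.
w = [[0.0, 0.0]]
for _ in range(3):
    w = hebbian_update(w, pre=[1.0, 0.0], post=[1.0])

print(w)  # the weight from input 0 has grown, the weight from input 1 has not
```

Note that this plain rule only ever increases weights for co-active units; practical variants usually add some form of normalization or decay.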

Competitive Hebbian Learning

This in turn leads to Competitive Hebbian Learning.
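The notes do not spell out the rule, so the sketch below assumes the common formulation of competitive Hebbian learning: for every input, the nearest and second-nearest units are connected by an edge, which gradually builds a graph reflecting the topology of the data. The function name `competitive_hebbian_edges` and the toy units and samples are hypothetical.

```python
def competitive_hebbian_edges(units, samples):
    """For each sample, link the nearest and the second-nearest unit by an edge."""
    edges = set()
    for x in samples:
        # Rank unit indices by squared Euclidean distance to the sample.
        ranked = sorted(
            range(len(units)),
            key=lambda i: sum((u - v) ** 2 for u, v in zip(units[i], x)),
        )
        first, second = ranked[0], ranked[1]
        # Store edges as sorted index pairs so (0, 1) and (1, 0) coincide.
        edges.add((min(first, second), max(first, second)))
    return edges


# Toy data (assumed): two units close together, one far away.
units = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
samples = [(0.4, 0.1), (0.7, -0.2)]
edges = competitive_hebbian_edges(units, samples)
print(edges)
```

Both samples fall between the first two units, so only those two get connected; the distant unit stays isolated.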

Knowledge Distillation

https://arxiv.org/pdf/1503.02531.pdf

Problem Statement

Why was “knowledge distillation” proposed?

- Training a model and using a model have different requirements.
- Model compression.

What is “knowledge”?

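Here, knowledge is understood as the learned mapping from input vectors to output vectors.

A distillation temperature $T$ is typically used to soften the output distribution, where $z_i$ are the logits: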
$q_i = \frac{e^{z_i/T}}{\sum_j e^{z_j /T}}$
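The effect of $T$ is easy to check numerically. The sketch below implements the formula directly; the function name `softmax_with_temperature` and the example logits are assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Compute q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [5.0, 2.0, 0.5]  # assumed example values
sharp = softmax_with_temperature(logits, T=1.0)
soft = softmax_with_temperature(logits, T=5.0)

print(sharp)  # nearly all mass on the first class
print(soft)   # noticeably flatter distribution
```

A higher temperature keeps the ranking of the classes but spreads probability mass more evenly, which is what exposes the relative similarities between classes to the small model.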

The soft targets produced by the large model are used as the ground truth when training the small model.

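The overall loss combines two terms: the cross-entropy of the small model against the soft targets, and the cross-entropy against the hard targets weighted by $1/T^2$. The reason is that the gradients produced by the soft targets scale as $1/T^2$, so this weighting keeps the influence of the two terms comparable. Equivalently, the soft-target term can be multiplied by $T^2$.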
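A minimal sketch of that combined loss for a single example, following the weighting described in these notes (soft-target term plus the hard-target term divided by $T^2$). The helper names, the default `T=2.0`, and the toy logits are my own assumptions; the original paper additionally uses a tunable weighting between the two terms, which is left out here.

```python
import math


def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs):
    # Terms with zero target probability contribute nothing.
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred_probs) if t > 0)


def distillation_loss(student_logits, teacher_logits, label, T=2.0):
    # Soft term: student and teacher are both softened with the same temperature.
    soft_targets = softmax(teacher_logits, T)
    soft_loss = cross_entropy(soft_targets, softmax(student_logits, T))
    # Hard term: ordinary cross-entropy at T = 1 against the one-hot label.
    hard_target = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_loss = cross_entropy(hard_target, softmax(student_logits, 1.0))
    return soft_loss + hard_loss / T ** 2


teacher = [4.0, 1.0, 0.0]  # assumed teacher logits
loss_good = distillation_loss([4.0, 1.0, 0.0], teacher, label=0)
loss_bad = distillation_loss([0.0, 1.0, 4.0], teacher, label=0)
print(loss_good, loss_bad)
```

A student whose logits agree with the teacher gets a clearly lower loss than one that ranks the classes in the opposite order.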

Why knowledge distillation works?

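One way to picture it: the larger $T$ is, the closer the class probabilities are to one another, which makes training more demanding. Switching back to the ordinary softmax ($T = 1$) afterwards returns the model to the easier setting, where it performs better.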

Memory Aware Synapses: Learning what (not) to forget (MAS)

As the slide shows, an extra term is added to the objective to keep the parameters from changing too much.
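For reference, my reading of the MAS paper is that the regularized objective has roughly the following form. The symbols $L_n$ (loss on the new task), $\lambda$ (regularization strength), and $\theta^*$ (parameters learned on the previous tasks) come from the paper rather than from these notes, and $\Omega_{ij}$ is the importance of parameter $\theta_{ij}$:

$L(\theta) = L_n(\theta) + \lambda \sum_{i,j} \Omega_{ij} (\theta_{ij} - \theta^*_{ij})^2$

Parameters with a large $\Omega_{ij}$ are expensive to move, while unimportant ones remain free to adapt to the new task.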

The slide also lists the goals that this method achieves.

Concepts

As the slide shows, we want the model to learn for itself how sensitive it is to each parameter.

The slide then derives the final formula for the importance weights $\Omega$.
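As a rough illustration, the sketch below computes importance the way I understand the MAS paper to define it: $\Omega_{ij}$ is the average absolute gradient of the squared L2 norm of the model output with respect to $\theta_{ij}$, taken over unlabeled inputs. To keep it self-contained it assumes a toy linear model $F(x) = Wx$, for which that gradient is $2 (Wx)_i x_j$; the function names and data are hypothetical.

```python
def mas_importance(W, inputs):
    """Omega_ij = mean over inputs of |d ||W x||^2 / d W_ij| = |2 * (W x)_i * x_j|."""
    omega = [[0.0] * len(W[0]) for _ in W]
    for x in inputs:
        out = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
        for i, out_i in enumerate(out):
            for j, x_j in enumerate(x):
                omega[i][j] += abs(2.0 * out_i * x_j) / len(inputs)
    return omega


def mas_penalty(W_new, W_old, omega, lam=1.0):
    """Quadratic penalty: lam * sum_ij Omega_ij * (W_new_ij - W_old_ij)^2."""
    return lam * sum(
        o * (a - b) ** 2
        for row_o, row_a, row_b in zip(omega, W_new, W_old)
        for o, a, b in zip(row_o, row_a, row_b)
    )


# Toy data (assumed): the second input feature is always zero,
# so the output never depends on the second weight.
W_old = [[1.0, 0.5]]
unlabeled_inputs = [[1.0, 0.0], [2.0, 0.0]]
omega = mas_importance(W_old, unlabeled_inputs)

print(omega)
print(mas_penalty([[1.0, 3.0]], W_old, omega))  # moving the unimportant weight
print(mas_penalty([[2.0, 0.5]], W_old, omega))  # moving the important weight
```

No labels appear anywhere in the computation, which is why this measure can be estimated on unlabeled data.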

Learned function vs. Loss

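As the slide shows, when “importance” is tied to the loss value, the difference is measured against the true labels.

In that case, however, unlabeled data cannot be added for training.

If importance is instead measured on the learned function itself, in place of the change in the loss, unlabeled data can take part in training. We can then compare how the values of the learned function differ between two runs.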