Paper notes: Evolving Losses for Unsupervised Video Representation Learning


Distillation

Background on knowledge distillation (summarized from a Zhihu post)

Distill knowledge from a teacher model (Net-T) into a student model (Net-S).

Goal: compress the model to make deployment easier.

$$L=\alpha L_{soft}+\beta L_{hard}$$

$$L_{soft}=-\sum_{j}^{N} p_{j}^{T} \log \left(q_{j}^{T}\right), \text{ where } p_{i}^{T}=\frac{\exp \left(v_{i} / T\right)}{\sum_{k}^{N} \exp \left(v_{k} / T\right)},\quad q_{i}^{T}=\frac{\exp \left(z_{i} / T\right)}{\sum_{k}^{N} \exp \left(z_{k} / T\right)}$$

$$L_{hard}=-\sum_{j}^{N} c_{j} \log \left(q_{j}^{1}\right), \text{ where } q_{i}^{1}=\frac{\exp \left(z_{i}\right)}{\sum_{j}^{N} \exp \left(z_{j}\right)}$$

The first term learns from the teacher model; the second learns from the ground truth.

The temperature controls how much Net-S attends to the negative labels during training: at low temperature, the negative labels, especially those far below the average, receive little attention; at high temperature their values are relatively amplified, so Net-S pays comparatively more attention to them.
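The two terms above can be sketched in NumPy; the function names and the values of $\alpha$, $\beta$, and $T$ here are illustrative choices, not from the original formulation:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T flattens the distribution.
    z = logits / T
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(v, z, c, T=4.0, alpha=0.9, beta=0.1):
    # v: teacher logits, z: student logits, c: one-hot ground-truth labels.
    # L = alpha * L_soft + beta * L_hard, matching the formulas above.
    p_T = softmax(v, T)      # teacher soft targets p^T
    q_T = softmax(z, T)      # student soft predictions q^T
    q_1 = softmax(z, 1.0)    # student predictions at T = 1
    L_soft = -np.sum(p_T * np.log(q_T))
    L_hard = -np.sum(c * np.log(q_1))
    return alpha * L_soft + beta * L_hard
```

In practice the soft term is often additionally scaled by $T^2$ (as suggested by Hinton et al.) so that its gradient magnitude stays comparable to the hard term as $T$ changes.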

Main idea: Multiple modalities to multiple tasks


Loss Function

$$\mathcal{L}=\sum_{m} \sum_{t} \lambda_{m, t} \mathcal{L}_{m, t}+\sum_{d} \lambda_{d} \mathcal{L}_{d}$$

where

- $\lambda$ are the loss weights,
- $\mathcal{L}_{m,t}$ is the loss of modality $m$ on task $t$, and
- $\mathcal{L}_{d}$ is the $L_2$ distance between a layer $M_i$ of the main network and the corresponding layer $L_i$ of another network:

$$\mathcal{L}_{d}\left(L_{i}, M_{i}\right)=\left\|L_{i}-M_{i}\right\|_{2}$$

Evolution Algorithm

A genetic algorithm (GA) is used to determine the weights $\lambda$.

Each $\lambda_{m,t}$ or $\lambda_{d}$ is constrained to lie in $[0,1]$.
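A minimal sketch of how the evolved weights enter the combined objective; the container shapes and names are assumptions for illustration:

```python
def combined_loss(task_losses, lam_mt, distill_losses, lam_d):
    # task_losses / lam_mt: dicts keyed by (modality, task) pairs;
    # distill_losses / lam_d: parallel lists of distillation losses
    # and their weights. All weights are evolved, not learned by SGD.
    total = sum(lam_mt[key] * loss for key, loss in task_losses.items())
    total += sum(w * loss for w, loss in zip(lam_d, distill_losses))
    return total
```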

Unsupervised loss function

Zipf Distribution matching (ELo)

Cluster centroids $\left\{c_{1}, c_{2}, \ldots, c_{k}\right\}$, where $c_{i} \in \mathbb{R}^{D}$.

Naively assuming all clusters have the same variance, and setting $2\sigma^2 = 1$,

we can compute the probability of a feature vector $x \in \mathbb{R}^D$ belonging to a cluster $c_i$ as
$$p\left(x \mid c_{i}\right)=\frac{1}{\sqrt{2 \sigma^{2} \pi}} \exp \left(-\frac{\left(x-c_{i}\right)^{2}}{2 \sigma^{2}}\right)$$
By Bayes' rule, assuming a uniform prior $p(c_i)$ so that it cancels:
$$\begin{aligned} p\left(c_{i} \mid x\right) &=\frac{p\left(c_{i}\right) p\left(x \mid c_{i}\right)}{\sum_{j}^{k} p\left(c_{j}\right) p\left(x \mid c_{j}\right)}=\frac{\exp \left(-\frac{\left(x-c_{i}\right)^{2}}{2 \sigma^{2}}\right)}{\sum_{j=1}^{k} \exp \left(-\frac{\left(x-c_{j}\right)^{2}}{2 \sigma^{2}}\right)} \\ &=\frac{\exp \left(-\left(x-c_{i}\right)^{2}\right)}{\sum_{j=1}^{k} \exp \left(-\left(x-c_{j}\right)^{2}\right)} \end{aligned}$$
which is the standard softmax function over negative squared distances to the centroids.
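Since the posterior reduces to a softmax over negative squared distances, it can be sketched directly; this hypothetical helper assumes the unit-variance setting ($2\sigma^2 = 1$) above:

```python
import numpy as np

def cluster_posterior(x, centroids):
    # p(c_i | x) under equal-variance Gaussians with 2*sigma^2 = 1:
    # a softmax over negative squared distances to the centroids.
    d2 = np.sum((centroids - x) ** 2, axis=1)  # squared distance to each c_i
    logits = -d2
    logits = logits - logits.max()             # numerical stability
    e = np.exp(logits)
    return e / e.sum()
```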

Given the above probability of each video belonging to each cluster, we take the prior probability of each class from the Zipf distribution: $q\left(c_{i}\right)=\frac{1 / i^{s}}{H_{k, s}}$, where $H_{k,s}$ is the $k$-th generalized harmonic number and $s$ is a real constant.

p(ci)=1NxVp(cix)p\left(c_{i}\right)=\frac{1}{N} \sum_{x \in V} p\left(c_{i} \mid x\right), the average over all videos in the set.

The KL divergence between the empirical distribution $p$ and the Zipf prior $q$ is:
$$K L(p \| q)=\sum_{i=1}^{k} p\left(c_{i}\right) \log \left(\frac{p\left(c_{i}\right)}{q\left(c_{i}\right)}\right)$$
This will be our fitness function.

It imposes a prior constraint on the distribution of the (learned) video representations over the clusters, requiring it to follow the Zipf distribution.
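A small sketch of the Zipf prior and the KL fitness; sorting the empirical distribution in descending order before the comparison is an assumption made here, since cluster identities are arbitrary:

```python
import numpy as np

def zipf_prior(k, s=1.0):
    # q(c_i) = (1/i^s) / H_{k,s}; the denominator w.sum() is exactly
    # the generalized harmonic number H_{k,s}.
    w = 1.0 / np.arange(1, k + 1, dtype=float) ** s
    return w / w.sum()

def kl_fitness(p, q):
    # KL(p || q): lower means the empirical cluster usage p better
    # matches the (descending) Zipf prior q.
    p = np.sort(np.asarray(p, dtype=float))[::-1]
    return float(np.sum(p * np.log(p / q)))
```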

Loss Evolution

Candidate loss weights are evolved with tournament selection and CMA-ES.
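A schematic of the tournament-selection part only (CMA-ES is omitted); the population size, mutation scale, and tournament size are arbitrary illustrative values, not the paper's settings:

```python
import random

def evolve_weights(fitness, n_weights, pop_size=16, generations=10, k=3):
    # Evolve a weight vector lambda in [0, 1]^n_weights by tournament
    # selection; fitness(lam) returns the value to minimize (e.g. the
    # KL divergence to the Zipf prior).
    pop = [[random.random() for _ in range(n_weights)]
           for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = []
        for _ in range(pop_size):
            # Tournament: sample k candidates, keep the fittest.
            winner = min(random.sample(pop, k), key=fitness)
            # Gaussian mutation, clipped so each weight stays in [0, 1].
            child = [min(1.0, max(0.0, w + random.gauss(0.0, 0.1)))
                     for w in winner]
            new_pop.append(child)
        pop = new_pop
    return min(pop, key=fitness)
```

Note that the fitness is the unsupervised KL objective above, so the whole search never touches labels.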