Reading Notes on "Learning Cross-modal Embeddings for Cooking Recipes and Food Images"

Paper link:

https://www.researchgate.net/publication/320964718_Learning_Cross-Modal_Embeddings_for_Cooking_Recipes_and_Food_Images

Source: CVPR 2017

I. Introduction

What the paper does (recipe retrieval):

Input: an image (or a recipe text) plus the dataset      Output: a ranked list of recipe texts (or images)

The paper introduces the Recipe1M dataset and trains a neural network that learns a joint embedding of recipes and food images, applied to the image-to-recipe retrieval task. It also shows that adding regularization via a high-level classification objective improves retrieval performance.

II. The Recipe1M dataset

The paper introduces Recipe1M, a dataset of about one million structured cooking recipes with associated images.

Dataset download:

im2recipe.csail.mit.edu


III. Model

 


Text and image are each mapped into a shared subspace, where a cosine similarity loss and a softmax (classification) loss are applied.

1. Learning Embeddings

1) Representation of recipes

Ingredients: pretrained embedding vectors obtained with word2vec are fed into a bidirectional LSTM. (The ingredient list is an unordered set, so a bidirectional LSTM is used, which considers both the forward and backward orderings.)
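A minimal PyTorch-style sketch of such an ingredient encoder (the framework, class name, and dimensions are my assumptions, not the paper's released code):

```python
import torch
import torch.nn as nn

class IngredientEncoder(nn.Module):
    """Bi-LSTM over pretrained word2vec ingredient embeddings (sketch)."""
    def __init__(self, word2vec_weights, hidden_dim=300):
        super().__init__()
        # Pretrained word2vec vectors, kept frozen
        self.embed = nn.Embedding.from_pretrained(word2vec_weights, freeze=True)
        self.bilstm = nn.LSTM(word2vec_weights.size(1), hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, ingredient_ids):
        # ingredient_ids: (batch, num_ingredients) integer indices
        x = self.embed(ingredient_ids)          # (batch, n, emb_dim)
        _, (h_n, _) = self.bilstm(x)            # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward states -> (batch, 2 * hidden_dim)
        return torch.cat([h_n[0], h_n[1]], dim=1)
```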

Idea: the Bi-LSTM could be replaced with self-attention.

Cooking Instructions: each instruction/sentence is represented by a skip-instructions vector, and an LSTM is then run over the sequence of these vectors to obtain a representation of all the instructions.
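Continuing the sketch above, the instruction encoder could look like the following, assuming each instruction has already been turned into a fixed-size skip-instructions vector (the skip-thought-style sentence encoder itself is not shown):

```python
class InstructionEncoder(nn.Module):
    """LSTM over precomputed skip-instructions sentence vectors (sketch)."""
    def __init__(self, sent_dim=1024, hidden_dim=1024):
        super().__init__()
        self.lstm = nn.LSTM(sent_dim, hidden_dim, batch_first=True)

    def forward(self, instr_vectors):
        # instr_vectors: (batch, num_instructions, sent_dim)
        _, (h_n, _) = self.lstm(instr_vectors)
        return h_n[-1]      # final hidden state: (batch, hidden_dim)
```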

2) Representation of food images

VGG-16 and ResNet-50 (a deep residual network) are used as image encoders, with the final softmax classification layer removed.
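A sketch of the image branch using torchvision (the choice of torchvision and the pretrained-weights flag are assumptions for illustration):

```python
import torch.nn as nn
from torchvision import models

# ResNet-50 pretrained on ImageNet, with the final classification layer
# replaced by an identity so the network outputs 2048-d pooled features.
resnet = models.resnet50(pretrained=True)
resnet.fc = nn.Identity()
# image_features = resnet(image_batch)   # (batch, 2048)
```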

2. Joint Neural Embedding

The output vectors of the ingredient and instruction encoders are concatenated and mapped into the shared subspace; the image vector is mapped into the same space by a linear transformation.
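Continuing the sketches above, the concatenation and the two linear maps into the shared space might look like this (all dimensions are assumed):

```python
class JointEmbedding(nn.Module):
    """Project recipe and image features into a shared space (sketch)."""
    def __init__(self, ingr_dim=600, instr_dim=1024, img_dim=2048, joint_dim=1024):
        super().__init__()
        self.recipe_proj = nn.Linear(ingr_dim + instr_dim, joint_dim)
        self.image_proj = nn.Linear(img_dim, joint_dim)

    def forward(self, h_g, h_s, v):
        # h_g: ingredient vector, h_s: instruction vector, v: image feature
        phi_r = self.recipe_proj(torch.cat([h_g, h_s], dim=1))   # recipe embedding
        phi_v = self.image_proj(v)                                # image embedding
        return phi_r, phi_v
```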


A training example is a triplet (s, g, v): the sequence vector of the instructions, the sequence vector of the ingredients, and the associated image vector.

Objective:
1. maximize the cosine similarity between matching recipe-image pairs;
2. push the cosine similarity of all non-matching recipe-image pairs down below a specified margin.

Notation (for the k-th example): ingredient vector h_k^g, instruction vector h_k^s, concatenated recipe vector h_k^R, and image vector v_k.

Mapping into the common subspace:

\phi^r_k = W^r h^R_k + b^r

\phi^v_k = W^v v_k + b^v

Loss function:

L_{cos}(\phi^r, \phi^v; y) = \begin{cases} 1 - \cos(\phi^r, \phi^v) & \text{if } y = 1 \\ \max(0,\ \cos(\phi^r, \phi^v) - \alpha) & \text{if } y = -1 \end{cases}

where y = 1 denotes a positive (matching) pair and y = -1 a negative pair.
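This piecewise loss matches the behaviour of PyTorch's built-in CosineEmbeddingLoss, so a minimal version of the retrieval loss could be written as:

```python
import torch.nn as nn

# y = +1 for matching recipe-image pairs, -1 for mismatched pairs;
# margin plays the role of alpha (0.1 in the paper's experiments).
cos_loss = nn.CosineEmbeddingLoss(margin=0.1)
loss_cos = cos_loss(phi_r, phi_v, y)   # phi_r, phi_v: (batch, joint_dim), y: (batch,)
```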

3. Semantic Regularization

The idea behind semantic regularization, in the authors' words: "The key idea is that if high-level discriminative weights are shared, then both of the modalities (recipe and image embeddings) should utilize these weights in a similar way which brings another level of alignment based on discrimination. We optimize this objective together with our joint embedding loss."

The embeddings \phi^r and \phi^v are each multiplied by a shared weight matrix W_c and passed through a softmax to obtain class probabilities:

p^r = \mathrm{softmax}(W_c \phi^r)

p^v = \mathrm{softmax}(W_c \phi^v)

W_c is a learned weight matrix that is shared between the image and recipe embeddings to encourage semantic alignment between them.

The semantic regularization loss is:

L_{reg}(\phi^r, \phi^v, c_r, c_v) = -\log p^r_{c_r} - \log p^v_{c_v}

where c_r and c_v are the semantic class labels of the recipe and the image, respectively.

Note that if (\phi^r, \phi^v) is a positive pair, then c_r and c_v are identical.
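A sketch of this regularization branch, with one shared linear layer playing the role of W_c and cross-entropy applied to both modalities (the layer sizes and class count are placeholder assumptions):

```python
num_classes = 1000            # number of semantic categories (placeholder value)
shared_classifier = nn.Linear(1024, num_classes)   # shared W_c over the joint space

logits_r = shared_classifier(phi_r)   # recipe class scores
logits_v = shared_classifier(phi_v)   # image class scores

# CrossEntropyLoss applies the softmax internally; c_r, c_v are the class labels
ce = nn.CrossEntropyLoss()
loss_reg = ce(logits_r, c_r) + ce(logits_v, c_v)
```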

The overall objective is:

L(\phi^r, \phi^v, c_r, c_v; y) = L_{cos}(\phi^r, \phi^v; y) + \lambda \, L_{reg}(\phi^r, \phi^v, c_r, c_v)
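In code, continuing the sketches above, the total training loss simply combines the two terms:

```python
lam = 0.02                                # regularization weight lambda
total_loss = loss_cos + lam * loss_reg    # joint objective to minimize
```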

IV. Experimental results

Hyperparameters: the margin is set to \alpha = 0.1 and the regularization weight to \lambda = 0.02.
