Reading Notes on "Learning Cross-modal Embeddings for Cooking Recipes and Food Images"

Category: Articles • 2025-02-04 12:41:40

Paper link:

https://www.researchgate.net/publication/320964718_Learning_Cross-Modal_Embeddings_for_Cooking_Recipes_and_Food_Images

Source: CVPR 2017

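I. Introduction

Task addressed by the paper (recipe retrieval):

Input: an image (or a recipe text) plus the dataset. Output: a ranked list of recipe texts (or images).

The paper introduces the Recipe1M dataset and trains a neural network that learns a joint embedding of recipes and images, which is then applied to the image-recipe retrieval task. It also shows that regularization through an added high-level classification objective improves retrieval performance.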
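To make the input/output of the retrieval task concrete, here is a minimal sketch of ranking recipes for one image query. It assumes that both modalities already live in the same embedding space and that candidates are ordered by cosine similarity; the 3-dimensional vectors, the recipe names, and the helper `rank_candidates` are made up for illustration and are not from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(query, candidates):
    """Return candidate ids sorted from most to least similar to the query."""
    scores = {cid: cosine(query, vec) for cid, vec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy embeddings; real ones would come from the trained encoders.
image_query = [1.0, 0.0, 0.2]
recipe_embeddings = {
    "pancakes": [0.9, 0.1, 0.3],
    "salad": [0.0, 1.0, 0.0],
    "waffles": [0.5, 0.5, 0.0],
}

ranking = rank_candidates(image_query, recipe_embeddings)
print(ranking)
```

The same function covers the opposite direction (recipe query, image candidates), since only the roles of the two embedding sets are swapped.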

II. Recipe1M dataset

The paper introduces the Recipe1M dataset, which contains one million structured cooking recipes with associated images.

Dataset download:

im2recipe.csail.mit.edu

III. Model

Text and images are mapped into a shared subspace, and a cosine similarity loss and a softmax (classification) loss are applied in that subspace.

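1. Learning Embeddings

1) Representation of recipes

Ingredients: a bidirectional LSTM is run over pre-trained embedding vectors obtained with the word2vec algorithm. (The ingredient list is an unordered set, so a bidirectional LSTM is used, which considers both the forward and the backward ordering.)

Idea: the Bi-LSTM could be replaced with self-attention.

Cooking Instructions: each instruction/sentence is represented by a skip-instructions vector, and an LSTM is then run over the sequence of these vectors to obtain a single vector representing all instructions.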

2) Representation of food images

VGG-16 and ResNet-50 (a deep residual network) models are used, with the final softmax classification layer removed.

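2. Joint Neural Embedding

The output vectors of the two encoders (ingredients and instructions) are concatenated and mapped into the shared subspace. The image vector is mapped into the same space by a linear transformation.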
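The concatenate-then-project step can be sketched in a few lines. This is only an illustration under assumptions: the vector sizes, the weight values, and the helper `linear` are hypothetical, and the recipe side is modeled here as a single affine map applied to the concatenated encoder outputs.

```python
def linear(weights, bias, x):
    """Compute W x + b, with the weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# Hypothetical encoder outputs (toy sizes, not the real dimensions).
h_instructions = [2.0, 0.0, 1.0]   # output of the instructions encoder
h_ingredients = [0.5, -1.0]        # output of the ingredients encoder
image_features = [1.0, 3.0]        # CNN features with the classifier removed

# Concatenate the two recipe encoders' outputs into one recipe vector.
h_recipe = h_instructions + h_ingredients

# Hypothetical learned parameters mapping each modality to a 2-d shared space.
W_r = [[0.1, 0.0, 0.2, 0.0, 0.3],
       [0.0, 0.5, 0.0, 0.1, 0.0]]
b_r = [0.0, 0.1]
W_v = [[0.2, 0.1],
       [-0.1, 0.4]]
b_v = [0.05, 0.0]

phi_r = linear(W_r, b_r, h_recipe)        # recipe embedding
phi_v = linear(W_v, b_v, image_features)  # image embedding
print(phi_r, phi_v)
```

Both outputs have the same dimensionality, which is what allows them to be compared directly with cosine similarity.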

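$(s, g, v)$: (the sequence of instruction vectors, the sequence of ingredient vectors, the associated image vector)

Objective:

1. Maximize the cosine similarity between positive (matching) recipe-image pairs.

2. Minimize the cosine similarity between all non-matching recipe-image pairs, up to a specified margin.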

Ingredient vector: $h_{k}^{g}$; instruction vector: $h_{k}^{s}$; after concatenation: $h_{k}^{R}$

Image vector: $v_{k}$

Mapping into the shared subspace:

$$\phi^{r} = W^{r} h_{k}^{R} + b^{r}$$

$$\phi^{v} = W^{v} v_{k} + b^{v}$$

Here $W^{r}$, $W^{v}$ are learned projection matrices and $b^{r}$, $b^{v}$ are bias terms.

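Loss function:

$$L_{cos}\big((\phi^{r}, \phi^{v}), y\big) = \begin{cases} 1 - \cos(\phi^{r}, \phi^{v}), & y = 1 \\ \max\big(0, \cos(\phi^{r}, \phi^{v}) - \alpha\big), & y = -1 \end{cases}$$

Here $\alpha$ is the margin; $y=1$ denotes a positive pair and $y=-1$ a negative pair.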
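A small sketch of a cosine loss with a margin may help to see the two goals at work: matching pairs are pulled toward similarity 1, and non-matching pairs are penalized only while their similarity exceeds the margin. The function name `cosine_embedding_loss` and the toy vectors are illustrative assumptions; the default `alpha=0.1` is the margin value reported in the experiments section of these notes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def cosine_embedding_loss(phi_r, phi_v, y, alpha=0.1):
    """Margin-based cosine loss for one recipe-image pair.

    y = 1  : positive pair, loss is 1 - cos
    y = -1 : negative pair, loss is max(0, cos - alpha)
    """
    sim = cosine(phi_r, phi_v)
    if y == 1:
        return 1.0 - sim
    if y == -1:
        return max(0.0, sim - alpha)
    raise ValueError("y must be 1 or -1")

# Toy vectors (hypothetical): an aligned pair and an orthogonal pair.
aligned = ([1.0, 0.0], [2.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 1.0])

print(cosine_embedding_loss(*aligned, y=1))      # matching and aligned
print(cosine_embedding_loss(*aligned, y=-1))     # mismatched but aligned
print(cosine_embedding_loss(*orthogonal, y=-1))  # mismatched and far apart
```

A mismatched pair whose similarity is already below the margin contributes zero loss, so training effort concentrates on the hard negatives.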

3. Semantic Regularization

The idea behind semantic regularization: "The key idea is that if high-level discriminative weights are shared, then both of the modalities (recipe and image embeddings) should utilize these weights in a similar way which brings another level of alignment based on discrimination. We optimize this objective together with our joint embedding loss."

Given the recipe and image embeddings $\phi^{r}$ and $\phi^{v}$, class scores are computed as

$$p^{r} = W^{c} \phi^{r}$$

$$p^{v} = W^{c} \phi^{v}$$

followed by a softmax activation to obtain class probabilities.

$W^{c}$ is a matrix of learned weights that is shared between the image and recipe embeddings to promote semantic alignment between them.
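The weight sharing can be illustrated with a tiny classifier head applied to both modalities. This is a sketch under assumptions: the 3-class, 2-dimensional weight matrix `W_c`, the embeddings, and the helper names are made up, and no bias term is used.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(W_c, phi):
    """Apply the shared weight matrix to an embedding, then softmax."""
    logits = [sum(w * x for w, x in zip(row, phi)) for row in W_c]
    return softmax(logits)

# Hypothetical shared classifier: 3 semantic classes over a 2-d embedding.
W_c = [[1.0, 0.0],
       [0.0, 1.0],
       [0.5, 0.5]]

phi_r = [2.0, 0.1]  # toy recipe embedding
phi_v = [1.8, 0.3]  # toy embedding of the matching image

p_r = class_probabilities(W_c, phi_r)
p_v = class_probabilities(W_c, phi_v)
print(p_r, p_v)
```

Because the same `W_c` scores both embeddings, a matching pair can only be classified into the same category if the two embeddings point in similar directions, which is the extra alignment signal the quoted passage describes.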

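The semantic regularization loss is written as

$$L_{reg}(\phi^{r}, \phi^{v}, c^{r}, c^{v})$$

where $c^{r}$ and $c^{v}$ are the semantic category labels of the recipe and the image, respectively.

Note that if $(\phi^{r}, \phi^{v})$ is a positive pair, then $c^{r}$ and $c^{v}$ are the same.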

The overall objective is:

$$L(\phi^{r}, \phi^{v}, c^{r}, c^{v}, y) = L_{cos}\big((\phi^{r}, \phi^{v}), y\big) + \lambda L_{reg}(\phi^{r}, \phi^{v}, c^{r}, c^{v})$$
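As a rough sketch of how the two terms could be combined, the snippet below adds a weighted regularization term to a given cosine loss value. One assumption is made loudly here: the regularizer is taken to be the sum of the cross-entropy (negative log-probability of the true class) for the recipe and for the image, which is a common choice but is not spelled out in these notes. The function name `total_loss` and the toy numbers are hypothetical; `lam=0.02` is the weight reported in the experiments section.

```python
import math

def total_loss(cos_loss, p_r, p_v, c_r, c_v, lam=0.02):
    """Combine the embedding loss with a weighted classification penalty.

    ASSUMPTION: the regularizer is the sum of the two cross-entropy terms,
    -log p_r[c_r] - log p_v[c_v]; the notes do not give its exact form.
    """
    reg = -math.log(p_r[c_r]) - math.log(p_v[c_v])
    return cos_loss + lam * reg

# Toy values: a cosine loss of 0.2 and two-class probability vectors.
loss = total_loss(cos_loss=0.2,
                  p_r=[0.5, 0.5], p_v=[0.25, 0.75],
                  c_r=0, c_v=1)
print(loss)
```

With a small weight such as 0.02, the classification term acts as a gentle regularizer and the retrieval (cosine) term dominates the gradient.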

IV. Experimental Results

Margin: $\alpha = 0.1$

Regularization weight: $\lambda = 0.02$
