Paper Review: Seeing Voices and Hearing Faces: Cross-modal Biometric Matching

code: https://github.com/a-nagrani/SVHF-Net
project URL: http://www.robots.ox.ac.uk/~vgg/research/CMBiometrics

Summary

The authors want to investigate how we associate a voice with a face. The work builds on the VGGFace and VoxCeleb datasets. Its main contributions can be summarized as follows:

  1. Introduces CNN architectures for binary and multi-way cross-modal matching between faces and audio.
  2. Compares dynamic testing (where video is available, but the audio comes from a different video) with static testing (where only a single still image is available).
  3. Finds that the CNN matches human performance on easy examples (faces of different gender) but exceeds it on challenging examples (faces of the same gender, age, and nationality).

Abstract

We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer from a voice about the face, and vice versa? We study this task "in the wild", employing the datasets that are publicly available for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide training and testing scenarios for both static and dynamic testing of cross-modal matching. We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching; (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available); and (iii) we use human testing as a baseline to calibrate the difficulty of the task. We show that CNNs can indeed be trained to solve this task in both the static and dynamic scenarios, and even well exceed chance on 10-way classification of the face given the voice. The CNNs match human performance on easy examples (e.g. different gender across faces) but exceed human performance on more challenging examples (e.g. faces with the same gender, age, and nationality).


Research Objective

The authors aim to explore whether we can match people by audio alone: given an audio clip of a voice, determine which of two or more face images or videos it corresponds to. Note that the voice and face video are not acquired simultaneously, so active speaker detection methods that rely on synchronisation of audio and lip motion, e.g. [11], cannot be employed here.

Background and Problems

  • Background

    • Age, gender, and ethnicity/accent influence both facial appearance and voice.
    • Besides the above static properties, Sheffert and Olson [40] suggested that visual information about a person’s particular idiosyncratic speaking style is related to the speaker’s auditory attributes.
  • Brief introduction of previous methods

    • Not stated; this appears to be among the first work on this task.
  • Problem Statement

    • Not explicitly stated in the introduction.

Related work

  • Human Perception Studies:

    • The broad consensus of research exploring cross-modal matching of faces and voices with human participants is that matching is only possible when dynamic visual information about articulation patterns is available [19, 26, 37].
  • Problem Statement

    • It is worth noting that the difficulty of the task is highly dependent on the specific stimuli sets provided.
  • Face Recognition and Speaker Identification:

    • We note that the recent advent of deep CNNs with large datasets has considerably advanced the state-of-the-art in both face recognition [21, 36, 46, 47] and speaker recognition [14, 33, 39, 45].
  • Problem Statement

    • Unfortunately, while these recognition models have proven remarkably effective at representation learning from a single modality, the alignment of learned representations across the modalities is less developed.
  • Cross-modal Matching

    • Cross-modal matching has received considerable attention using visual data and text (natural language). Methods have been developed to establish mappings from images [16, 20, 23, 25, 50] and videos [49] to textual descriptions (e.g. captioning), generating visual models from text [51, 57] and solving visual question answering problems [1, 29, 31].
  • Problem Statement

    • In cross-modal matching between video and audio however, work is limited, particularly in the field of biometrics (person or speaker recognition).

Summary: only one prior study has done relevant work [38], but it did not use a large dataset or still face images.

Method(s)

  • Methods
    • (1) The static 3-stream CNN architecture, consisting of two face sub-networks and one voice sub-network.
    • (2) A 5-stream dynamic-fusion architecture with two extra streams as dynamic feature sub-networks.
    • (3) The N-way classification architecture, which can handle any number of face inputs at test time thanks to query pooling.


  • Input :
    • Voices: a 512 × 300 spectrogram for three seconds of speech (see the sketch after this list).

    • Static Faces: an RGB image, cropped from a source image to contain only the region surrounding a face; size 224 × 224.

    • Dynamic Faces: representations including 3D convolutions [18], optical flow [41], and dynamic images [6], which have proven particularly effective in the context of human action recognition.
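
As a rough illustration, here is a minimal sketch of how a 512 × 300 spectrogram input could be prepared from three seconds of audio; the exact STFT parameters (window, hop, normalisation) are assumptions, not confirmed by the paper.

```python
import numpy as np
import librosa

def audio_to_spectrogram(wav_path, sr=16000, duration=3.0):
    """Prepare a 512 x 300 magnitude spectrogram from 3 s of speech.
    25 ms windows with a 10 ms hop give ~300 frames for 3 s at 16 kHz;
    these settings are assumptions for illustration."""
    y, _ = librosa.load(wav_path, sr=sr, duration=duration)
    stft = librosa.stft(y, n_fft=1024, win_length=400, hop_length=160)
    mag = np.abs(stft)[:512, :300]  # keep 512 frequency bins, 300 frames
    # per-bin mean/variance normalisation, common for speech CNN inputs
    mag = (mag - mag.mean(axis=1, keepdims=True)) / (mag.std(axis=1, keepdims=True) + 1e-8)
    return mag  # shape (512, 300)
```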

  • Architectures
    • Static Architecture: the base architecture comprises two face sub-networks and one voice sub-network. Both the face and voice streams use the VGG-M architecture [10].
    • Dynamic-Fusion Architecture: the features computed for each face (RGB + dynamic) are combined after the final fully connected layer in each stream through summation.
    • N-way Classification Architecture: to handle more than two faces, the voice features are concatenated to each face stream separately. The authors also add a mean-pooling layer to each face stream which computes the 'mean face' of all the faces in a particular query, thereby making each stream context-aware (see the sketch after this list).
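
To make the stream layout concrete, here is a minimal PyTorch sketch of the 3-stream static architecture and the query-pooling idea; the sub-network bodies stand in for VGG-M, and the embedding/classifier sizes (and weight sharing between face streams) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SVHFStatic(nn.Module):
    """Sketch of the 3-stream static architecture: two face streams and one
    voice stream; features are concatenated and classified to decide which
    face matches the voice."""

    def __init__(self, face_net, voice_net, emb_dim=1024):
        super().__init__()
        self.face_net = face_net    # VGG-M-style face sub-network (shared)
        self.voice_net = voice_net  # VGG-M-style voice sub-network
        self.classifier = nn.Sequential(
            nn.Linear(3 * emb_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2),     # 2-way: which face matches the voice
        )

    def forward(self, face_a, face_b, voice):
        fa = self.face_net(face_a)   # (B, emb_dim)
        fb = self.face_net(face_b)   # (B, emb_dim)
        v = self.voice_net(voice)    # (B, emb_dim)
        return self.classifier(torch.cat([fa, fb, v], dim=1))

def query_pooled_logits(face_feats, voice_feat, stream_head):
    """N-way variant with query pooling: every face stream also sees the
    'mean face' of the whole query, making it context-aware.
    face_feats: (B, N, D), voice_feat: (B, D),
    stream_head: maps concatenated 3*D features to one logit per face."""
    mean_face = face_feats.mean(dim=1, keepdim=True).expand_as(face_feats)
    voice = voice_feat.unsqueeze(1).expand_as(face_feats)
    per_face = torch.cat([face_feats, mean_face, voice], dim=-1)  # (B, N, 3D)
    return stream_head(per_face).squeeze(-1)                      # (B, N)
```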

Evaluation And Experiment

  • Dataset distribution:
  • VGGFace
  • VoxCeleb
  • Train/Test Split:
    All speakers whose names start with ‘A’ or ‘B’ are reserved for validation, while speakers with names starting with ‘C’, ‘D’, ‘E’ are reserved for testing.
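
A minimal sketch of this alphabetical split (the speaker names are hypothetical):

```python
def split_speakers(speakers):
    """Split speakers by the first letter of their name, following the
    protocol above: A/B -> validation, C/D/E -> test, the rest -> training."""
    val   = [s for s in speakers if s[0].upper() in "AB"]
    test  = [s for s in speakers if s[0].upper() in "CDE"]
    train = [s for s in speakers if s[0].upper() not in "ABCDE"]
    return train, val, test

train, val, test = split_speakers(["Alice_K", "Bob_M", "Carol_P", "Frank_R"])
```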

  • Gender, Nationality and Age (GNA) Variation:

    • These labels are used to construct a more challenging test set, wherein the speakers in each triplet share the same gender, broad age bracket, and nationality.
  • Training Protocol

    • Batch size and optimizer settings are given in the paper.
    • Pre-trained weights come from the VGGFace and VoxCeleb models.
    • Data augmentation follows the techniques used for ImageNet classification by [42] (i.e. random cropping, flipping, colour shift). For the audio segments, the speed of each segment is changed by a random ratio between 0.95 and 1.05 (see the sketch after this list).
    • Networks are trained for 10 epochs, or until validation error stops decreasing, whichever is sooner.
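
A minimal sketch of this augmentation, assuming PyTorch/torchvision on the image side; the exact crop and colour-shift parameters are assumptions.

```python
import random
import numpy as np
from torchvision import transforms

# Image-side augmentation in the spirit of [42]: random crop, flip, colour shift
face_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])

def speed_perturb(waveform):
    """Audio-side augmentation: change playback speed by a random ratio in
    [0.95, 1.05], approximated here by linear interpolation over indices."""
    ratio = random.uniform(0.95, 1.05)
    idx = np.arange(0, len(waveform), ratio)
    return np.interp(idx, np.arange(len(waveform)), waveform)
```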
  • Method

    • Static Matching: training uses both still images from VGGFace and frames extracted from the videos in the VoxCeleb dataset. When processing frames extracted from the VoxCeleb videos, the authors ensure that the audio segments and frames in a single triplet are not sourced from the same video (see the sketch after this list).
    • Dynamic Matching: the authors experiment with different methods for extracting dynamic information from a face-track.
    • N-way Classification: not detailed in this review.
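
A minimal sketch of triplet sampling under the different-video constraint; the data structure is hypothetical, not the authors' actual pipeline.

```python
import random

def sample_static_triplet(clips_by_speaker):
    """Sample a (voice, positive face frame, negative face frame) triplet,
    ensuring the audio and the positive frame come from different videos.
    `clips_by_speaker`: speaker -> list of (video_id, audio, frame);
    assumes each speaker has clips from at least two videos."""
    pos_spk, neg_spk = random.sample(list(clips_by_speaker), 2)
    video_id, audio, _ = random.choice(clips_by_speaker[pos_spk])
    # positive face: same speaker, but drawn from a different video
    others = [c for c in clips_by_speaker[pos_spk] if c[0] != video_id]
    _, _, pos_frame = random.choice(others)
    _, _, neg_frame = random.choice(clips_by_speaker[neg_spk])
    return audio, pos_frame, neg_frame
```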
  • Metrics: the authors define two metrics to evaluate performance, Identification Accuracy and Marginal Accuracy (see the sketch after this list).

    • Identification accuracy
    • Marginal accuracy
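
A minimal sketch of how these metrics could be computed from per-triplet predictions; interpreting marginal accuracy as accuracy marginalised over each face-voice identity combination is an assumption based on the description in the Analysis section below.

```python
from collections import defaultdict

def identification_accuracy(predictions):
    """predictions: list of (correct: bool, identity_pair) per test triplet."""
    return sum(correct for correct, _ in predictions) / len(predictions)

def marginal_accuracies(predictions):
    """Accuracy computed separately for each face-voice identity combination,
    exposing which combinations are more discriminative than others."""
    per_pair = defaultdict(list)
    for correct, pair in predictions:
        per_pair[pair].append(correct)
    return {pair: sum(v) / len(v) for pair, v in per_pair.items()}
```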
  • Baselines: there are no prior baselines to compare to, so the authors use human performance as the benchmark.

  • Results :

    • Static and Dynamic Matching: results for the static and dynamic cases are reported in Table 2. The results for the dynamic task are better than those for the static task (by more than 3% for the V-F case).
    • N-way Classification: performance remains well above chance even for 10-way classification of the face given the voice.
  • Analysis:

    • Comparison to the Human Benchmark: on the more challenging test set with GNA variation removed, however, human performance is significantly lower.

    • Marginal Accuracies: some face-voice combinations are significantly more discriminative than others.

  • Ablation Analysis: the authors experiment with three different methods of incorporating dynamic features in the architecture.

    • As seen in Figure 5, it is harder to discern latent variables like age, gender, and ethnicity in dynamic images, while mouth motion is clearly encoded. Using these dynamic images alone, the network still achieves an accuracy of 77%, suggesting that it may be able to exploit dynamic cross-modal biometrics (a sketch of the dynamic-image construction follows).
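
For reference, a minimal sketch of the approximate rank pooling used to build dynamic images in [6]; the coefficient formula follows that paper, and the function here is an illustrative reimplementation, not the authors' code.

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a face-track into a single 'dynamic image' via approximate
    rank pooling [6]: a weighted sum of frames with alpha_t = 2t - T - 1."""
    T = len(frames)
    alphas = np.array([2 * t - T - 1 for t in range(1, T + 1)], dtype=np.float32)
    stacked = np.stack(frames).astype(np.float32)  # (T, H, W, C)
    return np.tensordot(alphas, stacked, axes=1)   # (H, W, C)
```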

Conclusion

  • Main contributions
  1. In this paper, the authors introduce the novel task of cross-modal matching between faces and voices, and propose corresponding CNN architectures to address it.
  2. The results of the experiments strongly suggest the existence of cross-modal biometric information, leading to the conclusion that perhaps our faces are more similar to our voices than we think.
  • Weak points
  1. Not discussed in the paper.
  • Further work
  1. Not discussed in the paper.

Reference (optional)

A blog post explaining this paper in Chinese.

Takeaways for me

  • This paper is grounded in careful fundamental experiments; I should study them to make our own experiments more rigorous.
  • There is not much prior work in this area, and the dataset is not large (maybe 30 GB), so I could reproduce it if the authors' published code runs.
  • There is newer research that compares against this work and achieves a better score on the cross-modal matching task; a Chinese-language introduction to that paper is available.