Audio-visual identity grounding for cross-media search
Automatically searching for media clips in large heterogeneous datasets is an inherently difficult challenge, and nearly impossible when the search crosses distinct media types (e.g., finding audio clips that match an image). In this paper we introduce identity grounding as a means of enabling this cross-media search and exploration capability. Through grounding, we leverage one media channel (e.g., visual identity) as a source of noisy labels for training a model in a different channel (e.g., an audio speaker model). Finally, we demonstrate this search capability by using images from the Labeled Faces in the Wild (LFW) dataset to query audio files extracted from the YouTube Faces (YTF) dataset.
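The grounding idea described above can be sketched in miniature. The code below is an illustrative toy, not the paper's implementation: it uses synthetic Gaussian stand-ins for face and voice embeddings (all arrays, centroids, and helper names such as `nearest` and `search_audio` are hypothetical), labels each paired clip with its nearest visual identity, trains a trivial per-identity audio "speaker model" from those noisy labels alone, and then answers an image query with ranked audio clips.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for real embeddings (hypothetical data): each of
# n_ids identities has a face-embedding centroid and a voice-embedding
# centroid in a dim-dimensional space.
n_ids, dim = 5, 16
face_centroids = rng.normal(size=(n_ids, dim))
voice_centroids = rng.normal(size=(n_ids, dim))

# Video tracks yield paired (face, audio) samples; the visual channel
# will supply a noisy identity label for the audio channel.
n_clips = 200
true_id = rng.integers(0, n_ids, size=n_clips)
face_emb = face_centroids[true_id] + 0.1 * rng.normal(size=(n_clips, dim))
audio_emb = voice_centroids[true_id] + 0.1 * rng.normal(size=(n_clips, dim))

def nearest(x, centroids):
    # Index of the centroid closest to x in Euclidean distance.
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# Step 1: ground each clip with a visual identity label by matching its
# face embedding to the nearest known face centroid (noisy labels).
visual_labels = np.array([nearest(f, face_centroids) for f in face_emb])

# Step 2: train a trivial audio speaker model (one voice centroid per
# label) using only the visual labels, never the true identities.
speaker_model = np.stack([audio_emb[visual_labels == k].mean(axis=0)
                          for k in range(n_ids)])

# Step 3: cross-media search: map an image query to an identity, then
# rank audio clips by distance to that identity's voice model.
def search_audio(query_face):
    k = nearest(query_face, face_centroids)
    d = np.linalg.norm(audio_emb - speaker_model[k], axis=1)
    return np.argsort(d)  # clip indices, best match first

hits = search_audio(face_centroids[2])
print(true_id[hits[:5]])
```

In a real system the nearest-centroid steps would be replaced by a face recognizer and a trained speaker model, and the visual labels would be genuinely noisy, but the pipeline shape is the same: visual identity supervises the audio model, which then serves cross-media queries.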