Publications

Refine Results

(Filters Applied) Clear All

Multi-modal audio, video and physiological sensor learning for continuous emotion prediction

Summary

The automatic determination of emotional state from multimedia content is an inherently challenging problem with a broad range of applications including biomedical diagnostics, multimedia retrieval, and human computer interfaces. The Audio Video Emotion Challenge (AVEC) 2016 provides a well-defined framework for developing and rigorously evaluating innovative approaches for estimating the arousal and valence states of emotion as a function of time. It presents the opportunity for investigating multimodal solutions that include audio, video, and physiological sensor signals. This paper provides an overview of our AVEC Emotion Challenge system, which uses multi-feature learning and fusion across all available modalities. It includes a number of technical contributions, including the development of novel high- and low-level features for modeling emotion in the audio, video, and physiological channels. Low-level features include modeling arousal in audio with minimal prosodic-based descriptors. High-level features are derived from supervised and unsupervised machine learning approaches based on sparse coding and deep learning. Finally, a state space estimation approach is applied for score fusion that demonstrates the importance of exploiting the time-series nature of the arousal and valence states. The resulting system outperforms the baseline systems [10] on the test evaluation set with an achieved Concordant Correlation Coefficient (CCC) for arousal of 0.770 vs 0.702 (baseline) and for valence of 0.687 vs 0.638. Future work will focus on exploiting the time-varying nature of individual channels in the multi-modal framework.
READ LESS

Summary

The automatic determination of emotional state from multimedia content is an inherently challenging problem with a broad range of applications including biomedical diagnostics, multimedia retrieval, and human computer interfaces. The Audio Video Emotion Challenge (AVEC) 2016 provides a well-defined framework for developing and rigorously evaluating innovative approaches for estimating the...

READ MORE

Detecting depression using vocal, facial and semantic communication cues

Summary

Major depressive disorder (MDD) is known to result in neurophysiological and neurocognitive changes that affect control of motor, linguistic, and cognitive functions. MDD's impact on these processes is reflected in an individual's communication via coupled mechanisms: vocal articulation, facial gesturing and choice of content to convey in a dialogue. In particular, MDD-induced neurophysiological changes are associated with a decline in dynamics and coordination of speech and facial motor control, while neurocognitive changes influence dialogue semantics. In this paper, biomarkers are derived from all of these modalities, drawing first from previously developed neurophysiologically motivated speech and facial coordination and timing features. In addition, a novel indicator of lower vocal tract constriction in articulation is incorporated that relates to vocal projection. Semantic features are analyzed for subject/avatar dialogue content using a sparse coded lexical embedding space, and for contextual clues related to the subject's present or past depression status. The features and depression classification system were developed for the 6th International Audio/Video Emotion Challenge (AVEC), which provides data consisting of audio, video-based facial action units, and transcribed text of individuals communicating with the human-controlled avatar. A clinical Patient Health Questionnaire (PHQ) score and binary depression decision are provided for each participant. PHQ predictions were obtained by fusing outputs from a Gaussian staircase regressor for each feature set, with results on the development set of mean F1=0.81, RMSE=5.31, and MAE=3.34. These compare favorably to the challenge baseline development results of mean F1=0.73, RMSE=6.62, and MAE=5.52. On test set evaluation, our system obtained a mean F1=0.70, which is similar to the challenge baseline test result. Future work calls for consideration of joint feature analyses across modalities in an effort to detect neurological disorders based on the interplay of motor, linguistic, affective, and cognitive components of communication.
READ LESS

Summary

Major depressive disorder (MDD) is known to result in neurophysiological and neurocognitive changes that affect control of motor, linguistic, and cognitive functions. MDD's impact on these processes is reflected in an individual's communication via coupled mechanisms: vocal articulation, facial gesturing and choice of content to convey in a dialogue. In...

READ MORE

How deep neural networks can improve emotion recognition on video data

Published in:
ICIP: 2016 IEEE Int. Conf. on Image Processing, 25-28 September 2016.

Summary

We consider the task of dimensional emotion recognition on video data using deep learning. While several previous methods have shown the benefits of training temporal neural network models such as recurrent neural networks (RNNs) on hand-crafted features, few works have considered combining convolutional neural networks (CNNs) with RNNs. In this work, we present a system that performs emotion recognition on video data using both CNNs and RNNs, and we also analyze how much each neural network component contributes to the system's overall performance. We present our findings on videos from the Audio/Visual+Emotion Challenge (AV+EC2015). In our experiments, we analyze the effects of several hyperparameters on overall performance while also achieving superior performance to the baseline and other competing methods.
READ LESS

Summary

We consider the task of dimensional emotion recognition on video data using deep learning. While several previous methods have shown the benefits of training temporal neural network models such as recurrent neural networks (RNNs) on hand-crafted features, few works have considered combining convolutional neural networks (CNNs) with RNNs. In this...

READ MORE

Relation of automatically extracted formant trajectories with intelligibility loss and speaking rate decline in amyotrophic lateral sclerosis

Summary

Effective monitoring of bulbar disease progression in persons with amyotrophic lateral sclerosis (ALS) requires rapid, objective, automatic assessment of speech loss. The purpose of this work was to identify acoustic features that aid in predicting intelligibility loss and speaking rate decline in individuals with ALS. Features were derived from statistics of the first (F1) and second (F2) formant frequency trajectories and their first and second derivatives. Motivated by a possible link between components of formant dynamics and specific articulator movements, these features were also computed for low-pass and high-pass filtered formant trajectories. When compared to clinician-rated intelligibility and speaking rate assessments, F2 features, particularly mean F2 speed and a novel feature, mean F2 acceleration, were most strongly correlated with intelligibility and speaking rate, respectively (Spearman correlations > 0.70, p < 0.0001). These features also yielded the best predictions in regression experiments (r > 0.60, p < 0.0001). Comparable results were achieved using low-pass filtered F2 trajectory features, with higher correlations and lower prediction errors achieved for speaking rate over intelligibility. These findings suggest information can be exploited in specific frequency components of formant trajectories, with implications for automatic monitoring of ALS.
READ LESS

Summary

Effective monitoring of bulbar disease progression in persons with amyotrophic lateral sclerosis (ALS) requires rapid, objective, automatic assessment of speech loss. The purpose of this work was to identify acoustic features that aid in predicting intelligibility loss and speaking rate decline in individuals with ALS. Features were derived from statistics...

READ MORE

Relating estimated cyclic spectral peak frequency to measured epilarynx length using magnetic resonance imaging

Published in:
INTERSPEECH 2016: 16th Annual Conf. of the Int. Speech Communication Assoc., 8-12 September 2016.

Summary

The epilarynx plays an important role in speech production, carrying information about the individual speaker and manner of articulation. However, precise acoustic behavior of this lower vocal tract structure is difficult to establish. Focusing on acoustics observable in natural speech, recent spectral processing techniques isolate a unique resonance with characteristics of the epilarynx previously shown via simulation, specifically cyclicity (i.e. energy differences between the closed and open phases of the glottal cycle) in a 3-5kHz region observed across vowels. Using Magnetic Resonance Imaging (MRI), the present work relates this estimated cyclic peak frequency to measured epilarynx length. Assuming a simple quarter wavelength relationship, the cavity length estimated from the cyclic peak frequency is shown to be directly proportional (linear fit slope =1.1) and highly correlated (p = 0.85, pval<10^?4) to the measured epilarynx length across speakers. Results are discussed, as are implications in speech science and application domains.
READ LESS

Summary

The epilarynx plays an important role in speech production, carrying information about the individual speaker and manner of articulation. However, precise acoustic behavior of this lower vocal tract structure is difficult to establish. Focusing on acoustics observable in natural speech, recent spectral processing techniques isolate a unique resonance with characteristics...

READ MORE

A vocal modulation model with application to predicting depression severity

Published in:
13th IEEE Int. Conf. on Wearable and Implantable Body Sensor Networks, BSN 2016, 14-17 June 2016.

Summary

Speech provides a potential simple and noninvasive "on-body" means to identify and monitor neurological diseases. Here we develop a model for a class of vocal biomarkers exploiting modulations in speech, focusing on Major Depressive Disorder (MDD) as an application area. Two model components contribute to the envelope of the speech waveform: amplitude modulation (AM) from respiratory muscles, and AM from interaction between vocal tract resonances (formants) and frequency modulation in vocal fold harmonics. Based on the model framework, we test three methods to extract envelopes capturing these modulations of the third formant for synthesized sustained vowels. Using subsequent modulation features derived from the model, we predict MDD severity scores with a Gaussian Mixture Model. Performing global optimization over classifier parameters and number of principal components, we evaluate performance of the features by examining the root-mean-squared error (RMSE), mean absolute error (MAE), and Spearman correlation between the actual and predicted MDD scores. We achieved RMSE and MAE values 10.32 and 8.46, respectively (Spearman correlation=0.487, p<0.001), relative to a baseline RMSE of 11.86 and MAE of 10.05, obtained by predicting the mean MDD severity score. Ultimately, our model provides a framework for detecting and monitoring vocal modulations that could also be applied to other neurological diseases.
READ LESS

Summary

Speech provides a potential simple and noninvasive "on-body" means to identify and monitor neurological diseases. Here we develop a model for a class of vocal biomarkers exploiting modulations in speech, focusing on Major Depressive Disorder (MDD) as an application area. Two model components contribute to the envelope of the speech...

READ MORE

Iris biometric security challenges and possible solutions: for your eyes only? Using the iris as a key

Summary

Biometrics were originally developed for identification, such as for criminal investigations. More recently, biometrics have been also utilized for authentication. Most biometric authentication systems today match a user's biometric reading against a stored reference template generated during enrollment. If the reading and the template are sufficiently close, the authentication is considered successful and the user is authorized to access protected resources. This binary matching approach has major inherent vulnerabilities. An alternative approach to biometric authentication proposes to use fuzzy extractors (also known as biometric cryptosystems), which derive cryptographic keys from noisy sources, such as biometrics. In theory, this approach is much more robust and can enable cryptographic authorization. Unfortunately, for many biometrics that provide high-quality identification, fuzzy extractors provide no security guarantees. This gap arises in part because of an objective mismatch. The quality of a biometric identification is typically measured using false match rate (FMR) versus false nonmatch rate (FNMR). As a result, biometrics have been extensively optimized for this metric. However, this metric says little about the suitability of a biometric for key derivation. In this article, we illustrate a metric that can be used to optimize biometrics for authentication. Using iris biometrics as an example, we explore possible directions for improving processing and representation according to this metric. Finally, we discuss why strong biometric authentication remains a challenging problem and propose some possible future directions for addressing these challenges.
READ LESS

Summary

Biometrics were originally developed for identification, such as for criminal investigations. More recently, biometrics have been also utilized for authentication. Most biometric authentication systems today match a user's biometric reading against a stored reference template generated during enrollment. If the reading and the template are sufficiently close, the authentication is...

READ MORE

Robust face recognition-based search and retrieval across image stills and video

Author:
Published in:
HST 2015, IEEE Int. Symp. on Technologies for Homeland Security, 14-16 April 2015.

Summary

Significant progress has been made in addressing face recognition channel, sensor, and session effects in both still images and video. These effects include the classic PIE (pose, illumination, expression) variation, as well as variations in other characteristics such as age and facial hair. While much progress has been made, there has been little formal work in characterizing and compensating for the intrinsic differences between faces in still images and video frames. These differences include that faces in still images tend to have neutral expressions and frontal poses, while faces in videos tend to have more natural expressions and poses. Typically faces in videos are also blurrier, have lower resolution, and are framed differently than faces in still images. Addressing these issues is important when comparing face images between still images and video frames. Also, face recognition systems for video applications often rely on legacy face corpora of still images and associated meta data (e.g. identifying information, landmarks) for development, which are not formally compensated for when applied to the video domain. In this paper we will evaluate the impact of channel effects on face recognition across still images and video frames for the search and retrieval task. We will also introduce a novel face recognition approach for addressing the performance gap across these two respective channels. The datasets and evaluation protocols from the Labeled Faces in the Wild (LFW) still image and YouTube Faces (YTF) video corpora will be used for the comparative characterization and evaluation. Since the identities of subjects in the YTF corpora are a subset of those in the LFW corpora, this enables an apples-to-apples comparison of in-corpus and cross-corpora face comparisons.
READ LESS

Summary

Significant progress has been made in addressing face recognition channel, sensor, and session effects in both still images and video. These effects include the classic PIE (pose, illumination, expression) variation, as well as variations in other characteristics such as age and facial hair. While much progress has been made, there...

READ MORE

Joint audio-visual mining of uncooperatively collected video: FY14 Line-Supported Information, Computation, and Exploitation Program

Summary

The rate at which video is being created and gathered is rapidly accelerating as access to means of production and distribution expand. This rate of increase, however, is greatly outpacing the development of content-based tools to help users sift through this unstructured, multimedia data. The need for such technologies becomes more acute when considering their potential value in critical, media-rich government applications such as Seized Media Analysis, Social Media Forensics, and Foreign Media Monitoring. A fundamental challenge in developing technologies in these application areas is that they are typically in low-resource data domains. Low-resource domains are ones where the lack of ground-truth labels and statistical support prevent the direct application of traditional machine learning approaches. To help bridge this capability gap, the Joint Audio and Visual Mining of Uncooperatively Collected Video ICE Line Program (2236-1301) is developing new technologies for better content-based search, summarization, and browsing of large collections of unstructured, uncooperatively collected multimedia. In particular, this effort seeks to improve capabilities in video understanding by jointly exploiting time aligned audio, visual, and text information, an approach which has been underutilized in both the academic and commercial communities. Exploiting subtle connections between and across multiple modalities in low-resource multimedia data helps enable deeper video understanding, and in some cases provides new capability where it previously didn't exist. This report outlines work done in Fiscal Year 2014 (FY14) by the cross-divisional, interdisciplinary team tasked to meet these objectives. In the following sections, we highlight technologies developed in FY14 to support efficient Query-by-Example, Attribute, Keyword Search and Cross-Media Exploration and Summarization. Additionally, we preview work proposed for Fiscal Year 2015 as well as summarize our external sponsor interactions and publications/presentations.
READ LESS

Summary

The rate at which video is being created and gathered is rapidly accelerating as access to means of production and distribution expand. This rate of increase, however, is greatly outpacing the development of content-based tools to help users sift through this unstructured, multimedia data. The need for such technologies becomes...

READ MORE

Audio-visual identity grounding for enabling cross media search

Author:
Published in:
IEEE Computer Vision and Pattern Recognition Big Data Workshop, 23 June 2014.

Summary

Automatically searching for media clips in large heterogeneous datasets is an inherently difficult challenge, and nearly impossibly so when searching across distinct media types (e.g. finding audio clips that match an image). In this paper we introduce the exploitation of identity grounding for enabling this cross media search and exploration capability. Through the use of grounding we leverage one media channel (e.g. visual identity) as a noisy label for training a model in a different channel (e.g. audio speaker model). Finally, we demonstrate this search capability using images from the Labeled Faces in the Wild (LFW) dataset to query audio files that have been extracted from the YouTube Faces (YTF) dataset.
READ LESS

Summary

Automatically searching for media clips in large heterogeneous datasets is an inherently difficult challenge, and nearly impossibly so when searching across distinct media types (e.g. finding audio clips that match an image). In this paper we introduce the exploitation of identity grounding for enabling this cross media search and exploration...

READ MORE

Showing Results

1-10 of 17