Publications

Linear prediction modulation filtering for speaker recognition of reverberant speech

Published in:
Odyssey 2012, The Speaker and Language Recognition Workshop, 25-28 June 2012.

Summary

This paper proposes a framework for spectral enhancement of reverberant speech based on inversion of the modulation transfer function. All-pole modeling of modulation spectra of clean and degraded speech is utilized to derive the linear prediction inverse modulation transfer function (LP-IMTF) solution as a low-order IIR filter in the modulation envelope domain. By considering spectral estimation under speech presence uncertainty, speech presence probabilities are derived for the case of reverberation. Aside from enhancement, the LP-IMTF framework allows for blind estimation of reverberation time by extracting a minimum phase approximation of the short-time spectral channel impulse response. The proposed speech enhancement method is used as a front-end processing step for speaker recognition. When applied to the microphone condition of the NIST SRE 2010 with artificially added reverberation, the proposed spectral enhancement method yields significant improvements across a variety of performance metrics.
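
As a rough illustration of the all-pole modeling step mentioned in the summary, the sketch below fits linear prediction coefficients to a sequence of autocorrelation values using the standard Levinson-Durbin recursion. It is a generic textbook routine, not the paper's LP-IMTF derivation; the function name `levinson_durbin` and the toy autocorrelation values are assumptions made for the example.

```python
def levinson_durbin(r, order):
    """Fit an all-pole (linear prediction) model from autocorrelations.

    r     : autocorrelation values r[0], ..., r[order]
    order : model order p
    Returns (a, err), where a = [1, a1, ..., ap] is the prediction-error
    filter and err is the final prediction error power.
    Hypothetical helper for illustration only.
    """
    a = [1.0] + [0.0] * order
    err = float(r[0])
    for i in range(1, order + 1):
        # Correlation of the current predictor with lag i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for stage i
        prev = a[:]
        # Update a[1..i-1] from the previous stage, then append k.
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


# Toy input: autocorrelation of a first-order process with pole at 0.5.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this toy input the recursion recovers a single pole at 0.5 (second coefficient zero), which is the behavior expected of a low-order all-pole fit.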

The MITLL NIST LRE 2011 language recognition system

Summary

This paper presents a description of the MIT Lincoln Laboratory (MITLL) language recognition system developed for the NIST 2011 Language Recognition Evaluation (LRE). The submitted system consisted of a fusion of four core classifiers, three based on spectral similarity and one based on tokenization. Additional system improvements were achieved following the submission deadline. In a major departure from previous evaluations, the 2011 LRE task focused on closed-set pairwise performance so as to emphasize a system's ability to distinguish confusable language pairs. Results are presented for the 24-language confusable pair task at test utterance durations of 30, 10, and 3 seconds. Results are also shown using the standard detection metrics (DET, minDCF), and it is demonstrated that the previous metrics adequately cover difficult pair performance. On the 30 s 24-language confusable pair task, the submitted and post-evaluation systems achieved average costs of 0.079 and 0.070 and standard detection costs of 0.038 and 0.033.

A stochastic system for large network growth

Published in:
IEEE Signal Process. Lett., Vol. 19, No. 6, June 2012, pp. 356-359.

Summary

This letter proposes a new model for preferential attachment in dynamic directed networks. This model consists of a linear time-invariant system that uses past observations to predict future attachment rates, and an innovation noise process that induces growth on vertices that previously had no attachments. Analyzing a large citation network in this context, we show that the proposed model fits the data better than existing preferential attachment models. An analysis of the noise in the dataset reveals power-law degree distributions often seen in large networks, and polynomial decay with respect to age in the probability of citing yet-uncited documents.

FY11 Line-Supported Bio-Next Program - Multi-modal Early Detection Interactive Classifier (MEDIC) for mild traumatic brain injury (mTBI) triage

Summary

The Multi-modal Early Detection Interactive Classifier (MEDIC) is a triage system designed to enable rapid assessment of mild traumatic brain injury (mTBI) when access to expert diagnosis is limited, as in a battlefield setting. MEDIC is based on supervised classification that requires three fundamental components to function correctly: data, features, and truth. The MEDIC system can act as a data collection device in addition to being an assessment tool. Therefore, it enables a solution to one of the fundamental challenges in understanding mTBI: the lack of useful data. The vision of MEDIC is to fuse results from stimulus tests in each of four modalities (auditory, ocular, vocal, and intracranial pressure) and provide them to a classifier. With appropriate data for training, the MEDIC classifier is expected to provide an immediate decision of whether the subject has a strong likelihood of having sustained an mTBI and therefore requires an expert diagnosis from a neurologist. The tests within each modality were designed to balance the capacity of objective assessment and the maturity of the underlying technology against the ability to distinguish injured from non-injured subjects according to published results. Selection of existing modalities and underlying features represents the best available, low cost, portable technology with a reasonable chance of success.

Autoregressive HMM speech synthesis

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 25-30 March 2012, pp. 4021-4.

Summary

Autoregressive HMM modeling of spectral features has been proposed as a replacement for standard HMM speech synthesis. The merits of the approach are explored, and methods for enforcing stability of the estimated predictor coefficients are presented. It appears that rather than directly estimating autoregressive HMM parameters, greater synthesis accuracy is obtained by estimating the autoregressive HMM parameters by using a more traditional HMM recognition system to compute state-level posterior probabilities that are then used to accumulate statistics to estimate predictor coefficients. The result is a simplified mathematical framework that requires no modeling of derivatives and still provides smooth synthesis without unnatural spectral discontinuities. The resulting synthesis algorithm involves no matrix solves and may be formulated causally, and appears to result in quality very similar to that of more traditional HMM synthesis approaches. This paper describes the implementation of a complete Autoregressive HMM LVCSR system and its application for synthesis, and describes the preliminary synthesis results.

Goodness-of-fit statistics for anomaly detection in Chung-Lu random graphs

Published in:
ICASSP 2012, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 25-30 March 2012, pp. 3265-8.

Summary

Anomaly detection in graphs is a relevant problem in numerous applications. When determining whether an observation is anomalous with respect to the model of typical behavior, the notion of "goodness of fit" is important. This notion, however, is not well understood in the context of graph data. In this paper, we propose three goodness-of-fit statistics for Chung-Lu random graphs, and analyze their efficacy in discriminating graphs generated by the Chung-Lu model from those with anomalous topologies. In the results of a Monte Carlo simulation, we see that the most powerful statistic for anomaly detection depends on the type of anomaly, suggesting that a hybrid statistic would be the most powerful.
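
For readers unfamiliar with the Chung-Lu model named in the summary, the sketch below shows its usual definition: each vertex has an expected degree, and an edge between two vertices appears independently with probability proportional to the product of their expected degrees. The function names, the capping at 1, and the exclusion of self-loops when sampling are assumptions made for this example; the paper's three goodness-of-fit statistics are not reproduced here.

```python
import random


def chung_lu_edge_probs(weights):
    """Edge probability matrix p[i][j] = min(1, w_i * w_j / sum(w)).

    weights: expected degree of each vertex (hypothetical input name).
    """
    total = float(sum(weights))
    n = len(weights)
    return [
        [min(1.0, weights[i] * weights[j] / total) for j in range(n)]
        for i in range(n)
    ]


def sample_chung_lu(weights, seed=None):
    """Draw one undirected graph as a list of edges (i, j) with i < j.

    Self-loops are skipped in this sketch, so realized expected degrees
    fall slightly below the target weights.
    """
    rng = random.Random(seed)
    probs = chung_lu_edge_probs(weights)
    n = len(weights)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < probs[i][j]:
                edges.append((i, j))
    return edges


expected_degrees = [2, 2, 1, 1, 1, 1]
graph = sample_chung_lu(expected_degrees, seed=0)
```

When no probability is capped, each row of the probability matrix sums to the vertex's expected degree, which is the property a goodness-of-fit statistic can compare against an observed graph.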

Topic identification based extrinsic evaluation of summarization techniques applied to conversational speech

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 25-30 March 2012, pp. 5073-6.

Summary

Document summarization algorithms are most commonly evaluated according to the intrinsic quality of the summaries they produce. An alternate approach is to examine the extrinsic utility of a summary, measured by the ability of the summary to aid a human in the completion of a specific task. In this paper, we use topic identification as a proxy for relevancy determination in the context of an information retrieval task, and a summary is deemed effective if it enables a user to determine the topical content of a retrieved document. We utilize Amazon's Mechanical Turk service to perform a large-scale human study contrasting four different summarization systems applied to conversational speech from the Fisher Corpus. We show that these results appear to be correlated with the performance of an automated topic identification system, and argue that this automated system can act as a low-cost proxy for a human evaluation during the development stages of a summarization system.

Topic modeling for spoken documents using only phonetic information

Published in:
ASRU 2011, IEEE Workshop on Automatic Speech Recognition & Understanding, 11-15 December 2011, pp. 395-400.

Summary

This paper explores both supervised and unsupervised topic modeling for spoken audio documents using only phonetic information. In cases where word-based recognition is unavailable or infeasible, phonetic information can be used to indirectly learn and capture information provided by topically relevant lexical items. In some situations, a lack of transcribed data can prevent supervised training of a same-language phonetic recognition system. In these cases, phonetic recognition can use cross-language models or self-organizing units (SOUs) learned in a completely unsupervised fashion. This paper presents recent improvements in topic modeling using only phonetic information. We present new results using recently developed techniques for discriminative training for topic identification used in conjunction with recent improvements in SOU learning. A preliminary examination of the use of unsupervised latent topic modeling for unsupervised discovery of topics and topically relevant lexical items from phonetic information is also presented.

Investigating acoustic correlates of human vocal fold vibratory phase asymmetry through modeling and laryngeal high-speed videoendoscopy

Published in:
J. Acoust. Soc. Am., Vol. 130, No. 6, December 2011, pp. 3999-4009.

Summary

Vocal fold vibratory asymmetry is often associated with inefficient sound production through its impact on source spectral tilt. This association is investigated in both a computational voice production model and a group of 47 human subjects. The model provides indirect control over the degree of left-right phase asymmetry within a nonlinear source-filter framework, and high-speed videoendoscopy provides in vivo measures of vocal fold vibratory asymmetry. Source spectral tilt measures are estimated from the inverse-filtered spectrum of the simulated and recorded radiated acoustic pressure. As expected, model simulations indicated that increasing left-right phase asymmetry induces steeper spectral tilt. Subject data, however, reveal that none of the vibratory asymmetry measures correlates with spectral tilt measures. Probing further into physiological correlates of spectral tilt that might be affected by asymmetry, the glottal area waveform is parameterized to obtain measures of the open phase (open/plateau quotient) and closing phase (speed/closing quotient). Subjects' left-right phase asymmetry exhibits low, but statistically significant, correlations with speed quotient (r=0.45) and closing quotient (r=-0.39). Results call for future studies into the effect of asymmetric vocal fold vibration on glottal airflow and the associated impact on voice source spectral properties and vocal efficiency.

Face recognition despite missing information

Published in:
HST 2011, IEEE Int. Conf. on Technologies for Homeland Security, 15-17 November 2011, pp. 475-480.

Summary

Missing or degraded information continues to be a significant practical challenge facing automatic face representation and recognition. Generally, existing approaches seek either to generatively invert the degradation process or find discriminative representations that are immune to it. Ideally, the solution to this problem exists between these two perspectives. To this end, in this paper we show the efficacy of using probabilistic linear subspace models (in particular, variational probabilistic PCA) for both modeling and recognizing facial data under disguise or occlusion. From a discriminative perspective, we verify the efficacy of this approach for attenuating the effect of missing data due to disguise and non-linear speculars in several verification experiments. From a generative view, we show its usefulness in not only estimating missing information but also understanding facial covariates for image reconstruction. In addition, we present a least-squares connection to the maximum likelihood solution under missing data and show its intuitive connection to the geometry of the subspace learning problem.