Publications



Talking head detection by likelihood-ratio test

Published in:
2nd Int. Workshop on Speech, Language and Audio in Multimedia, SLAM, 11-12 September 2014.

Summary

Detecting accurately when a person whose face is visible in an audio-visual medium is the audible speaker is an enabling technology with a number of useful applications, including fused audio/visual (AV) speaker recognition, AV segmentation and diarization, and AV synchronization. The likelihood-ratio test formulation and feature signal processing employed here allow the use of high-dimensional feature sets in the audio and visual domains, and the approach appears to have good detection performance for AV segments as short as a few seconds. Computation costs for the resulting algorithm are modest, typically much less than those of the front-end face detection system. While the resulting system requires model training, only true-condition training (i.e., video in which the talking speaker is audible) is required.
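As a rough illustration of the likelihood-ratio test formulation (toy diagonal-Gaussian models and synthetic 2-D features, not the paper's actual feature processing), the segment-level decision can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_likelihood(frames, mean, var):
    # Sum of per-frame diagonal-Gaussian log-likelihoods.
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (frames - mean) ** 2 / var)))

def talking_head_score(frames, talk_mean, talk_var, bg_mean, bg_var):
    # Likelihood-ratio test statistic over a short AV segment:
    # log p(X | talking) - log p(X | not talking).
    return (log_likelihood(frames, talk_mean, talk_var)
            - log_likelihood(frames, bg_mean, bg_var))

# Toy 2-D audio-visual features drawn near the "talking" model.
frames = rng.normal(loc=1.0, scale=1.0, size=(100, 2))
score = talking_head_score(frames,
                           talk_mean=np.ones(2), talk_var=np.ones(2),
                           bg_mean=np.zeros(2), bg_var=np.ones(2))
is_talking = score > 0.0  # threshold tuned on held-out data in practice
```

A real system would accumulate the statistic over the frames attributed to one detected face and compare against a threshold chosen for a target operating point.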

Autoregressive HMM speech synthesis

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 25-30 March 2012, pp. 4021-4.

Summary

Autoregressive HMM modeling of spectral features has been proposed as a replacement for standard HMM speech synthesis. The merits of the approach are explored, and methods for enforcing stability of the estimated predictor coefficients are presented. Rather than estimating the autoregressive HMM parameters directly, greater synthesis accuracy appears to be obtained by using a more traditional HMM recognition system to compute state-level posterior probabilities, which are then used to accumulate the statistics from which the predictor coefficients are estimated. The result is a simplified mathematical framework that requires no modeling of derivatives and still provides smooth synthesis without unnatural spectral discontinuities. The resulting synthesis algorithm involves no matrix solves, may be formulated causally, and appears to yield quality very similar to that of more traditional HMM synthesis approaches. This paper describes the implementation of a complete autoregressive HMM LVCSR system, its application to synthesis, and preliminary synthesis results.
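One standard way to enforce stability of estimated predictor coefficients (a sketch of the general technique, not necessarily the paper's method) is to reflect any pole of the AR polynomial that lies outside the unit circle back inside:

```python
import numpy as np

def stabilize_ar(a):
    # AR predictor x[t] = a[0] x[t-1] + ... + a[p-1] x[t-p] + e[t];
    # its poles are the roots of z^p - a[0] z^(p-1) - ... - a[p-1].
    poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    roots = np.roots(poly)
    # Reflect any pole outside the unit circle to its conjugate
    # reciprocal, which preserves the filter's magnitude response.
    roots = np.where(np.abs(roots) > 1.0, 1.0 / np.conj(roots), roots)
    return -np.real(np.poly(roots))[1:]

a_stable = stabilize_ar([1.8, -0.5])  # original model has a pole near 1.46
poles = np.roots(np.concatenate(([1.0], -a_stable)))
```

Clamping reflection coefficients during a Levinson-style recursion is an equally common alternative.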

Kalman filter based speech synthesis

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 15 March 2010, pp. 4618-4621.

Summary

Preliminary results are reported from a very simple speech-synthesis system based on clustered-diphone, Kalman-filter-based modeling of line-spectral-frequency features. Parameters were estimated using maximum-likelihood EM training, with a constraint enforced that prevented eigenvalue magnitudes of the transition matrix from exceeding 1. Frames of training data were assigned diphone unit labels by forced alignment with an HMM recognition system. The HMM cluster tree was also used for Kalman filter unit cluster assignments. The result is a simple synthesis system that has few parameters, synthesizes intelligible speech without audible discontinuities, and can be adapted using MLLR techniques to support synthesis of a broad range of speakers from a single base model with small amounts of training data, which makes it interesting for embedded synthesis applications.
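One simple way to impose the eigenvalue-magnitude constraint on the state-transition matrix (an illustrative projection step, not necessarily the constraint used in the paper) is to uniformly shrink the matrix after each M-step whenever its spectral radius exceeds the bound:

```python
import numpy as np

def constrain_transition(A, max_mag=0.999):
    # Shrink the transition matrix uniformly whenever its largest
    # eigenvalue magnitude (spectral radius) exceeds max_mag.
    radius = np.max(np.abs(np.linalg.eigvals(A)))
    if radius > max_mag:
        A = A * (max_mag / radius)
    return A

A = np.array([[1.2, 0.3],
              [0.0, 0.9]])   # spectral radius 1.2: unstable dynamics
A_c = constrain_transition(A)
```

Keeping all eigenvalue magnitudes below 1 guarantees that the predicted state (and hence the synthesized spectral trajectory) cannot diverge over long segments.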

Nuisance attribute projection

Published in:
Chapter in Speech Communication, May 2007.

Summary

Cross-channel degradation is one of the significant challenges facing speaker recognition systems. We study this problem in the support vector machine (SVM) context and nuisance variable compensation in high-dimensional spaces more generally. We present an approach to nuisance variable compensation by removing nuisance attribute-related dimensions in the SVM expansion space via projections. Training to remove these dimensions is accomplished via an eigenvalue problem. The eigenvalue problem attempts to reduce multisession variation for the same speaker, reduce different channel effects, and increase "distance" between different speakers. Experiments show significant improvement in performance for the cross-channel case.
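The projection step can be sketched as follows (an illustration, not the paper's implementation: in practice the orthonormal nuisance basis comes from the eigenvalue problem described above, whereas here it is random):

```python
import numpy as np

rng = np.random.default_rng(0)

def nap_projection(vectors, nuisance_basis):
    # Remove the nuisance subspace: x -> (I - U U^T) x, where the
    # columns of U are orthonormal nuisance directions.
    U = nuisance_basis
    return vectors - (vectors @ U) @ U.T

# Toy setup: 5-D SVM expansion vectors, one nuisance direction.
U, _ = np.linalg.qr(rng.normal(size=(5, 1)))
X = rng.normal(size=(10, 5))
X_comp = nap_projection(X, U)
# The compensated vectors have no component along the nuisance direction.
```

Because the projection is applied identically to training and test expansion vectors, the SVM kernel computed on compensated vectors is blind to the removed channel directions.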

The 2004 MIT Lincoln Laboratory speaker recognition system

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 19-23 March 2005, pp. I-177 - I-180.

Summary

The MIT Lincoln Laboratory submission for the 2004 NIST Speaker Recognition Evaluation (SRE) was built upon seven core systems using speaker information from short-term acoustics, pitch and duration prosodic behavior, and phoneme and word usage. These different levels of information were modeled and classified using Gaussian Mixture Models, Support Vector Machines, and N-gram language models, and were combined using a single-layer perceptron fuser. The 2004 SRE used a new multi-lingual, multi-channel speech corpus that provided a challenging speaker detection task for the above systems. In this paper we describe the core systems used and provide an overview of their performance on the 2004 SRE detection tasks.
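Single-layer perceptron fusion amounts to a trained weighted sum of the per-system scores followed by a squashing nonlinearity; the weights and scores below are hypothetical, purely to show the shape of the computation:

```python
import numpy as np

def fuse_scores(system_scores, weights, bias):
    # Single-layer perceptron fusion: weighted sum of per-system
    # detection scores passed through a sigmoid.
    z = system_scores @ weights + bias
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for seven subsystem scores; in practice the
# weights and bias are trained on held-out development data.
w = np.array([0.30, 0.20, 0.15, 0.10, 0.10, 0.10, 0.05])
s = np.array([1.2, -0.4, 0.8, 0.1, 0.0, 0.5, -0.2])
fused = fuse_scores(s, w, bias=-0.1)
```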

Channel compensation for SVM speaker recognition

Published in:
Odyssey, The Speaker and Language Recognition Workshop, 31 May - 3 June 2004.

Summary

One of the major remaining challenges to improving accuracy in state-of-the-art speaker recognition algorithms is reducing the impact of channel and handset variations on system performance. For Gaussian Mixture Model based speaker recognition systems, a variety of channel-adaptation techniques are known and available for adapting models between different channel conditions, but for the much more recent Support Vector Machine (SVM) based approaches to this problem, much less is known about the best way to handle this issue. In this paper we explore techniques that are specific to the SVM framework in order to derive fully non-linear channel compensations. The result is a system that is less sensitive to specific kinds of labeled channel variations observed in training.

Beyond cepstra: exploiting high-level information in speaker recognition

Summary

Traditionally, speaker recognition techniques have focused on short-term, low-level acoustic information such as cepstral features extracted over 20-30 ms windows of speech. But speech is a complex behavior conveying more information about the speaker than merely the sounds that are characteristic of his vocal apparatus. This higher-level information includes speaker-specific prosodics, pronunciations, word usage, and conversational style. In this paper, we review techniques to extract and apply these sources of high-level information, with results from the NIST 2003 Extended Data Task.
