Publications

Speaker recognition from coded speech in matched and mismatched conditions

Published in:
Proc. 2001: A Speaker Odyssey, The Speaker Recognition Workshop, 18-22 June 2001, pp. 115-20.

Summary

We investigate the effect of speech coding on automatic speaker recognition when training and testing conditions are matched and mismatched. Experiments use standard speech coding algorithms (GSM, G.729, G.723, MELP) and a speaker recognition system based on Gaussian mixture models adapted from a universal background model. There is little loss in recognition performance for toll-quality speech coders and slightly more loss when lower-quality speech coders are used. Speaker recognition from coded speech using handset-dependent score normalization is also examined; it significantly improves performance, particularly when there is a mismatch between training and testing conditions.

Speaker indexing in large audio databases using anchor models

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 7-11 May 2001, pp. 429-432.

Summary

This paper introduces the technique of anchor modeling in the applications of speaker detection and speaker indexing. The anchor modeling algorithm is refined by pruning the number of models needed. The system is applied to the speaker detection problem where its performance is shown to fall short of the state-of-the-art Gaussian Mixture Model with Universal Background Model (GMM-UBM) system. However, it is further shown that its computational efficiency lends itself to speaker indexing for searching large audio databases for desired speakers. Here, excessive computation may prohibit the use of the GMM-UBM recognition system. Finally, the paper presents a method for cascading anchor model and GMM-UBM detectors for speaker indexing. This approach benefits from the efficiency of anchor modeling and high accuracy of GMM-UBM recognition.

Interlingua-based broad-coverage Korean-to-English translation in CCLINC

Published in:
Proc. First Int. Conf. on Human Language Technology, 18-21 March 2001.

Summary

At MIT Lincoln Laboratory, we have been developing a Korean-to-English machine translation system, CCLINC (Common Coalition Language System at Lincoln Laboratory). The CCLINC Korean-to-English translation system consists of two core modules, language understanding and language generation, mediated by a language-neutral meaning representation called a semantic frame. The key features of the system include: (i) robust, efficient parsing of Korean (a verb-final language with overt case markers, relatively free word order, and frequent omissions of arguments); (ii) high-quality translation via word sense disambiguation and accurate word order generation of the target language; and (iii) rapid system development and porting to new domains via knowledge-based automated acquisition of grammars. Having been trained on Korean newspaper articles on "missiles" and "chemical biological warfare," the system produces translation output sufficient for understanding the content of the original document.

The use of dynamic segment scoring for language-independent question answering

Published in:
Proc. 1st Int. Conf. on Human Language Technology Research, HLT, 18-21 March 2001.

Summary

This paper presents a novel language-independent question/answering (Q/A) system based on natural language processing techniques, shallow query understanding, dynamic sliding window techniques, and statistical proximity distribution matching techniques. The performance of the proposed system using the latest Text REtrieval Conference (TREC-8) data was comparable to results reported by the top TREC-8 contenders.

The Lincoln speaker recognition system: NIST EVAL2000

Published in:
6th Int. Conf. on Spoken Language, ICSLP, 16-20 October 2000.

Summary

This paper presents an overview of the Lincoln Laboratory systems fielded for the 2000 NIST speaker recognition evaluation (SRE00). In addition to the standard one-speaker detection tasks, this year's evaluation, as in 1999, included multi-speaker spokes dealing with detection, tracking and segmentation. The design approach for the Lincoln system in SRE00 was to develop a set of core one-speaker detection and multi-speaker clustering tools that could be applied to all the tasks. This paper will describe these core systems, how they are applied to the SRE00 tasks and the results they produce. Additionally, a new channel normalization technique known as handset-dependent test-score norm (HTnorm) is introduced.

Estimation of handset nonlinearity with application to speaker recognition

Published in:
IEEE Trans. Speech Audio Process., Vol. 8, No. 5, September 2000, pp. 567-584.

Summary

A method is described for estimating telephone handset nonlinearity by matching the spectral magnitude of the distorted signal to the output of a nonlinear channel model, driven by an undistorted reference. This "magnitude-only" representation allows the model to directly match unwanted speech formants that arise over nonlinear channels and that are a potential source of degradation in speaker and speech recognition algorithms. As such, the method is particularly suited to algorithms that use only spectral magnitude information. The distortion model consists of a memoryless nonlinearity sandwiched between two finite-length linear filters. Nonlinearities considered include arbitrary finite-order polynomials and parametric sigmoidal functionals derived from a carbon-button handset model. Minimization of a mean-squared spectral magnitude distance with respect to model parameters relies on iterative estimation via a gradient descent technique. Initial work has demonstrated the importance of addressing handset nonlinearity, in addition to linear distortion, in speaker recognition over telephone channels. A nonlinear handset "mapping," applied to training or testing data to reduce mismatch between different types of handset microphone outputs, improves speaker verification performance relative to linear compensation only. Finally, a method is proposed to merge the mapper strategy with a method of likelihood score normalization (hnorm) for further mismatch reduction and speaker verification performance improvement.
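The distortion model named in this summary (a memoryless nonlinearity between two finite-length linear filters) can be illustrated with a minimal sketch. The function names, the choice of a polynomial nonlinearity, and all coefficient values are assumptions made for illustration; the sketch shows only the forward model and does not attempt the spectral-magnitude parameter estimation described in the paper.

```python
from typing import List, Sequence


def fir_filter(x: Sequence[float], h: Sequence[float]) -> List[float]:
    """Causal FIR filter: y[n] = sum_k h[k] * x[n - k], same length as x."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]
        y.append(acc)
    return y


def polynomial(u: Sequence[float], coeffs: Sequence[float]) -> List[float]:
    """Memoryless polynomial applied per sample: v = sum_i coeffs[i] * u**i."""
    return [sum(c * (s ** i) for i, c in enumerate(coeffs)) for s in u]


def handset_model(
    x: Sequence[float],
    pre_filter: Sequence[float],
    poly_coeffs: Sequence[float],
    post_filter: Sequence[float],
) -> List[float]:
    """Hypothetical forward model: filter -> memoryless nonlinearity -> filter."""
    u = fir_filter(x, pre_filter)       # first linear filter
    v = polynomial(u, poly_coeffs)      # memoryless nonlinearity
    return fir_filter(v, post_filter)   # second linear filter
```

With `pre_filter=[1.0]`, `poly_coeffs=[0.0, 1.0]` and `post_filter=[1.0]` the model reduces to the identity, which is a convenient sanity check before trying other coefficients.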

Speaker recognition using G.729 speech codec parameters

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. II, 5-9 June 2000, pp. 1089-1092.

Summary

Experiments in Gaussian-mixture-model speaker recognition from mel-filter bank energies (MFBs) of the G.729 codec all-pole spectral envelope showed significant performance loss relative to the standard mel-cepstral coefficients of G.729 synthesized (coded) speech. In this paper, we investigate two approaches to recover speaker recognition performance from G.729 parameters, rather than deriving cepstra from MFBs of an all-pole spectrum. In the first approach, the G.729 LSFs are converted to "direct" cepstral coefficients for which there exists a one-to-one correspondence with the LSFs. The G.729 residual is also considered; in particular, appending G.729 pitch as a single parameter to the direct cepstral coefficients gives further performance gain. The second, nonparametric approach uses the original MFB paradigm, but adds harmonic striations to the G.729 all-pole spectral envelope. Although these methods yield considerable performance gains, we have yet to match the performance of G.729 synthesized speech, motivating the need for representing additional fine structure of the G.729 residual.

The NIST Speaker Recognition Evaluation - overview, methodology, systems, results, perspective

Published in:
Speech Commun., Vol. 31, Nos. 2-3, June 2000, pp. 225-254.

Summary

This paper, based on three presentations made in 1998 at the RLA2C Workshop in Avignon, discusses the evaluation of speaker recognition systems from several perspectives. A general discussion of the speaker recognition task and the challenges and issues involved in its evaluation is offered. The NIST evaluations in this area and specifically the 1998 evaluation, its objectives, protocols and test data, are described. The algorithms used by the systems that were developed for this evaluation are summarized, compared and contrasted. Overall performance results of this evaluation are presented by means of detection error trade-off (DET) curves. These show the performance trade-off of missed detections and false alarms for each system and the effects on performance of training condition, test segment duration, the speakers' sex and the match or mismatch of training and test handsets. Several factors that were found to have an impact on performance, including pitch frequency, handset type and noise, are discussed and DET curves showing their effects are presented. The paper concludes with some perspective on the history of this technology and where it may be going.
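As a rough illustration of the miss/false-alarm trade-off that a DET curve displays, the sketch below sweeps a decision threshold over two lists of detection scores. The function name, the "accept when score >= threshold" rule, and the example scores are assumptions for illustration, not the NIST scoring procedure; DET plots also conventionally warp both axes to a normal-deviate scale, which is omitted here.

```python
from typing import List, Sequence, Tuple


def det_points(
    target_scores: Sequence[float],
    nontarget_scores: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, P_miss, P_false_alarm) for each observed score.

    A trial is accepted when its score >= threshold (an assumed convention).
    P_miss is the fraction of target trials rejected; P_false_alarm is the
    fraction of non-target trials accepted.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        points.append((t, p_miss, p_fa))
    return points


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    for point in det_points([2.0, 3.0], [1.0, 2.5]):
        print(point)
```

Raising the threshold can only increase the miss rate and decrease the false-alarm rate, which is why the resulting points trace a single trade-off curve per system.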

Information Survivability for Mobile Wireless Systems

Published in:
Lincoln Laboratory Journal, Vol. 12, No. 1, pp. 65-80.

Summary

Mobile wireless networks are more vulnerable to cyber attack and more difficult to defend than conventional wired networks. In discussing security and survivability issues in mobile wireless networks, we focus here on group communication, as applied to multimedia conferencing. The need to conserve resources in wireless networks encourages the use of multicast protocols for group communication, which introduces additional security concerns. We point out the need for rate-adaptation techniques to simultaneously support multiple receivers that each experience different network conditions. The security properties associated with a number of approaches to rate adaptation are compared. We also identify several security issues for reliable group communication, providing examples of denial-of-service attacks and describing appropriate security measures to guard against such attacks. We examine the costs of these security measures in terms of network efficiency and computational overhead. Finally, we introduce a survivability approach called dynamically deployed protocols, in which the effects of an information attack are mitigated by dynamically switching to a new protocol to evade the attack. We suggest that this dynamic protocol deployment can be achieved effectively by transmission of in-line mobile code.

Approaches to speaker detection and tracking in conversational speech

Published in:
Digit. Signal Process., Vol. 10, No. 1, January/April/July, 2000, pp. 93-112. (Fifth Annual NIST Speaker Recognition Workshop, 3-4 June 1999.)

Summary

Two approaches to detecting and tracking speakers in multispeaker audio are described. Both approaches use an adapted Gaussian mixture model, universal background model (GMM-UBM) speaker detection system as the core speaker recognition engine. In one approach, the individual log-likelihood ratio scores, which are produced on a frame-by-frame basis by the GMM-UBM system, are used to first partition the speech file into speaker-homogeneous regions and then to create scores for these regions. We refer to this approach as internal segmentation. Another approach uses an external segmentation algorithm, based on blind clustering, to partition the speech file into speaker-homogeneous regions. The adapted GMM-UBM system then scores each of these regions as in the single-speaker recognition case. We show that the external segmentation system outperforms the internal segmentation system for both detection and tracking. In addition, we show how different components of the detection and tracking algorithms contribute to the overall system performance.
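The frame-by-frame log-likelihood ratio scores mentioned in this summary can be sketched as follows. This is a simplified illustration under stated assumptions: features are scalars rather than multidimensional cepstral vectors, each model is a hypothetical `(weights, means, variances)` tuple, and a region score is taken as the mean of its frame scores. It is not the Lincoln system's implementation and does not cover adaptation or segmentation.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical model type: (weights, means, variances) of a 1-D GMM.
Gmm = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def gmm_log_likelihood(x: float, model: Gmm) -> float:
    """log sum_k w_k * N(x; mean_k, var_k), computed with log-sum-exp."""
    weights, means, variances = model
    logs = [
        math.log(w) - 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(l - peak) for l in logs))


def frame_llrs(frames: Sequence[float], speaker: Gmm, ubm: Gmm) -> List[float]:
    """Per-frame score: log p(x | speaker) - log p(x | UBM)."""
    return [
        gmm_log_likelihood(x, speaker) - gmm_log_likelihood(x, ubm)
        for x in frames
    ]


def region_score(frames: Sequence[float], speaker: Gmm, ubm: Gmm) -> float:
    """Score a region as the average of its frame scores (an assumption)."""
    llrs = frame_llrs(frames, speaker, ubm)
    return sum(llrs) / len(llrs)
```

Positive frame scores indicate frames better explained by the speaker model than by the background model, so runs of same-sign scores give one plausible cue for where a speaker-homogeneous region begins or ends.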