Publications

Refine Results

(Filters Applied) Clear All

R&D Areas

R&D Groups

Year

Items per page

Channel robust speaker verification via feature mapping

April 6, 2003

Conference Paper

Author:

Douglas A. Reynolds

Published in:

Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. II, 6-10 April 2003, pp. II-53 - II-56.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

In speaker recognition applications, channel variability is a major cause of errors. Techniques in the feature, model and score domains have been applied to mitigate channel effects. In this paper we present a new feature mapping technique that maps feature vectors into a channel independent space. The feature mapping learns mapping parameters from a set of channel-dependent models derived for a channel-dependent models derived from a channel-independent model via MAP adaptation. The technique is developed primarily for speaker verification, but can be applied for feature normalization in speech recognition applications. Results are presented on NIST landline and cellular telephone speech corpora where it is shown that feature mapping provides significant performance improvements over baseline systems and similar performance to Hnorm and Speaker-Model-Synthesis (SMS).

READ LESS

Summary

Channel robust speaker verification via feature mapping

Conditional pronunciation modeling in speaker detection

April 6, 2003

Conference Paper

Author:

David Klusacek

…

Published in:

Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 6-10 April 2003.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

In this paper, we present a conditional pronunciation modeling method for the speaker detection task that does not rely on acoustic vectors. Aiming at exploiting higher-level information carried by the speech signal, it uses time-aligned streams of phones and phonemes to model a speaker's specific Pronunciation. Our system uses phonemes drawn from a lexicon of pronunciations of words recognized by an automatic speech recognition system to generate the phoneme stream and an open-loop phone recognizer to generate a phone stream. The phoneme and phone streams are aligned at the frame level and conditional probabilities of a phone, given a phoneme, are estimated using co-occurrence counts. A likelihood detector is then applied to these probabilities. Performance is measured using the NIST Extended Data paradigm and the Switchboard-I corpus. Using 8 training conversations for enrollment, a 2.1% equal error rate was achieved. Extensions and alternatives, as well as fusion experiments, are presented and discussed.

READ LESS

Summary

Conditional pronunciation modeling in speaker detection

Phonetic speaker recognition using maximum-likelihood binary-decision tree models

April 6, 2003

Conference Paper

Author:

Jiri Navratil

…

Published in:

Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, Vol. 4, 6-10 April 2003.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Recent work in phonetic speaker recognition has shown that modeling phone sequences using n-grams is a viable and effective approach to speaker recognition, primarily aiming at capturing speaker-dependent pronunciation and also word usage. This paper describes a method involving binary-tree-structured statistical models for extending the phonetic context beyond that of standard n-grams (particularly bigrams) by exploiting statistical dependencies within a longer sequence window without exponentially increasing the model complexity, as is the case with n-grams. Two ways of dealing with data sparsity are also studied, namely, model adaptation and a recursive bottom-up smoothing of symbol distributions. Results obtained under a variety of experimental conditions using the NIST 2001 Speaker Recognition Extended Data Task indicate consistent improvements in equal-error rate performance as compared to standard bigram models. The described approach confirms the relevance of long phonetic context in phonetic speaker recognition and represents an intermediate stage between short phone context and word-level modeling without the need for any lexical knowledge, which suggests its language independence.

READ LESS

Summary

Phonetic speaker recognition using maximum-likelihood binary-decision tree models

The SuperSID project : exploiting high-level information for high-accuracy speaker recognition

April 6, 2003

Conference Paper

Author:

Douglas A. Reynolds

…

Published in:

Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 4, 6-10 April 2003, pp. IV-784 - IV-787.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown examples that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. This paper provides an overview of the structures, data, task, tools, and accomplishments of this project. Wide ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. In this paper we show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIS extended data task to 0.2% - a 71% relative reduction in error over the previous state of the art.

READ LESS

Summary

The SuperSID project : exploiting high-level information for high-accuracy speaker recognition

Using prosodic and conversational features for high-performance speaker recognition : report from JHU WS'02

April 6, 2003

Conference Paper

Author:

B. Peskin

…

Published in:

Proc. IEEE Int. Conf. on Acoustics, speech, and Signal Processing, ICASSP, Vol. IV, 6-10 April 2003, pp. IV-792 - IV-795.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

While there has been a long tradition of research seeking to use prosodic features, especially pitch, in speaker recognition systems, results have generally been disappointing when such features are used in isolation and only modest improvements have been set when used in conjunction with traditional cepstral GMM systems. In contrast, we report here on work from the JHU 2002 Summer Workshop exploring a range of prosodic features, using as testbed NIST's 2001 Extended Data task. We examined a variety of modeling techniques, such as n-gram models of turn-level prosodic features and simple vectors of summary statistics per conversation side scored by kth nearest-neighbor classifiers. We found that purely prosodic models were able to achieve equal error rates of under 10%, and yielded significant gains when combined with more traditional systems. We also report on exploratory work on "conversational" features, capturing properties of the interaction across conversion sides, such as turn-taking patterns.

READ LESS

Summary

Using prosodic and conversational features for high-performance speaker recognition : report from JHU WS'02

Phonetic speaker recognition with support vector machines

January 1, 2003

Conference Paper

Author:

William M. Campbell

…

Published in:

Adv. in Neural Information Processing Systems 16, 2003 Conf., 8-13 December 2003, p. 1377-1384.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

A recent area of significant progress in speaker recognition is the use of high level features-idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.

READ LESS

Summary

Phonetic speaker recognition with support vector machines

Modeling prosodic dynamics for speaker recognition

January 1, 2003

Conference Paper

Author:

A. G. Adami

…

Published in:

Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 4, 6-10 April 2003, pp. IV-788 - IV-791.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Most current state-of-the-art automatic speaker recognition systems extract speaker-dependent features by looking at short-term spectral information. This approach ignores long-term information that can convey supra-segmental information, such as prosodics and speaking style. We propose two approaches that use the fundamental frequency and energy trajectories to capture long-term information. The first approach uses bigram models to model the dynamics of the fundamental frequency and energy trajectories for each speaker. The second approach uses the fundamental frequency trajectories of a pre-defined set of works as the speaker templates and then, using dynamic time warping, computes the distance between templates and the works from the test message. The results presented in this work are on Switchboard 1 using the NIS extended date evaluation design. We show that these approaches can achieve an equal error rate of 3.7% which is a 77% relative improvement over a system based on short-term pitch and energy features alone.

READ LESS

Summary

Modeling prosodic dynamics for speaker recognition

2-D processing of speech with application to pitch estimation

September 16, 2002

Conference Paper

Author:

Thomas F. Quatieri

Published in:

7th Int. Conf. on Spoken Language Processing, ICSLP 2002, 16-20 September 2002.

Topic:

speech enhancement

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

In this paper, we introduce a new approach to two-dimensional (2-D) processing of the one-dimensional (1-D) speech signal in the time-frequency plane. Specifically, we obtain the shortspace 2-D Fourier transform magnitude of a narrowband spectrogram of the signal and show that this 2-D transformation maps harmonically-related signal components to a concentrated entity in the new 2-D plane. We refer to this series of operations as the "grating compression transform" (GCT), consistent with sine-wave grating patterns in the spectrogram reduced to smeared impulses. The GCT forms the basis of a speech pitch estimator that uses the radial distance to the largest peak in the GCT plane. Using an average magnitude difference between pitch-contour estimates, the GCT-based pitch estimator is shown to compare favorably to a sine-wave-based pitch estimator for all-voiced speech in additive white noise. An extension to a basis for two-speaker pitch estimation is also proposed.

READ LESS

Summary

2-D processing of speech with application to pitch estimation

Approaches to language identification using Gaussian mixture models and shifted delta cepstral features

September 16, 2002

Conference Paper

Author:

Pedro A. Torres-Carrasquillo

…

Published in:

Proc. Int. Conf. on Spoken Language Processing, INTERSPEECH, 16-20 September 2002, pp. 33-36, 82-92.

Topic:

language recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Published results indicate that automatic language identification (LID) systems that rely on multiple-language phone recognition and n-gram language modeling produce the best performance in formal LID evaluations. By contrast, Gaussian mixture model (GMM) systems, which measure acoustic characteristics, are far more efficient computationally but have tended to provide inferior levels of performance. This paper describes two GMM-based approaches to language identification that use shifted delta cepstra (SDC) feature vectors to achieve LID performance comparable to that of the best phone-based systems. The approaches include both acoustic scoring and a recently developed GMM tokenization system that is based on a variation of phonetic recognition and language modeling. System performance is evaluated on both the CallFriend and OGI corpora.

READ LESS

Summary

Approaches to language identification using Gaussian mixture models and shifted delta cepstral features

An overview of automatic speaker recognition technology

May 13, 2002

Conference Paper

Author:

Douglas A. Reynolds

Published in:

Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. IV, 13-17 May 2002, pp. IV-4072 - IV-4075.

Topic:

speech processing

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

In this paper we provide a brief overview of the area of speaker recognition, describing applications, underlying techniques and some indications, of performance. Following this overview we will discuss some of the strengths and weaknesses of current speaker recognition technologies and outline some potential future trends in research, development and applications conducting other speech interactions (background verification). As speaker and speech recognition system merge and speech recognition accuracy improves, the distinction between text- independent and -dependent applications will decrease. Of the two basic tasks, text-dependent speaker verification is currently

READ LESS

Summary

An overview of automatic speaker recognition technology

Publications

Refine Results

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Showing Results