Publication Abstract

Klusacek, D., Navratil, J., Reynolds, D. A., and Campbell, J. P., Conditional Pronunciation Modeling in Speaker Detection. Special Session on Exploiting High-Level Information for High-Performance Speaker Recognition. In Proc. International Conference on Acoustics, Speech, and Signal Processing in Hong Kong, IEEE, April 2003.

Abstract

In this paper, we present a conditional pronunciation modeling method for the speaker detection task that does not rely on acoustic vectors. Aiming at exploiting higher-level information carried by the speech signal, it uses time-aligned streams of phones and phonemes to model a speaker's specific pronunciation. Our system uses phonemes drawn from a lexicon of pronunciations of words recognized by an automatic speech recognition system to generate the phoneme stream and an open-loop phone recognizer to generate a phone stream. The phoneme and phone streams are aligned at the frame level and conditional probabilities of a phone, given a phoneme, are estimated using co-occurrence counts. A likelihood detector is then applied to these probabilities.
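The estimation step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the add-alpha smoothing, the background model, and the use of an average per-frame log-likelihood ratio as the detector score are all assumptions made for illustration, and the phoneme and phone labels are toy values.

```python
import math
from collections import Counter


def cooccurrence_counts(phonemes, phones):
    """Count frame-level (phoneme, phone) pairs from two aligned streams."""
    if len(phonemes) != len(phones):
        raise ValueError("streams must be frame-aligned and of equal length")
    return Counter(zip(phonemes, phones))


def conditional_probs(counts, phone_inventory, alpha=1.0):
    """Estimate P(phone | phoneme) from co-occurrence counts.

    Add-alpha smoothing is an assumption of this sketch; the abstract
    does not say how unseen pairs are handled.
    """
    totals = Counter()
    for (phoneme, _phone), c in counts.items():
        totals[phoneme] += c
    size = len(phone_inventory)
    probs = {}
    for phoneme, total in totals.items():
        denom = total + alpha * size
        probs[phoneme] = {
            ph: (counts[(phoneme, ph)] + alpha) / denom for ph in phone_inventory
        }
    return probs


def llr_score(test_phonemes, test_phones, speaker, background):
    """Average per-frame log-likelihood ratio of speaker vs. background model.

    Frames whose phoneme or phone is missing from either model are skipped
    (a simplification chosen for this sketch).
    """
    total, n = 0.0, 0
    for pm, ph in zip(test_phonemes, test_phones):
        if pm in speaker and pm in background:
            if ph in speaker[pm] and ph in background[pm]:
                total += math.log(speaker[pm][ph]) - math.log(background[pm][ph])
                n += 1
    return total / n if n else 0.0


# Toy frame-aligned streams (hypothetical labels, one entry per frame).
inventory = ["ae", "eh", "t"]
spk = conditional_probs(
    cooccurrence_counts(["AE", "AE", "AE", "T"], ["ae", "eh", "ae", "t"]), inventory
)
bkg = conditional_probs(
    cooccurrence_counts(["AE", "AE", "AE", "T"], ["eh", "eh", "ae", "t"]), inventory
)
score = llr_score(["AE", "AE", "T"], ["ae", "ae", "t"], spk, bkg)
```

A positive score here means the test frames are better explained by the speaker's pronunciation model than by the background model; a detector would compare the score against a threshold.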

Performance is measured using the NIST Extended Data paradigm and the Switchboard-I corpus. Using 8 training conversations for enrollment, a 2.1% equal error rate was achieved. Extensions and alternatives, as well as fusion experiments, are presented and discussed.
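For readers unfamiliar with the metric, the equal error rate is the operating point at which the miss rate and the false-alarm rate are equal. The sketch below is one simple way to approximate it from detector scores by sweeping thresholds; the function name and the toy scores are hypothetical, and it does not reproduce the NIST evaluation procedure or the paper's results.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold. The returned
    value is the mean of the miss and false-alarm rates at the threshold
    where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (miss + false_alarm) / 2
    return best_eer


# Toy scores for illustration only (not data from the paper).
targets = [2.0, 1.5, 0.2, 1.0]
impostors = [-1.0, 0.5, -0.3, 0.1]
eer = equal_error_rate(targets, impostors)
```

With real evaluations the number of trials is large, so the miss and false-alarm curves cross much more smoothly than in this four-trial toy case.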
