Publication Abstract

Jin, Q., Navratil, J., Reynolds, D. A., Campbell, J. P., Andrews, W. D., Abramson, J. S., Combining Cross-Stream and Time Dimensions in Phonetic Speaker Recognition. Special Session on Exploiting High-Level Information for High-Performance Speaker Recognition. In Proc. International Conference on Acoustics, Speech, and Signal Processing in Hong Kong, IEEE, April 2003.

Abstract

Recent studies show that phonetic sequences from multiple languages can provide effective features for speaker recognition. So far, only pronunciation dynamics in the time dimension, i.e., n-gram modeling of each phone sequence separately, have been examined. At the JHU 2002 Summer Workshop, we explored modeling the statistical pronunciation dynamics across the phone streams of multiple languages (the cross-stream dimension) as a complement to the time dimension. We found that bigram modeling in the cross-stream dimension achieves better performance than modeling in the time dimension on the NIST 2001 Speaker Recognition Evaluation Extended Data Task. Moreover, a linear combination of the scores from both dimensions further improves performance, showing that the two dimensions contain complementary information.
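To make the two dimensions concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes (a simplification) that the phone streams from two language-specific recognizers are already aligned position by position, so a "cross-stream bigram" is taken to be the pair of phones the two recognizers emit at the same position, while a "time bigram" is a pair of consecutive phones within one stream. The scoring function (a count-weighted average log-likelihood ratio against a background model, with a fixed probability floor) and the fusion weight are hypothetical choices made for illustration, and the toy phone data is invented.

```python
import math
from collections import Counter


def time_bigrams(stream):
    """Time-dimension bigrams: consecutive phones within a single stream."""
    return Counter(zip(stream, stream[1:]))


def cross_stream_bigrams(stream_a, stream_b):
    """Cross-stream bigrams: phones from two streams at the same aligned position.

    ASSUMPTION: the streams are already aligned one-to-one; real recognizer
    outputs have different segmentations and would need alignment first.
    """
    return Counter(zip(stream_a, stream_b))


def to_probs(counts):
    """Maximum-likelihood bigram probabilities from counts."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


def llr_score(test_counts, speaker_probs, background_probs, floor=1e-4):
    """Count-weighted average log-likelihood ratio (speaker vs. background).

    The probability floor for unseen bigrams is a hypothetical smoothing choice.
    """
    num, den = 0.0, 0
    for gram, c in test_counts.items():
        p_spk = speaker_probs.get(gram, floor)
        p_bkg = background_probs.get(gram, floor)
        num += c * math.log(p_spk / p_bkg)
        den += c
    return num / den if den else 0.0


def fuse(score_time, score_cross, weight=0.5):
    """Linear score-level combination; `weight` is a hypothetical tuning parameter."""
    return weight * score_time + (1.0 - weight) * score_cross


# Invented toy data: two aligned streams (e.g. English and German recognizers).
spk_en, spk_de = ["ae", "b", "k", "ae", "b"], ["a", "b", "k", "a", "p"]
bkg_en, bkg_de = ["ih", "t", "s", "ae", "t"], ["i", "t", "s", "a", "t"]
test_en, test_de = ["ae", "b", "k"], ["a", "b", "k"]

s_time = llr_score(time_bigrams(test_en),
                   to_probs(time_bigrams(spk_en)),
                   to_probs(time_bigrams(bkg_en)))
s_cross = llr_score(cross_stream_bigrams(test_en, test_de),
                    to_probs(cross_stream_bigrams(spk_en, spk_de)),
                    to_probs(cross_stream_bigrams(bkg_en, bkg_de)))
print(s_time, s_cross, fuse(s_time, s_cross))
```

Because the time and cross-stream scores are computed from different bigram inventories, they can carry different evidence about the speaker; the linear `fuse` step simply weights the two, and in practice the weight would be tuned on development data.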
