Summary
Preliminary results are reported from a very simple speech-synthesis system based on clustered-diphone Kalman Filter based modeling of line-spectral frequency based features. Parameters were estimated using maximum-likelihood EM training, with a constraint enforced that prevented eigenvalue magnitudes in the transition matrix from exceeding 1. Frames of training data were assigned diphone unit labels by forced alignment with an HMM recognition system. The HMM cluster tree was also used for Kalman Filter unit cluster assignments. The result is a simple synthesis system that has few parameters, synthesizes intelligible speech without audible discontinuities, and that can be adapted using MLLR techniques to support synthesis of a broad panoply of speakers from a single base model with small amounts of training data. The result is interesting for embedded synthesis applications.