Unsupervised topic clustering of switchboard speech messages

May 7, 1996

Conference Paper

Author:

Beth Carlson Kittel

Published in:

Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 7-10 May 1996, pp. 315-318.

R&D Area:

Cyber Security and Information Sciences

R&D Group:

Artificial Intelligence Technology and Systems

Unsupervised topic clustering of switchboard speech messages

Summary

This paper presents a statistical technique which can be used to automatically group speech data records based on the similarity of their content. A tree-based clustering algorithm is used to generate a hierarchical structure for the corpus. This structure can then be used to guide the search for similar material in data from other corpora. The SWITCHBOARD Speech Corpus was used to demonstrate these techniques, since it provides sets of speech files which are nominally on the same topic. Excellent automatic clustering was achieved on the truth text transcripts provided with the SWITCHBOARD corpus, with an average cluster purity of 97.3%. Degraded clustering was achieved using the output transcriptions of a speech recognizer, with a clustering purity of 61.4%.