Summary
This paper presents a statistical technique for automatically grouping speech data records by the similarity of their content. A tree-based clustering algorithm generates a hierarchical structure for the corpus, and this structure can then guide the search for similar material in other corpora. The SWITCHBOARD Speech Corpus was used to demonstrate these techniques, since it provides sets of speech files that are nominally on the same topic. Automatic clustering of the truth text transcripts provided with the SWITCHBOARD corpus was excellent, with an average cluster purity of 97.3%. Clustering of the output transcriptions of a speech recognizer was degraded, with a cluster purity of 61.4%.
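
The purity figures quoted above can be made concrete with a small sketch. The summary does not state how purity is computed, so the code below assumes one common definition: the purity of a cluster is the fraction of its members that carry the cluster's most frequent topic label, and the average is an unweighted mean over clusters. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cluster_purities(cluster_ids, topic_labels):
    """Map each cluster id to the fraction of its members sharing the majority topic.

    Assumed definition of purity; the paper may compute it differently.
    """
    members = {}
    for cid, topic in zip(cluster_ids, topic_labels):
        members.setdefault(cid, []).append(topic)
    return {
        cid: Counter(topics).most_common(1)[0][1] / len(topics)
        for cid, topics in members.items()
    }


def average_purity(cluster_ids, topic_labels):
    """Unweighted mean of the per-cluster purities (an assumption, see above)."""
    purities = cluster_purities(cluster_ids, topic_labels)
    return sum(purities.values()) / len(purities)


# Toy example: five conversations assigned to two clusters.
clusters = [0, 0, 0, 1, 1]
topics = ["pets", "pets", "music", "music", "music"]
print(cluster_purities(clusters, topics))
print(average_purity(clusters, topics))
```

Under this definition a purity of 100% means every cluster contains conversations on a single topic. A size-weighted mean would be an equally reasonable choice and would give a different figure whenever cluster sizes differ.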