Latent topic modeling for audio corpus summarization

August 27, 2011

Conference Paper

Author:

Timothy J. Hazen

Published in:

INTERSPEECH 2011, 27-31 August 2011, pp. 913-916.

R&D Area:

Cyber Security and Information Sciences

R&D Group:

Artificial Intelligence Technology and Systems

Summary

This work presents techniques for automatically producing a high-quality summary of the topical content of an audio corpus. Probabilistic latent semantic analysis (PLSA) is used to learn a set of latent topics in an unsupervised fashion. These latent topics are ranked by their relative importance in the corpus, and each topic is summarized by signature words that aptly describe its content. An example summarization of conversational data from the Fisher corpus is presented and evaluated to demonstrate the effectiveness of the approach.
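The pipeline described in the summary (learn latent topics with PLSA, rank them, then pick signature words per topic) can be illustrated with a minimal sketch. The abstract does not specify the ranking measure or the signature-word score, so both are assumptions here: topics are ranked by their corpus-level probability P(z), and words are scored by P(w|z) log(P(w|z)/P(w)). The function names `plsa` and `summarize` and all parameters are hypothetical, and the input is assumed to be a document-by-word count matrix (for audio, such counts would come from speech recognition output).

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA with EM on a (documents x words) count matrix.

    Returns P(z|d) with shape (docs, topics) and P(w|z) with shape
    (topics, words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initial distributions, normalized to sum to one.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w), shape (docs, topics, words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate from expected counts n(d,w) * P(z|d,w).
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

def summarize(counts, vocab, n_topics, n_words=3):
    """Return (topic weight, signature words) pairs, most important first."""
    p_z_d, p_w_z = plsa(counts, n_topics)
    doc_len = counts.sum(axis=1)
    # ASSUMPTION: importance of a topic is its corpus-level mass
    # P(z) = sum_d P(z|d) P(d), with P(d) proportional to document length.
    p_z = (p_z_d * doc_len[:, None]).sum(axis=0) / doc_len.sum()
    # ASSUMPTION: signature score P(w|z) * log(P(w|z) / P(w)), which
    # favors words that are frequent in the topic and rare elsewhere.
    p_w = counts.sum(axis=0) / counts.sum()
    score = p_w_z * np.log((p_w_z + 1e-12) / (p_w[None, :] + 1e-12))
    ranked = np.argsort(-p_z)
    return [
        (float(p_z[z]), [vocab[i] for i in np.argsort(-score[z])[:n_words]])
        for z in ranked
    ]

if __name__ == "__main__":
    # Toy corpus: two documents about pets, two about sports.
    vocab = ["dog", "cat", "pet", "game", "team", "score"]
    counts = np.array([
        [4, 3, 2, 0, 0, 0],
        [5, 2, 3, 0, 0, 0],
        [0, 0, 0, 3, 4, 2],
        [0, 0, 0, 2, 2, 5],
    ])
    for weight, words in summarize(counts, vocab, n_topics=2):
        print(f"{weight:.2f}", ", ".join(words))
```

The dense (docs, topics, words) posterior keeps the sketch short but only suits small vocabularies; a realistic corpus would need a sparse formulation, and EM results depend on the random initialization.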