In this paper we examine various ways to derive confidence measures for a language identification system based on phone recognition followed by language modeling. We also describe how an evaluation metric can be applied to measure the "goodness" of the different confidence measures. Experiments are conducted on the 1996 NIST Language Identification Evaluation corpus, which is derived from the CallFriend corpus of conversational telephone speech. The system is trained on the NIST 96 development data and evaluated on the NIST 96 evaluation data. Results indicate that we can predict the performance of the system and quantitatively evaluate how well this prediction holds on new data.