Twitter language identification of similar languages and dialects without ground truth

April 3, 2017

Conference Paper

Author:

Jennifer A. Williams

…

Cagri K. Dagli

Published in:

Proc. 4th Workshop on NLP for Similar Languages, Varieties and Dialects, 3 April 2017, pp. 73-83.

R&D Area:

Cyber Security and Information Sciences

R&D Group:

Artificial Intelligence Technology and Systems

Summary

We present a new bootstrapping method for filtering the Twitter language ID labels in our dataset for automatic language identification (LID). Our method combines geolocation, the original Twitter LID labels, and Amazon Mechanical Turk to resolve missing and unreliable labels. We are the first to compare LID classification performance using the MIRA algorithm and langid.py. We show that classifiers reach high accuracy on different versions of our dataset using only Twitter data, without ground truth, and with very few training examples. We also show how Platt scaling can be used to calibrate MIRA classifier output values into a probability distribution over candidate classes, making the output more intuitive. Our method allows for fine-grained distinctions between similar languages and dialects and allows us to rediscover the language composition of our Twitter dataset.
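
The calibration step mentioned in the summary can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy scores, the class labels, and the one-vs-rest normalization are all assumptions made for illustration. Platt scaling fits a sigmoid p = 1 / (1 + exp(A*s + B)) to raw classifier scores s; here A and B are fit by plain gradient descent on the log loss, and the per-class probabilities are then normalized so they sum to one.

```python
import math

def platt_prob(score, a, b):
    """Map a raw classifier score to a probability with a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(a * score + b))

def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit A and B by gradient descent on the negative log-likelihood.

    scores: raw classifier outputs; labels: 1 for the target class, else 0.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = platt_prob(s, a, b)
            # For p = 1 / (1 + exp(a*s + b)), dNLL/da = (y - p) * s
            # and dNLL/db = (y - p).
            grad_a += (y - p) * s
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrated_distribution(raw_scores, params):
    """Calibrate each class score, then normalize to a distribution.

    raw_scores: {label: raw score}; params: {label: (A, B)}.
    Normalizing one-vs-rest outputs is an assumption of this sketch.
    """
    probs = {k: platt_prob(s, *params[k]) for k, s in raw_scores.items()}
    total = sum(probs.values())
    return {k: p / total for k, p in probs.items()}

# Toy held-out scores and binary labels (hypothetical, not from the paper).
scores = [-2.0, -1.0, -0.5, 0.5, 0.2, 1.0, 2.0, -0.3]
labels = [0, 0, 1, 0, 1, 1, 1, 0]
a, b = fit_platt(scores, labels)

# Hypothetical class labels; the same (A, B) is reused for brevity.
raw = {"lang_a": 1.2, "lang_b": 0.4, "lang_c": -0.9}
dist = calibrated_distribution(raw, {k: (a, b) for k in raw})
```

In practice each class would get its own (A, B) pair fit on held-out scores, and a library routine with a more robust optimizer would usually replace the hand-written gradient descent.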