Using K-means in SVR-based text difficulty estimation

September 20, 2019

Conference Paper

Author:

Raymond H. Budd

…

Published in:

8th ISCA Workshop on Speech and Language Technology in Education, SLaTE, 20-21 September 2019.

R&D Area:

Cyber Security and Information Sciences

R&D Group:

Artificial Intelligence Technology and Systems

Summary

Second-language learners, educators, and test creators all face the challenge of identifying authentic materials at the right level of difficulty. In this work, we present an approach to automatically measuring text difficulty, integrated into Auto-ILR, a web-based system that helps find suitably leveled text material for learners in 18 languages. The Auto-ILR subscription service scans web feeds, extracts article content, evaluates its difficulty, and notifies users of documents that match their skill level. Difficulty is measured on the standard Interagency Language Roundtable (ILR) scale with language-specific support vector machine regression (SVR) models built from vectors incorporating length features, term frequencies, relative entropy, and K-means clustering.
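The summary names the four feature families but not how they are combined, so the following is only a minimal sketch of one plausible way to assemble such a vector. Everything in it is an assumption made for illustration: the function names, the use of a corpus-average term distribution as the reference for relative entropy, the encoding of K-means output as a one-hot cluster membership, and the toy documents. The regression step is not shown; vectors of this kind could be passed to an SVR implementation such as scikit-learn's `sklearn.svm.SVR`.

```python
import math
from collections import Counter

def term_frequencies(tokens, vocab):
    """Relative frequency of each vocabulary term in one document."""
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[t] / total for t in vocab]

def relative_entropy(p, q, eps=1e-9):
    """KL divergence D(p || q) in bits; eps keeps the ratio finite when q is 0."""
    return sum(pi * math.log2(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm, seeded with the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def feature_vector(tokens, vocab, background, centroids):
    """Assumed layout: [length feats | term freqs | relative entropy | cluster one-hot]."""
    tf = term_frequencies(tokens, vocab)
    n = max(len(tokens), 1)
    length_feats = [float(len(tokens)), sum(len(t) for t in tokens) / n]
    one_hot = [0.0] * len(centroids)
    one_hot[nearest(tf, centroids)] = 1.0
    return length_feats + tf + [relative_entropy(tf, background)] + one_hot

# Toy corpus (hypothetical data, for illustration only).
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "quantum entanglement decoherence entanglement".split(),
]
vocab = sorted({t for d in docs for t in d})
tfs = [term_frequencies(d, vocab) for d in docs]
background = [sum(col) / len(tfs) for col in zip(*tfs)]  # corpus-average distribution
centroids = kmeans(tfs, k=2)
vectors = [feature_vector(d, vocab, background, centroids) for d in docs]
```

Each row of `vectors` would be paired with a human-assigned ILR level to train one regression model per language. A real system would also need to scale the features, since raw token counts and term frequencies differ by orders of magnitude.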