Publications

Detection and characterization of human trafficking networks using unsupervised scalable text template matching

Summary

Human trafficking is a form of modern-day slavery affecting an estimated 40 million victims worldwide, primarily through the commercial sexual exploitation of women and children. In the last decade, the advertising of victims has moved from the streets to websites on the Internet, providing greater efficiency and anonymity for sex traffickers. This shift has allowed traffickers to list their victims in multiple geographic areas simultaneously, while also improving operational security by using multiple methods of electronic communication with buyers, complicating the ability of law enforcement to disrupt these illicit organizations. In this paper, we address this issue and present a novel unsupervised and scalable template matching algorithm for analyzing and detecting complex organizations operating on adult service websites. The algorithm uses only the advertisement content to uncover signature patterns in text that are indicative of organized activities and organizational structure. We apply this method to a large corpus of adult service advertisements retrieved from backpage.com, and show that the networks identified through the algorithm match well with surrogate truth data derived from phone number networks in the same corpus. Further exploration of the results shows that the proposed method provides deeper insights into the complex structures of sex trafficking organizations that are not possible through networks derived from phone numbers alone. This method provides a powerful new capability for law enforcement to more completely identify and gather evidence about trafficking networks and their operations.

LLTools: machine learning for human language processing

Summary

Machine learning methods in Human Language Technology have reached a stage of maturity where widespread use is both possible and desirable. The MIT Lincoln Laboratory LLTools software suite takes a step towards this goal by providing a set of easily accessible frameworks for incorporating speech, text, and entity resolution components into larger applications. For the speech processing component, the pySLGR (Speaker, Language, Gender Recognition) tool provides signal processing, standard feature analysis, speech utterance embedding, and machine learning modeling methods in Python. The text processing component in LLTools extracts semantically meaningful insights from unstructured data via entity extraction, topic modeling, and document classification. The entity resolution component in LLTools provides approximate string matching, author recognition, and graph-based methods for identifying and linking different instances of the same real-world entity. We show through two applications that LLTools can be used to rapidly create and train research prototypes for human language processing.

Predicting and analyzing factors in patent litigation

Published in:
30th Conf. on Neural Information Processing Systems, NIPS 2016, 5-10 December 2016.

Summary

Patent litigation is an expensive and time-consuming process. To minimize its impact on the participants in the patent lifecycle, automatic determination of litigation potential is a compelling machine learning application. In this paper, we consider preliminary methods for the prediction of a patent being involved in litigation using metadata, content, and graph features. Metadata features are top-level, easily extractable features, e.g., assignee and number of claims. The content feature performs lexical analysis of the claims associated with a patent. Graph features use relational learning to summarize patent references. We apply our methods to US patents using a labeled data set. Prior work has focused on metadata-only features, but we show that both graph and content features have significant predictive capability. Additionally, fusing all features results in improved performance. We also perform a preliminary examination of some of the qualitative factors that may have significant importance in patent litigation.

Making #sense of #unstructured text data

Published in:
30th Conf. on Neural Info. Processing Syst., NIPS 2016, 5-10 December 2016.

Summary

Automatic extraction of intelligent and useful information from data is one of the main goals in data science. Traditional approaches have focused on learning from structured features, i.e., information in a relational database. However, most of the data encountered in practice are unstructured (e.g., social media posts, forums, emails, and web logs); they do not have a predefined schema or format. In this work, we examine unsupervised methods for processing unstructured text data, extracting relevant information, and transforming it into structured information that can then be leveraged in various applications such as graph analysis and matching entities across different platforms. Various algorithms have been proposed for processing unstructured text data. At a top level, text can be either summarized by document-level features (e.g., language, topic, genre) or analyzed at a word or sub-word level. Text analytics can be unsupervised, semi-supervised, or supervised. In this work, we focus on word analysis and unsupervised methods. Unsupervised (or semi-supervised) methods require less human annotation and can easily fulfill the role of automatic analysis. For text analysis, we focus on methods for finding relevant words in the text. Specifically, we look at social media data and attempt to predict hashtags for users' posts. The resulting hashtags can be used for downstream processing such as graph analysis. Automatic hashtag annotation is closely related to automatic tag extraction and keyword extraction. Techniques for hashtag extraction include topic analysis, supervised classifiers, machine translation methods, and collaborative filtering. Methods for keyword extraction include graph-based and topical analysis of text.
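
As a rough illustration of "finding relevant words in the text" without labels, the sketch below ranks the words of one post by a plain TF-IDF score and proposes the top ones as hashtags. TF-IDF is a stand-in chosen here for brevity; the summary does not say which unsupervised method the paper uses, and the names `tokenize` and `suggest_hashtags` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def suggest_hashtags(posts, index, top_k=2):
    """Suggest hashtags for posts[index] by ranking its words with TF-IDF.

    Illustrative stand-in only, not the method of the paper summarized above.
    """
    docs = [tokenize(p) for p in posts]
    n_docs = len(docs)
    # Document frequency: the number of posts in which each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    # Term frequency within the target post.
    term_freq = Counter(docs[index])
    scores = {
        word: count * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ["#" + word for word in ranked[:top_k]]


posts = [
    "the marathon was great, marathon training paid off",
    "the weather was great today",
    "the new phone was expensive",
]
print(suggest_hashtags(posts, 0, top_k=1))
```

Words that occur in every post, such as "the", score zero and drop to the bottom of the ranking, which is why no stop-word list is needed in this small example.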

Matching community structure across online social networks

Published in:
arXiv, 3 August 2016.

Summary

The discovery of community structure in networks has been a problem of considerable interest in recent years. In online social networks, users are often simultaneously involved in multiple social media sites, some of which share common social relationships. It is of great interest to uncover a shared community structure across these networks. However, in reality, users typically identify themselves with different usernames across social media sites. This creates great difficulty in detecting the community structure. In this paper, we explore several approaches for community detection across online social networks with limited knowledge of username alignment across the networks. We refer to the known alignment of usernames as seeds. We investigate strategies for seed selection and their impact on networks with different fractions of overlapping vertices. The goal is to study the interplay between network topologies and seed selection strategies, and to understand how it affects the detected community structure. We also propose several measures to assess the performance of community detection and use them to measure the quality of the detected communities in both Twitter-Twitter networks and Twitter-Instagram networks.

Cross-domain entity resolution in social media

Summary

The challenge of associating entities across multiple domains is a key problem in social media understanding. Successful cross-domain entity resolution provides integration of information from multiple sites to create a complete picture of user and community activities, characteristics, and trends. In this work, we examine the problem of entity resolution across Twitter and Instagram using general techniques. Our methods fall into three categories: profile, content, and graph based. For the profile-based methods, we consider techniques based on approximate string matching. For content-based methods, we perform author identification. Finally, for graph-based methods, we apply novel cross-domain community detection methods and generate neighborhood-based features. The three categories of methods are applied to a large graph of users in Twitter and Instagram to understand challenges, determine performance, and understand fusion of multiple methods. Final results demonstrate an equal error rate of less than 1%.
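
To make the profile-based idea concrete, here is a minimal sketch of approximate string matching between usernames on two sites, using the `SequenceMatcher` class from Python's standard `difflib` module. The normalization rule and the function names `username_similarity` and `best_match` are assumptions made for illustration; the summary does not specify which string-matching technique the paper uses.

```python
from difflib import SequenceMatcher


def username_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two usernames."""
    # Assumed normalization: lowercase and drop "_" and "." so that
    # "Jane_Doe" and "jane.doe" compare as identical strings.
    norm_a = a.lower().replace("_", "").replace(".", "")
    norm_b = b.lower().replace("_", "").replace(".", "")
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def best_match(username, candidates):
    """Pick the candidate username most similar to `username`."""
    return max(candidates, key=lambda c: username_similarity(username, c))


# Hypothetical usernames: one account on site A, three candidates on site B.
print(best_match("jane_doe", ["john_smith", "janedoe88", "xyz"]))
```

In a fused system of the kind the summary describes, a score like this would be one feature among several, combined with content-based and graph-based evidence rather than used alone.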
