Publications

Global pattern search at scale

Summary

In recent years, data collection has far outpaced the tools for data analysis in the area of non-traditional GEOINT analysis. Traditional tools are designed to analyze small-scale numerical data, but there are few good interactive tools for processing large amounts of unstructured data such as raw text. In addition to the complexities of data processing, presenting the data in a way that is meaningful to the end user poses another challenge. In our work, we focused on analyzing a corpus of 35,000 news articles and creating an interactive geovisualization tool to reveal patterns to human analysts. Our comprehensive tool, Global Pattern Search at Scale (GPSS), addresses three major problems in data analysis: free text analysis, high volumes of data, and interactive visualization. GPSS uses an Accumulo database for high-volume data storage, and a matrix of word counts and event detection algorithms to process the free text. For visualization, the tool displays an interactive web application to the user, featuring a map overlaid with document clusters and events, search and filtering options, a timeline, and a word cloud. In addition, the GPSS tool can be easily adapted to process and understand other large free-text datasets.
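
The "matrix of word counts" mentioned above can be pictured as a document-term matrix with one row per article and one column per vocabulary word. The following is a minimal sketch under stated assumptions, not the GPSS implementation: the function name `word_count_matrix` and the simple lowercase regular-expression tokenizer are hypothetical choices made only for illustration.

```python
import re
from collections import Counter


def word_count_matrix(docs):
    """Build a dense document-term count matrix (illustrative only).

    Returns (vocab, matrix), where matrix[i][j] is the number of times
    vocab[j] occurs in docs[i]. The tokenizer is an assumption: lowercase
    runs of letters and apostrophes.
    """
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    column = {word: j for j, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for word, count in Counter(tokens).items():
            matrix[i][column[word]] = count
    return vocab, matrix


if __name__ == "__main__":
    vocab, matrix = word_count_matrix(
        ["Flood hits city.", "City council meets; city votes"]
    )
    print(vocab)
    for row in matrix:
        print(row)
```

At the scale described in the abstract, a dense list of lists would likely be replaced by a sparse representation, since most words do not appear in most documents.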

Spectral anomaly detection in very large graphs: Models, noise, and computational complexity

Published in:
Proceedings of Seminar 14461: High-performance Graph Algorithms and Applications in Computational Science, Wadern, Germany

Summary

Anomaly detection in massive networks has numerous theoretical and computational challenges, especially as the behavior to be detected becomes small in comparison to the larger network. This presentation focuses on recent results in three key technical areas, specifically geared toward spectral methods for detection.

Sparse matrix partitioning for parallel eigenanalysis of large static and dynamic graphs

Published in:
HPEC 2014: IEEE Conf. on High Performance Extreme Computing, 9-11 September 2014.

Summary

Numerous applications focus on the analysis of entities and the connections between them, and such data are naturally represented as graphs. In particular, the detection of a small subset of vertices with anomalous coordinated connectivity is of broad interest, for problems such as detecting strange traffic in a computer network or unknown communities in a social network. These problems become more difficult as the background graph grows larger and noisier and the coordination patterns become more subtle. In this paper, we discuss the computational challenges of a statistical framework designed to address this cross-mission challenge. The statistical framework is based on spectral analysis of the graph data, and three partitioning methods are evaluated for computing the principal eigenvector of the graph's residuals matrix. While a standard one-dimensional partitioning technique enables this computation for up to four billion vertices, the communication overhead prevents this method from being used for even larger graphs. Recent two-dimensional partitioning methods are shown to have much more favorable scaling properties. A data-dependent partitioning method, which has the best scaling performance, is also shown to improve computation time even as a graph changes over time, allowing amortization of the upfront cost.

Spectral subgraph detection with corrupt observations

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 4-9 May 2014.

Summary

Recent work on signal detection in graph-based data focuses on classical detection when the signal and noise are both in the form of discrete entities and their relationships. In practice, the relationships of interest may not be directly observable, or may be observed through a noisy mechanism. The effects of imperfect observations add another layer of difficulty to the detection problem, beyond the effects of typical random fluctuations in the background graph. This paper analyzes the impact on detection performance of several error and corruption mechanisms for graph data. In relatively simple scenarios, the change in signal and noise power is analyzed, and this is demonstrated empirically in more complicated models. It is shown that, with enough side information, it is possible to fully recover performance equivalent to working with uncorrupted data using a Bayesian approach, and a simpler cost-optimization approach is shown to provide a substantial benefit as well.

Effective parallel computation of eigenpairs to detect anomalies in very large graphs

Published in:
SIAM Conference on Parallel Processing for Scientific Computing

Summary

The computational driver for an important class of graph analysis algorithms is the computation of leading eigenvectors of matrix representations of the graph. In this presentation, we discuss the challenges of calculating eigenvectors of modularity matrices derived from very large graphs (upwards of a billion vertices) and demonstrate the scaling properties of parallel eigensolvers when applied to these matrices.
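
As a small-scale illustration of the computation described above, the sketch below finds the leading eigenpair of a modularity matrix B = A - k k^T / (2m), where A is the adjacency matrix, k the degree vector, and m the number of edges. It is a serial toy under stated assumptions, not the parallel eigensolver from the presentation: it uses shifted power iteration, and the function names, the adjacency-list input, and the shift of twice the maximum degree are all choices made here for illustration. The key point it shows is that B never has to be formed explicitly, because B x can be computed from a sparse product A x and a rank-one correction.

```python
import math
import random


def modularity_matvec(adj, degrees, two_m, x):
    """Return B @ x for B = A - k k^T / (2m) without forming B."""
    k_dot_x = sum(k * xi for k, xi in zip(degrees, x))
    return [
        sum(x[j] for j in adj[i]) - degrees[i] * k_dot_x / two_m
        for i in range(len(adj))
    ]


def leading_eigenpair(adj, iters=1000, tol=1e-12, seed=0):
    """Shifted power iteration on the modularity matrix (illustrative).

    adj is an undirected adjacency list with at least one edge.
    Shifting by 2 * max degree makes B + shift * I positive semidefinite,
    so the iteration converges to the algebraically largest eigenvalue.
    """
    n = len(adj)
    degrees = [len(nbrs) for nbrs in adj]
    two_m = sum(degrees)
    shift = 2 * max(degrees)
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]
    for _ in range(iters):
        bx = modularity_matvec(adj, degrees, two_m, x)
        y = [b + shift * xi for b, xi in zip(bx, x)]
        norm = math.sqrt(sum(v * v for v in y))
        y = [v / norm for v in y]
        delta = max(abs(a - b) for a, b in zip(x, y))
        x = y
        if delta < tol:
            break
    bx = modularity_matvec(adj, degrees, two_m, x)
    eigenvalue = sum(a * b for a, b in zip(x, bx))  # Rayleigh quotient
    return eigenvalue, x


if __name__ == "__main__":
    # Two triangles joined by the single edge 2-3.
    graph = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
    value, vector = leading_eigenpair(graph)
    print(value, vector)
```

On the two-triangle example, the signs of the eigenvector entries separate the two triangles, which is the kind of community structure the leading modularity eigenvector is commonly used to expose.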

Very large graphs for information extraction (VLG) - summary of first-year proof-of-concept study

Summary

In numerous application domains relevant to the Department of Defense and the Intelligence Community, data of interest take the form of entities and the relationships between them, and these data are commonly represented as graphs. Under the Very Large Graphs for Information Extraction effort--a one-year proof-of-concept study--MIT LL developed novel techniques for anomalous subgraph detection, building on tools in the signal processing research literature. This report documents the technical results of this effort. Two datasets--a snapshot of Thomson Reuters' Web of Science database and a stream of web proxy logs--were parsed, and graphs were constructed from the raw data. From the phenomena in these datasets, several algorithms were developed to model the dynamic graph behavior, including a preferential attachment mechanism with memory, a streaming filter to model a graph as a weighted average of its past connections, and a generalized linear model for graphs where connection probabilities are determined by additional side information or metadata. A set of metrics was also constructed to facilitate comparison of techniques. The study culminated in a demonstration of the algorithms on the datasets of interest, in addition to simulated data. Performance in terms of detection, estimation, and computational burden was measured according to the metrics. Among the highlights of this demonstration were the detection of emerging coauthor clusters in the Web of Science data, detection of botnet activity in the web proxy data after 15 minutes (which took 10 days to detect using state-of-the-practice techniques), and demonstration of the core algorithm on a simulated 1-billion-vertex graph using a commodity computing cluster.
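
One way to picture "a streaming filter to model a graph as a weighted average of its past connections" is an exponentially weighted moving average over edge weights, with the residual taken as the difference between the newest snapshot and that average. This is only a sketch of the general idea; the report's actual filter is not specified here, and the function names, the dictionary-of-edges representation, and the smoothing parameter `alpha` are assumptions introduced for illustration.

```python
def update_expected(expected, observed, alpha=0.5):
    """Blend a new snapshot into the running expectation (illustrative).

    expected and observed map an edge (u, v) to a weight.
    new_expected = (1 - alpha) * expected + alpha * observed
    """
    updated = {edge: (1.0 - alpha) * w for edge, w in expected.items()}
    for edge, w in observed.items():
        updated[edge] = updated.get(edge, 0.0) + alpha * w
    return updated


def residuals(expected, observed):
    """Observed minus expected weight for every edge seen in either graph."""
    edges = set(expected) | set(observed)
    return {e: observed.get(e, 0.0) - expected.get(e, 0.0) for e in edges}


if __name__ == "__main__":
    expected = {}
    for snapshot in [{("a", "b"): 1.0}, {("a", "b"): 1.0}]:
        expected = update_expected(expected, snapshot)
    # A connection never seen before stands out with a large residual.
    print(residuals(expected, {("a", "c"): 1.0}))
```

Edges that have recurred in past snapshots produce small residuals, while new or vanished connections produce large ones, which is what a downstream detector would examine.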

Efficient anomaly detection in dynamic, attributed graphs: emerging phenomena and big data

Published in:
ISI 2013: IEEE Int. Conf. on Intelligence and Security Informatics, 4-7 June 2013.

Summary

When working with large-scale network data, the interconnected entities often have additional descriptive information. This additional metadata may provide insight that can be exploited for detection of anomalous events. In this paper, we use a generalized linear model for random attributed graphs to model connection probabilities using vertex metadata. For a class of such models, we show that an approximation to the exact model yields an exploitable structure in the edge probabilities, allowing for efficient scaling of a spectral framework for anomaly detection through analysis of graph residuals, and a fast and simple procedure for estimating the model parameters. In simulation, we demonstrate that taking into account both attributes and dynamics in this analysis has a much more significant impact on the detection of an emerging anomaly than accounting for either dynamics or attributes alone. We also present an analysis of a large, dynamic citation graph, demonstrating that taking additional document metadata into account emphasizes parts of the graph that would not be considered significant otherwise.
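
To make the idea of modeling connection probabilities from vertex metadata concrete, the sketch below uses a logistic link with a single binary feature: whether two vertices share the same attribute value. This is a toy instance of a generalized linear model for edges, not the model or the approximation from the paper; the feature, the parameter names `beta0` and `beta_same`, and the function names are hypothetical.

```python
import math


def edge_probability(attr_i, attr_j, beta0, beta_same):
    """Logistic edge probability from one metadata feature (illustrative).

    eta = beta0 + beta_same * [attr_i == attr_j], p = 1 / (1 + exp(-eta)).
    """
    same = 1.0 if attr_i == attr_j else 0.0
    eta = beta0 + beta_same * same
    return 1.0 / (1.0 + math.exp(-eta))


def edge_residual(observed, attr_i, attr_j, beta0, beta_same):
    """Observed edge indicator (0 or 1) minus its modeled probability."""
    return observed - edge_probability(attr_i, attr_j, beta0, beta_same)


if __name__ == "__main__":
    # With beta_same > 0, vertices sharing an attribute connect more often.
    print(edge_probability("physics", "physics", -2.0, 1.5))
    print(edge_probability("physics", "biology", -2.0, 1.5))
    print(edge_residual(1, "physics", "biology", -2.0, 1.5))
```

An observed edge between vertices the model considers unlikely to connect yields a large positive residual, which is the quantity a spectral analysis of graph residuals would operate on.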

Sparse Volterra systems: theory and practice

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 25-31 May 2013.

Summary

Nonlinear effects limit analog circuit performance, causing both in-band and out-of-band distortion. The classical Volterra series provides an accurate model of many nonlinear systems, but the number of parameters grows extremely quickly as the memory depth and polynomial order are increased. Recently, concepts from compressed sensing have been applied to nonlinear system modeling in order to address this issue. This work investigates the theory and practice of applying compressed sensing techniques to nonlinear system identification under the constraints of typical radio frequency (RF) laboratories. The main theoretical result shows that these techniques are capable of identifying sparse Memory Polynomials using only single-tone training signals rather than pseudorandom noise. Empirical results using laboratory measurements of an RF receiver show that sparse Generalized Memory Polynomials can also be recovered from two-tone signals.
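
For readers unfamiliar with the model class, a memory polynomial is commonly written as y[n] = sum over k and m of a_{k,m} x[n-m] |x[n-m]|^(k-1), where k is the polynomial order and m the delay. The sketch below evaluates such a model from a sparse coefficient dictionary in which absent entries are zero. It illustrates only the forward model, not the compressed sensing identification procedure from the paper; the function name and the dictionary layout are assumptions made for illustration.

```python
def memory_polynomial(x, coeffs):
    """Evaluate a sparse memory polynomial (illustrative forward model).

    x is a sequence of real or complex samples.
    coeffs maps (order k, delay m) to a_km; missing pairs are zero.
    Samples before the start of x are treated as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for (k, m), a in coeffs.items():
            if n - m >= 0:
                s = x[n - m]
                acc += a * s * abs(s) ** (k - 1)
        y.append(acc)
    return y


if __name__ == "__main__":
    # A linear term plus one third-order term with a one-sample delay.
    sparse_coeffs = {(1, 0): 1.0, (3, 1): 0.5}
    print(memory_polynomial([1.0, 2.0, 3.0], sparse_coeffs))
```

Sparsity here means that only a few (k, m) pairs carry nonzero coefficients, which is what keeps the parameter count manageable as memory depth and polynomial order grow.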

Detection theory for graphs

Summary

Graphs are fast emerging as a common data structure used in many scientific and engineering fields. While a wide variety of techniques exist to analyze graph datasets, practitioners currently lack a signal processing theory akin to that of detection and estimation in the classical setting of vector spaces with Gaussian noise. Using practical detection examples involving large, random "background" graphs and noisy real-world datasets, the authors present a novel graph analytics framework that allows for uncued analysis of very large datasets. This framework combines traditional computer science techniques with signal processing in the context of graph data, creating a new research area at the intersection of the two fields.

Characterization of traffic and structure in the U.S. airport network

Summary

In this paper we seek to characterize traffic in the U.S. air transportation system, and to subsequently develop improved models of traffic demand. We model the air traffic within the U.S. national airspace system as a dynamic weighted network. We employ techniques advanced by work in complex networks over the past several years in characterizing the structure and dynamics of the U.S. airport network. We show that the airport network is more dynamic over successive days than has been previously reported. The network has some properties that appear stationary over time, while others exhibit a high degree of variation. We characterize the network and its dynamics using structural measures such as degree distributions and clustering coefficients. We employ spectral analysis to show that dominant eigenvectors of the network are nearly stationary with time. We use this observation to suggest how low-dimensional models of traffic demand in the airport network can be fashioned.
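
The structural measures named above can be sketched on a toy network as follows. This example treats the network as unweighted and undirected, which is a simplification of the dynamic weighted network in the paper; the function names and the four-node example graph are hypothetical.

```python
from collections import Counter
from itertools import combinations


def degree_distribution(adj):
    """Count how many nodes have each degree. adj maps node -> set of neighbors."""
    return Counter(len(neighbors) for neighbors in adj.values())


def clustering_coefficient(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


if __name__ == "__main__":
    # Hypothetical network: a triangle A-B-C with a spoke from C to D.
    network = {
        "A": {"B", "C"},
        "B": {"A", "C"},
        "C": {"A", "B", "D"},
        "D": {"C"},
    }
    print(dict(degree_distribution(network)))
    for airport in sorted(network):
        print(airport, clustering_coefficient(network, airport))
```

Computing these measures separately for each day's network is one simple way to see which properties stay stable and which vary over time.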