Publications

Refine Results

(Filters Applied) Clear All

R&D Areas

R&D Groups

Year

Items per page

Tagged As

high performance computing Clear filter

Improving big data visual analytics with interactive virtual reality

September 15, 2015

Conference Paper

Author:

Andrew Moran

…

Published in:

HPEC 2015: IEEE Conf. on High Performance Extreme Computing, 15-17 September 2015.

Topic:

high performance computing

R&D area:

Cyber Security and Information Sciences

R&D group:

Secure Resilient Systems and Technology

Summary

For decades, the growth and volume of digital data collection has made it challenging to digest large volumes of information and extract underlying structure. Coined 'Big Data', massive amounts of information has quite often been gathered inconsistently (e.g from many sources, of various forms, at different rates, etc.). These factors impede the practices of not only processing data, but also analyzing and displaying it in an efficient manner to the user. Many efforts have been completed in the data mining and visual analytics community to create effective ways to further improve analysis and achieve the knowledge desired for better understanding. Our approach for improved big data visual analytics is two-fold, focusing on both visualization and interaction. Given geo-tagged information, we are exploring the benefits of visualizing datasets in the original geospatial domain by utilizing a virtual reality platform. After running proven analytics on the data, we intend to represent the information in a more realistic 3D setting, where analysts can achieve an enhanced situational awareness and rely on familiar perceptions to draw in-depth conclusions on the dataset. In addition, developing a human-computer interface that responds to natural user actions and inputs creates a more intuitive environment. Tasks can be performed to manipulate the dataset and allow users to dive deeper upon request, adhering to desired demands and intentions. Due to the volume and popularity of social media, we developed a 3D tool visualizing Twitter on MIT's campus for analysis. Utilizing emerging technologies of today to create a fully immersive tool that promotes visualization and interaction can help ease the process of understanding and representing big data.

READ LESS

Summary

Improving big data visual analytics with interactive virtual reality

Sampling large graphs for anticipatory analytics

September 15, 2015

Conference Paper

Author:

Lauren Milechin

…

Published in:

HPEC 2015: IEEE Conf. on High Performance Extreme Computing, 15-17 September 2015.

Topic:

big data

R&D area:

Cyber Security and Information Sciences

R&D group:

Secure Resilient Systems and Technology

Summary

The characteristics of Big Data - often dubbed the 3V's for volume, velocity, and variety - will continue to outpace the ability of computational systems to process, store, and transmit meaningful results. Traditional techniques for dealing with large datasets often include the purchase of larger systems, greater human-in-the-loop involvement, or more complex algorithms. We are investigating the use of sampling to mitigate these challenges, specifically sampling large graphs. Often, large datasets can be represented as graphs where data entries may be edges, and vertices may be attributes of the data. In particular, we present the results of sampling for the task of link prediction. Link prediction is a process to estimate the probability of a new edge forming between two vertices of a graph, and it has numerous application areas in understanding social or biological networks. In this paper we propose a series of techniques for the sampling of large datasets. In order to quantify the effect of these techniques, we present the quality of link prediction tasks on sampled graphs, and the time saved in calculating link prediction statistics on these sampled graphs.

READ LESS

Summary

Sampling large graphs for anticipatory analytics

Computing on Masked Data to improve the security of big data

April 14, 2015

Conference Paper

Author:

Vijay N. Gadepally

…

Published in:

HST 2015, IEEE Int. Conf. on Technologies for Homeland Security, 14-16 April 2015.

Topic:

big data

R&D area:

Cyber Security and Information Sciences

R&D group:

Secure Resilient Systems and Technology

Summary

Organizations that make use of large quantities of information require the ability to store and process data from central locations so that the product can be shared or distributed across a heterogeneous group of users. However, recent events underscore the need for improving the security of data stored in such untrusted servers or databases. Advances in cryptographic techniques and database technologies provide the necessary security functionality but rely on a computational model in which the cloud is used solely for storage and retrieval. Much of big data computation and analytics make use of signal processing fundamentals for computation. As the trend of moving data storage and computation to the cloud increases, homeland security missions should understand the impact of security on key signal processing kernels such as correlation or thresholding. In this article, we propose a tool called Computing on Masked Data (CMD), which combines advances in database technologies and cryptographic tools to provide a low overhead mechanism to offload certain mathematical operations securely to the cloud. This article describes the design and development of the CMD tool.

READ LESS

Summary

Computing on Masked Data to improve the security of big data

Rapid sequence identification of potential pathogens using techniques from sparse linear algebra

April 14, 2015

Conference Paper

Author:

Stephanie A. Dodson

…

Published in:

HST 2015, IEEE Int. Conf. on Technologies for Homeland Security, 14-16 April 2015.

Topic:

high performance computing

R&D area:

R&D group:

Summary

The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing has rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck. Here we present D4RAGenS, a genetic sequence identification algorithm that exhibits the Big Data handling and computational power of the Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear algebra and statistical properties to increase computational performance while retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield speed and precision tradeoffs, with applications in biodefense and medical diagnostics. The D4RAGenS analysis algorithm is tested over several datasets, including three utilized for the Defense Threat Reduction Agency (DTRA) metagenomic algorithm contest.

READ LESS

Summary

Rapid sequence identification of potential pathogens using techniques from sparse linear algebra

Genetic sequence matching using D4M big data approaches

September 9, 2014

Conference Paper

Author:

Darrell O. Ricke

…

Published in:

HPEC 2014: IEEE Conf. on High Performance Extreme Computing, 9-11 September 2014.

Topic:

high performance computing

R&D area:

R&D group:

Summary

Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M) - an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Acculumo database to accelerate computations one-hundred fold over other methods. Comparisons of the D4M method with the current gold-standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST.

READ LESS

Summary

Genetic sequence matching using D4M big data approaches

Achieving 100,000,000 database inserts per second using Accumulo and D4M

September 9, 2014

Conference Paper

Author:

Jeremy Kepner

…

Published in:

HPEC 2014: IEEE Conf. on High Performance Extreme Computing, 9-11 September 2014.

Topic:

big data

R&D area:

Cyber Security and Information Sciences

R&D group:

Secure Resilient Systems and Technology

Summary

The Apache Accumulo database is an open source relaxed consistency database that is widely used for government applications. Accumulo is designed to deliver high performance on unstructured data such as graphs of network data. This paper tests the performance of Accumulo using data from the Graph500 benchmark. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a 216-node cluster running the MIT SuperCloud software stack. A peak performance of over 100,000,000 database inserts per second was achieved which is 100x larger than the highest previously published value for any other database. The performance scales linearly with the number of ingest clients, number of database servers, and data size. The performance was achieved by adapting several supercomputing techniques to this application: distributed arrays, domain decomposition, adaptive load balancing, and single-program-multiple-data programming.

READ LESS

Summary

Achieving 100,000,000 database inserts per second using Accumulo and D4M

Big Data dimensional analysis

September 9, 2014

Conference Paper

Author:

Vijay N. Gadepally

…

Jeremy Kepner

Published in:

HPEC 2014: IEEE Conf. on High Performance Extreme Computing, 9-11 September 2014.

Topic:

high performance computing

R&D area:

Cyber Security and Information Sciences

R&D group:

Secure Resilient Systems and Technology

Summary

The ability to collect and analyze large amounts of data is a growing problem within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges faced by big data volume, velocity and variety. One of the main challenges associated with big data variety is automatically understanding the underlying structures and patterns of the data. Such an understanding is required as a pre-requisite to the application of advanced analytics to the data. Further, big data sets often contain anomalies and errors that are difficult to know a priori. Current approaches to understanding data structure are drawn from the traditional database ontology design. These approaches are effective, but often require too much human involvement to be effective for the volume, velocity and variety of data encountered by big data systems. Dimensional Data Analysis (DDA) is a proposed technique that allows big data analysts to quickly understand the overall structure of a big dataset, determine anomalies. DDA exploits structures that exist in a wide class of data to quickly determine the nature of the data and its statistical anomalies. DDA leverages existing schemas that are employed in big data databases today. This paper presents DDA, applies it to a number of data sets, and measures its performance. The overhead of DDA is low and can be applied to existing big data systems without greatly impacting their computing requirements.

READ LESS

Summary

Big Data dimensional analysis

LLGrid: supercomputer for sensor processing

April 15, 2013

Book Chapter

Author:

Jeremy Kepner

…

Published in:

Chapter 24 in Contemporary High Performance Computing, April 2013.

Topic:

supercomputing

R&D area:

Cyber Security and Information Sciences

R&D group:

Secure Resilient Systems and Technology

Summary

MIT Lincoln Laboratory is a federally funded research and development center that applies advanced technology to problems of national interest. Research and development activities focus on long-term technology development as well as rapid system prototyping and demonstration. A key part of this mission is to develop and deploy advanced sensor systems. Developing the algorithms for these systems requires interactive access to large scale computing and data storage. Deploying these systems requires that the computing and storage capabilities are transportable and energy efficient. The LLGrid system of supercomputers allows hundreds of researchers simultaneous interactive access to large amounts of processing and storage for development and testing of their sensor processing algorithms. The requirements of the LLGrid user base are as diverse as the sensors they are developing: sonar, radar, infrared, optical, hyperspectral, video, bio and cyber. However, there are two common elements: delivering large amounts of data interactively to many processors and high level user interfaces that require minimal user training. The LLGrid software stack provides these capabilities on dozens of LLGrid computing clusters across Lincoln Laboratory. LLGrid systems range from very small (a few nodes) to very large (40+ racks).

READ LESS

Summary

LLGrid: supercomputer for sensor processing

Cluster-based 3D reconstruction of aerial video

September 10, 2012

Conference Paper

Author:

Scott M. Sawyer

…

Nadya T. Bliss

Published in:

HPEC 2012: IEEE Conf. on High Performance Extreme Computing, 10-12 September 2012.

Topic:

high performance computing

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Large-scale 3D scene reconstruction using Structure from Motion (SfM) continues to be very computationally challenging despite much active research in the area. We propose an efficient, scalable processing chain designed for cluster computing and suitable for use on aerial video. The sparse bundle adjustment step, which is iterative and difficult to parallelize, is accomplished by partitioning the input image set, generating independent point clouds in parallel, and then fusing the clouds and combining duplicate points. We compare this processing chain to a leading parallel SfM implementation, which exploits fine-grained parallelism in various matrix operations and is not designed to scale beyond a multi-core workstation with GPU. We show our cluster-based approach offers significant improvement in scalability and runtime while producing comparable point cloud density and more accurate point location estimates.

READ LESS

Summary

Cluster-based 3D reconstruction of aerial video

Benchmarking parallel eigen decomposition for residuals analysis of very large graphs

September 10, 2012

Conference Paper

Author:

Edward M. Rutledge

…

Published in:

HPEC 2012: IEEE Conf. on High Performance Extreme Computing, 10-12 September 2012.

Topic:

big data

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Graph analysis is used in many domains, from the social sciences to physics and engineering. The computational driver for one important class of graph analysis algorithms is the computation of leading eigenvectors of matrix representations of a graph. This paper explores the computational implications of performing an eigen decomposition of a directed graph's symmetrized modularity matrix using commodity cluster hardware and freely available eigensolver software, for graphs with 1 million to 1 billion vertices, and 8 million to 8 billion edges. Working with graphs of these sizes, parallel eigensolvers are of particular interest. Our results suggest that graph analysis approaches based on eigen space analysis of graph residuals are feasible even for graphs of these sizes.

READ LESS

Summary

Benchmarking parallel eigen decomposition for residuals analysis of very large graphs

Publications

Refine Results

Tagged As

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Showing Results