Publications

Refine Results

(Filters Applied) Clear All

Big data strategies for data center infrastructure management using a 3D gaming platform

Summary

High Performance Computing (HPC) is intrinsically linked to effective Data Center Infrastructure Management (DCIM). Cloud services and HPC have become key components in Department of Defense and corporate Information Technology competitive strategies in the global and commercial spaces. As a result, the reliance on consistent, reliable Data Center space is more critical than ever. The costs and complexity of providing quality DCIM are constantly being tested and evaluated by the United States Government and companies such as Google, Microsoft and Facebook. This paper will demonstrate a system where Big Data strategies and 3D gaming technology is leveraged to successfully monitor and analyze multiple HPC systems and a lights-out modular HP EcoPOD 240a Data Center on a singular platform. Big Data technology and a 3D gaming platform enables the relative real time monitoring of 5000 environmental sensors, more than 3500 IT data points and display visual analytics of the overall operating condition of the Data Center from a command center over 100 miles away. In addition, the Big Data model allows for in depth analysis of historical trends and conditions to optimize operations achieving even greater efficiencies and reliability.
READ LESS

Summary

High Performance Computing (HPC) is intrinsically linked to effective Data Center Infrastructure Management (DCIM). Cloud services and HPC have become key components in Department of Defense and corporate Information Technology competitive strategies in the global and commercial spaces. As a result, the reliance on consistent, reliable Data Center space is...

READ MORE

Portable Map-Reduce utility for MIT SuperCloud environment

Summary

The MIT Map-Reduce utility has been developed and deployed on the MIT SuperCloud to support scientists and engineers at MIT Lincoln Laboratory. With the MIT Map-Reduce utility, users can deploy their applications quickly onto the MIT SuperCloud infrastructure. The MIT Map-Reduce utility can work with any applications without the need for any modifications. For improved performance, the MIT Map-Reduce utility provides an option to consolidate multiple input data files per compute task as a single stream of input with minimal changes to the target application. This enables users to reduce the computational overhead associated with the cost of multiple application starting up when dealing with more than one piece of input data per compute task. With a small change in a sample MATLAB image processing application, we have observed approximately 12x speed up by reducing the application startup overhead. Currently the MIT Map-Reduce utility can work with several schedulers such as SLURM, Grid Engine and LSF.
READ LESS

Summary

The MIT Map-Reduce utility has been developed and deployed on the MIT SuperCloud to support scientists and engineers at MIT Lincoln Laboratory. With the MIT Map-Reduce utility, users can deploy their applications quickly onto the MIT SuperCloud infrastructure. The MIT Map-Reduce utility can work with any applications without the need...

READ MORE

Global pattern search at scale

Summary

In recent years, data collection has far outpaced the tools for data analysis in the area of non-traditional GEOINT analysis. Traditional tools are designed to analyze small-scale numerical data, but there are few good interactive tools for processing large amounts of unstructured data such as raw text. In addition to the complexities of data processing, presenting the data in a way that is meaningful to the end user poses another challenge. In our work, we focused on analyzing a corpus of 35,000 news articles and creating an interactive geovisualization tool to reveal patterns to human analysts. Our comprehensive tool, Global Pattern Search at Scale (GPSS), addresses three major problems in data analysis: free text analysis, high volumes of data, and interactive visualization. GPSS uses an Accumulo database for high-volume data storage, and a matrix of word counts and event detection algorithms to process the free text. For visualization, the tool displays an interactive web application to the user, featuring a map overlaid with document clusters and events, search and filtering options, a timeline, and a word cloud. In addition, the GPSS tool can be easily adapted to process and understand other large free-text datasets.
READ LESS

Summary

In recent years, data collection has far outpaced the tools for data analysis in the area of non-traditional GEOINT analysis. Traditional tools are designed to analyze small-scale numerical data, but there are few good interactive tools for processing large amounts of unstructured data such as raw text. In addition to...

READ MORE

Achieving 100,000,000 database inserts per second using Accumulo and D4M

Summary

The Apache Accumulo database is an open source relaxed consistency database that is widely used for government applications. Accumulo is designed to deliver high performance on unstructured data such as graphs of network data. This paper tests the performance of Accumulo using data from the Graph500 benchmark. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a 216-node cluster running the MIT SuperCloud software stack. A peak performance of over 100,000,000 database inserts per second was achieved which is 100x larger than the highest previously published value for any other database. The performance scales linearly with the number of ingest clients, number of database servers, and data size. The performance was achieved by adapting several supercomputing techniques to this application: distributed arrays, domain decomposition, adaptive load balancing, and single-program-multiple-data programming.
READ LESS

Summary

The Apache Accumulo database is an open source relaxed consistency database that is widely used for government applications. Accumulo is designed to deliver high performance on unstructured data such as graphs of network data. This paper tests the performance of Accumulo using data from the Graph500 benchmark. The Dynamic Distributed...

READ MORE

D4M 2.0 Schema: a general purpose high performance schema for the Accumulo database

Summary

Non-traditional, relaxed consistency, triple store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high performance open source relaxed consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed Dimensional Data Model (D4M) [http://www.mit.edu/~kepner/D4M] provides a uniform mathematical framework based on associative arrays that encompasses both traditional (i.e., SQL) and non-traditional databases. For non-traditional databases D4M naturally leads to a general purpose schema that can be used to fully index and rapidly query every unique string in a dataset. The D4M 2.0 Schema has been applied with little or no customization to cyber, bioinformatics, scientific citation, free text, and social media data. The D4M 2.0 Schema is simple, requires minimal parsing, and achieves the highest published Accumulo ingest rates. The benefits of the D4M 2.0 Schema are independent of the D4M interface. Any interface to Accumulo can achieve these benefits by using the D4M 2.0 Schema.
READ LESS

Summary

Non-traditional, relaxed consistency, triple store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high performance open source relaxed consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using...

READ MORE

LLSuperCloud: sharing HPC systems for diverse rapid prototyping

Summary

The supercomputing and enterprise computing arenas come from very different lineages. However, the advent of commodity computing servers has brought the two arenas closer than they have ever been. Within enterprise computing, commodity computing servers have resulted in the development of a wide range of new cloud capabilities: elastic computing, virtualization, and data hosting. Similarly, the supercomputing community has developed new capabilities in heterogeneous, massively parallel hardware and software. Merging the benefits of enterprise clouds and supercomputing has been a challenging goal. Significant effort has been expended in trying to deploy supercomputing capabilities on cloud computing systems. These efforts have resulted in unreliable, low performance solutions, which requires enormous expertise to maintain. LLSuperCloud provides a novel solution to the problem of merging enterprise cloud and supercomputing technology. More specifically LLSuperCloud reverses the traditional paradigm of attempting to deploy supercomputing capabilities on a cloud and instead deploys cloud capabilities on a supercomputer. The result is a system that can handle heterogeneous, massively parallel workloads while also providing high performance elastic computing, virtualization, and databases. The benefits of LLSuperCloud are highlighted using a mixed workload of C MPI, parallel MATLAB, Java, databases, and virtualized web services.
READ LESS

Summary

The supercomputing and enterprise computing arenas come from very different lineages. However, the advent of commodity computing servers has brought the two arenas closer than they have ever been. Within enterprise computing, commodity computing servers have resulted in the development of a wide range of new cloud capabilities: elastic computing...

READ MORE

D4M 2.0 Schema: a general purpose high performance schema for the Accumulo database

Summary

Non-traditional, relaxed consistency, triple store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high performance open source relaxed consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed Dimensional Data Model (D4M) [http://www.mit.edu/~kepner/D4M] provides a uniform mathematical framework based on associative arrays that encompasses both traditional (i.e., SQL) and non-traditional databases. For non-traditional databases D4M naturally leads to a general purpose schema that can be used to fully index and rapidly query every unique string in a dataset. The D4M 2.0 Schema has been applied with little or no customization to cyber, bioinformatics, scientific citation, free text, and social media data. The D4M 2.0 Schema is simple, requires minimal parsing, and achieves the highest published Accumulo ingest rates. The benefits of the D4M 2.0 Schema are independent of the D4M interface. Any interface to Accumulo can achieve these benefits by using the D4M 2.0 Schema.
READ LESS

Summary

Non-traditional, relaxed consistency, triple store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high performance open source relaxed consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using...

READ MORE

LLGrid: supercomputer for sensor processing

Summary

MIT Lincoln Laboratory is a federally funded research and development center that applies advanced technology to problems of national interest. Research and development activities focus on long-term technology development as well as rapid system prototyping and demonstration. A key part of this mission is to develop and deploy advanced sensor systems. Developing the algorithms for these systems requires interactive access to large scale computing and data storage. Deploying these systems requires that the computing and storage capabilities are transportable and energy efficient. The LLGrid system of supercomputers allows hundreds of researchers simultaneous interactive access to large amounts of processing and storage for development and testing of their sensor processing algorithms. The requirements of the LLGrid user base are as diverse as the sensors they are developing: sonar, radar, infrared, optical, hyperspectral, video, bio and cyber. However, there are two common elements: delivering large amounts of data interactively to many processors and high level user interfaces that require minimal user training. The LLGrid software stack provides these capabilities on dozens of LLGrid computing clusters across Lincoln Laboratory. LLGrid systems range from very small (a few nodes) to very large (40+ racks).
READ LESS

Summary

MIT Lincoln Laboratory is a federally funded research and development center that applies advanced technology to problems of national interest. Research and development activities focus on long-term technology development as well as rapid system prototyping and demonstration. A key part of this mission is to develop and deploy advanced sensor...

READ MORE

Driving big data with big compute

Summary

Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide unique capabilities for storing and processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community and MPI clusters provide high parallel efficiency for compute intensive workloads. Bringing the big data and big compute communities together is an active area of research. The LLGrid team has developed and deployed a number of technologies that aim to provide the best of both worlds. LLGrid MapReduce allows the map/reduce parallel programming model to be used quickly and efficiently in any language on any compute cluster. D4M (Dynamic Distributed Dimensional Data Model) provided a high level distributed arrays interface to the Apache Accumulo database. The accessibility of these technologies is assessed by measuring the effort to use these tools and is typically a few lines of code. The performance is assessed by measuring the insert rate into the Accumulo database. Using these tools a database insert rate of 4M inserts/second has been achieved on an 8 node cluster.
READ LESS

Summary

Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide unique capabilities for storing and processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community and MPI clusters provide high parallel efficiency for compute intensive workloads. Bringing the...

READ MORE

HPC-VMs: virtual machines in high performance computing systems

Published in:
HPEC 2012: IEEE Conf. on High Performance Extreme Computing, 10-12 September 2012.

Summary

The concept of virtual machines dates back to the 1960s. Both IBM and MIT developed operating system features that enabled user and peripheral time sharing, the underpinnings of which were early virtual machines. Modern virtual machines present a translation layer of system devices between a guest operating system and the host operating system executing on a computer system, while isolating each of the guest operating systems from each other. In the past several years, enterprise computing has embraced virtual machines to deploy a wide variety of capabilities from business management systems to email server farms. Those who have adopted virtual deployment environments have capitalized on a variety of advantages including server consolidation, service migration, and higher service reliability. But they have also ended up with some challenges including a sacrifice in performance and more complex system management. Some of these advantages and challenges also apply to HPC in virtualized environments. In this paper, we analyze the effectiveness of using virtual machines in a high performance computing (HPC) environment. We propose adding some virtual machine capability to already robust HPC environments for specific scenarios where the productivity gained outweighs the performance lost for using virtual machines. Finally, we discuss an implementation of augmenting virtual machines into the software stack of a HPC cluster, and we analyze the affect on job launch time of this implementation.
READ LESS

Summary

The concept of virtual machines dates back to the 1960s. Both IBM and MIT developed operating system features that enabled user and peripheral time sharing, the underpinnings of which were early virtual machines. Modern virtual machines present a translation layer of system devices between a guest operating system and the...

READ MORE