Publications

Designing a new high performance computing education strategy for professional scientists and engineers

Summary

For decades the High Performance Computing (HPC) community has used web content, workshops, and embedded HPC scientists to enable practitioners to harness the power of parallel and distributed computing. The most successful approaches, face-to-face tutorials and embedded professionals, do not scale. To create scalable, flexible educational experiences for practitioners in all phases of a career, from student to professional, we turn to Massive Open Online Courses (MOOCs). We detail the conversion of personalized tutorials to a self-paced online course. In this demonstration, we highlight a course that mimics in-person tutorials by providing personalized paths through content that interleaves theory and practice, helping researchers learn key parallel computing concepts while developing familiarity with their HPC target system.

LLMapReduce: multi-level map-reduce for high performance data analysis

Summary

The map-reduce parallel programming model has become extremely popular in the big data community. Many big data workloads can benefit from the enhanced performance offered by supercomputers. LLMapReduce provides the familiar map-reduce parallel programming model to big data users running on a supercomputer. LLMapReduce dramatically simplifies map-reduce programming by providing simple parallel programming capability in one line of code. LLMapReduce supports all programming languages and many schedulers, and works with any application without requiring modification. Furthermore, LLMapReduce can overcome scaling limits in the map-reduce parallel programming model via options that allow the user to switch to the more efficient single-program-multiple-data (SPMD) parallel programming model. These features allow users to reduce the computational overhead by more than 10x compared to standard map-reduce for certain applications. LLMapReduce is widely used by hundreds of users at MIT. Currently LLMapReduce works with several schedulers such as SLURM, Grid Engine, and LSF.
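
To make the overhead argument concrete, the back-of-envelope sketch below (not part of LLMapReduce itself) compares a standard map-reduce launch, where every input file pays a process startup cost, with an SPMD-style run in which each long-lived worker amortizes one startup over many files. The cost constants are illustrative assumptions.

# Illustrative cost model only (not the LLMapReduce interface); the
# startup and per-file costs below are assumed values.
STARTUP_COST_S = 2.0    # assumed per-process startup overhead, seconds
PER_FILE_COST_S = 0.1   # assumed per-file processing time, seconds

def files_per_worker(n_files: int, n_workers: int) -> int:
    return -(-n_files // n_workers)   # ceiling division

def mapreduce_wall(n_files: int, n_workers: int) -> float:
    """Standard map-reduce: every file pays the startup cost."""
    return files_per_worker(n_files, n_workers) * (STARTUP_COST_S + PER_FILE_COST_S)

def spmd_wall(n_files: int, n_workers: int) -> float:
    """SPMD: each worker starts once, then streams through its files."""
    return STARTUP_COST_S + files_per_worker(n_files, n_workers) * PER_FILE_COST_S

if __name__ == "__main__":
    n_files, n_workers = 1024, 64
    print(f"map-reduce wall time: {mapreduce_wall(n_files, n_workers):.1f} s")
    print(f"SPMD wall time:       {spmd_wall(n_files, n_workers):.1f} s")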

Scalability of VM provisioning systems

Summary

Virtual machines and virtualized hardware have been around for over half a century. The commoditization of the x86 platform and its rapidly growing hardware capabilities have led to recent exponential growth in the use of virtualization both in the enterprise and in high performance computing (HPC). The startup time of a virtualized environment is a key performance metric for HPC, where the runtime of any individual task is typically much shorter than the lifetime of a virtualized service in an enterprise context. In this paper, a methodology for accurately measuring startup performance on an HPC system is described. The startup performance overhead of three of the most mature, widely deployed cloud management frameworks (OpenStack, OpenNebula, and Eucalyptus) is measured to determine their suitability for workloads typically seen in an HPC environment. A 10x performance difference is observed between the fastest (Eucalyptus) and the slowest (OpenNebula) framework. This time difference is primarily due to delays waiting on networking in the cloud-init portion of startup. The methodology and measurements presented should facilitate the optimization of startup across a variety of virtualization environments.
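
As a simple illustration of one way to take such a measurement (not the harness used in the paper), the sketch below times how long a freshly provisioned VM takes to accept connections on its SSH port, which is where cloud-init and networking delays typically surface. The host address in the usage note is hypothetical.

# Minimal measurement sketch: seconds from "now" until a host accepts a
# TCP connection on port 22, a common proxy for a VM being ready.
import socket
import time

def wait_for_port(host: str, port: int = 22, timeout_s: float = 600.0) -> float:
    """Return seconds elapsed until host:port accepts a TCP connection."""
    start = time.monotonic()
    deadline = start + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return time.monotonic() - start
        except OSError:
            time.sleep(1.0)          # VM not reachable yet; poll again
    raise TimeoutError(f"{host}:{port} not reachable within {timeout_s} s")

# Usage (hypothetical address): call wait_for_port immediately after asking
# the cloud framework for a VM, e.g. startup_s = wait_for_port("10.0.0.15")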

Enabling on-demand database computing with MIT SuperCloud database management system

Summary

The MIT SuperCloud database management system allows for rapid creation and flexible execution of a variety of the latest scientific databases, including Apache Accumulo and SciDB. It is designed to permit these databases to run on a High Performance Computing Cluster (HPCC) platform as seamlessly as any other HPCC job. It ensures seamless migration of the databases to the resources assigned by the HPCC scheduler and centralized storage of the database files when the databases are not running. It also permits snapshotting of databases, so that researchers can experiment and push the limits of the technology without concern for data or productivity loss if a database becomes unstable.
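
A minimal conceptual sketch of the lifecycle described above, assuming database files live on central storage, are staged to whatever node the scheduler assigns, and are snapshotted when the database is stopped. The paths and the server launch command are hypothetical placeholders, not the actual MIT SuperCloud tooling.

# Conceptual lifecycle sketch (not the MIT SuperCloud DBMS code).
# All paths and the "db-server" command are hypothetical placeholders.
import shutil
import subprocess
import time
from pathlib import Path

CENTRAL = Path("/central/db/mydb")    # persistent, shared storage (assumed)
LOCAL = Path("/scratch/mydb")         # node-local working copy (assumed)

def start_database() -> subprocess.Popen:
    # Migrate the database files to the node assigned by the scheduler.
    shutil.copytree(CENTRAL, LOCAL, dirs_exist_ok=True)
    # Launch the database server against the local copy (placeholder command).
    return subprocess.Popen(["db-server", "--data-dir", str(LOCAL)])

def stop_database(proc: subprocess.Popen) -> Path:
    proc.terminate()
    proc.wait()
    # Snapshot back to central storage so experiments are recoverable.
    snapshot = CENTRAL.parent / f"mydb-snapshot-{int(time.time())}"
    shutil.copytree(LOCAL, snapshot)
    return snapshot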

Big data strategies for data center infrastructure management using a 3D gaming platform

Summary

High Performance Computing (HPC) is intrinsically linked to effective Data Center Infrastructure Management (DCIM). Cloud services and HPC have become key components in Department of Defense and corporate Information Technology competitive strategies in the global and commercial spaces. As a result, the reliance on consistent, reliable Data Center space is more critical than ever. The costs and complexity of providing quality DCIM are constantly being tested and evaluated by the United States Government and companies such as Google, Microsoft, and Facebook. This paper demonstrates a system in which Big Data strategies and 3D gaming technology are leveraged to successfully monitor and analyze multiple HPC systems and a lights-out modular HP EcoPOD 240a Data Center on a single platform. Big Data technology and a 3D gaming platform enable near-real-time monitoring of 5,000 environmental sensors and more than 3,500 IT data points, and display visual analytics of the overall operating condition of the Data Center from a command center over 100 miles away. In addition, the Big Data model allows for in-depth analysis of historical trends and conditions to optimize operations and achieve even greater efficiency and reliability.
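
As a loose illustration of the kind of roll-up that supports the historical trend analysis mentioned above (not the system's actual pipeline), the sketch below aggregates a stream of timestamped sensor readings into hourly averages; the CSV layout of (timestamp, sensor_id, value) is an assumption.

# Illustrative aggregation only; assumes rows of "ISO timestamp,sensor_id,value".
import csv
from collections import defaultdict
from datetime import datetime

def hourly_averages(csv_path: str) -> dict:
    """Average each sensor's readings per hour for trend analysis."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    with open(csv_path, newline="") as f:
        for ts, sensor_id, value in csv.reader(f):
            hour = datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00")
            key = (sensor_id, hour)
            sums[key] += float(value)
            counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}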

Portable Map-Reduce utility for MIT SuperCloud environment

Summary

The MIT Map-Reduce utility has been developed and deployed on the MIT SuperCloud to support scientists and engineers at MIT Lincoln Laboratory. With the MIT Map-Reduce utility, users can deploy their applications quickly onto the MIT SuperCloud infrastructure. The MIT Map-Reduce utility works with any application without requiring modification. For improved performance, the MIT Map-Reduce utility provides an option to consolidate multiple input data files per compute task into a single stream of input with minimal changes to the target application. This reduces the computational overhead associated with starting the application multiple times when a compute task handles more than one piece of input data. With a small change to a sample MATLAB image processing application, we observed an approximately 12x speedup from reducing the application startup overhead. Currently the MIT Map-Reduce utility works with several schedulers such as SLURM, Grid Engine, and LSF.
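
A minimal sketch, under assumed inputs, of the consolidation idea described above (not the MIT Map-Reduce utility itself): input files are grouped so the target application starts once per batch rather than once per file, which is where the reported roughly 12x reduction in startup overhead comes from. The file names and the process_batch stand-in are hypothetical.

# Consolidation sketch: one application start per batch of inputs.
from typing import Iterable, List

def chunk(files: List[str], files_per_task: int) -> Iterable[List[str]]:
    """Yield consecutive groups of input files, one group per compute task."""
    for i in range(0, len(files), files_per_task):
        yield files[i:i + files_per_task]

def process_batch(batch: List[str]) -> None:
    # Placeholder for the user's application, modified to accept a list
    # of inputs (a single stream) rather than a single file.
    for path in batch:
        print(f"processing {path}")

if __name__ == "__main__":
    inputs = [f"image_{i:04d}.png" for i in range(100)]   # hypothetical inputs
    for batch in chunk(inputs, files_per_task=10):
        process_batch(batch)     # in practice, each batch is one scheduler task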

Achieving 100,000,000 database inserts per second using Accumulo and D4M

Summary

The Apache Accumulo database is an open source relaxed consistency database that is widely used for government applications. Accumulo is designed to deliver high performance on unstructured data such as graphs of network data. This paper tests the performance of Accumulo using data from the Graph500 benchmark. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a 216-node cluster running the MIT SuperCloud software stack. A peak performance of over 100,000,000 database inserts per second was achieved, which is 100x larger than the highest previously published value for any other database. The performance scales linearly with the number of ingest clients, number of database servers, and data size. The performance was achieved by adapting several supercomputing techniques to this application: distributed arrays, domain decomposition, adaptive load balancing, and single-program-multiple-data programming.
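
To illustrate the domain-decomposition and SPMD techniques named above (a conceptual sketch, not the benchmark code), the snippet below splits an edge list evenly across ingest clients so each client inserts only its own contiguous block. insert_edges is a hypothetical stand-in for the real database client call, and in an SPMD run each rank would execute as a separate process.

# Domain-decomposition sketch for parallel ingest; sizes are illustrative.
def my_block(n_edges: int, my_rank: int, n_clients: int) -> range:
    """Indices of the edges this ingest client is responsible for."""
    per_client = -(-n_edges // n_clients)        # ceiling division
    start = my_rank * per_client
    return range(start, min(start + per_client, n_edges))

def insert_edges(indices: range) -> None:
    # Placeholder for batched inserts into the database.
    print(f"inserting edges {indices.start}..{indices.stop - 1}")

if __name__ == "__main__":
    N_EDGES, N_CLIENTS = 1_000_000, 216
    for rank in range(N_CLIENTS):                # in SPMD, each rank is its own process
        insert_edges(my_block(N_EDGES, rank, N_CLIENTS))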

D4M 2.0 Schema: a general purpose high performance schema for the Accumulo database

Summary

Non-traditional, relaxed consistency, triple store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high performance open source relaxed consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed Dimensional Data Model (D4M) [http://www.mit.edu/~kepner/D4M] provides a uniform mathematical framework based on associative arrays that encompasses both traditional (i.e., SQL) and non-traditional databases. For non-traditional databases D4M naturally leads to a general purpose schema that can be used to fully index and rapidly query every unique string in a dataset. The D4M 2.0 Schema has been applied with little or no customization to cyber, bioinformatics, scientific citation, free text, and social media data. The D4M 2.0 Schema is simple, requires minimal parsing, and achieves the highest published Accumulo ingest rates. The benefits of the D4M 2.0 Schema are independent of the D4M interface. Any interface to Accumulo can achieve these benefits by using the D4M 2.0 Schema.
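
The sketch below illustrates the exploded-schema idea underlying the D4M 2.0 Schema, simplified relative to the published design (which also maintains supporting tables, e.g. for column degree counts): each record becomes sparse (row, "field|value") entries, and a companion transpose table keyed by "field|value" makes every unique string directly queryable. The example record is hypothetical.

# Simplified exploded-schema sketch; not a complete D4M 2.0 implementation.
def explode(record_id: str, record: dict) -> list:
    """Return (row, column, value) triples for one record."""
    return [(record_id, f"{field}|{value}", "1") for field, value in record.items()]

def transpose(triples: list) -> list:
    """Swap row and column so 'field|value' strings become row keys."""
    return [(col, row, val) for row, col, val in triples]

if __name__ == "__main__":
    rec = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "bytes": "512"}  # hypothetical record
    edges = explode("row_0001", rec)
    edges_t = transpose(edges)
    print(edges)      # entries for the main table
    print(edges_t)    # entries for the transpose table, enabling fast column queries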

LLGrid: supercomputer for sensor processing

Summary

MIT Lincoln Laboratory is a federally funded research and development center that applies advanced technology to problems of national interest. Research and development activities focus on long-term technology development as well as rapid system prototyping and demonstration. A key part of this mission is to develop and deploy advanced sensor systems. Developing the algorithms for these systems requires interactive access to large scale computing and data storage. Deploying these systems requires that the computing and storage capabilities are transportable and energy efficient. The LLGrid system of supercomputers allows hundreds of researchers simultaneous interactive access to large amounts of processing and storage for development and testing of their sensor processing algorithms. The requirements of the LLGrid user base are as diverse as the sensors they are developing: sonar, radar, infrared, optical, hyperspectral, video, bio and cyber. However, there are two common elements: delivering large amounts of data interactively to many processors and high level user interfaces that require minimal user training. The LLGrid software stack provides these capabilities on dozens of LLGrid computing clusters across Lincoln Laboratory. LLGrid systems range from very small (a few nodes) to very large (40+ racks).