Publications

Refine Results

(Filters Applied) Clear All

Detecting virus exposure during the pre-symptomatic incubation period using physiological data

Summary

Early pathogen exposure detection allows better patient care and faster implementation of public health measures (patient isolation, contact tracing). Existing exposure detection most frequently relies on overt clinical symptoms, namely fever, during the infectious prodromal period. We have developed a robust machine learning method to better detect asymptomatic states during the incubation period using subtle, sub-clinical physiological markers. Using high-resolution physiological data from non-human primate studies of Ebola and Marburg viruses, we pre-processed the data to reduce short-term variability and normalize diurnal variations, then provided these to a supervised random forest classification algorithm. In most subjects detection is achieved well before the onset of fever; subject cross-validation lead to 52±14h mean early detection (at >0.90 area under the receiver-operating characteristic curve). Cross-cohort tests across pathogens and exposure routes also lead to successful early detection (28±16h and 43±22h, respectively). We discuss which physiological indicators are most informative for early detection and options for extending this capability to lower data resolution and wearable, non-invasive sensors.
READ LESS

Summary

Early pathogen exposure detection allows better patient care and faster implementation of public health measures (patient isolation, contact tracing). Existing exposure detection most frequently relies on overt clinical symptoms, namely fever, during the infectious prodromal period. We have developed a robust machine learning method to better detect asymptomatic states during...

READ MORE

Benchmarking SciDB data import on HPC systems

Summary

SciDB is a scalable, computational database management system that uses an array model for data storage. The array data model of SciDB makes it ideally suited for storing and managing large amounts of imaging data. SciDB is designed to support advanced analytics in database, thus reducing the need for extracting data for analysis. It is designed to be massively parallel and can run on commodity hardware in a high performance computing (HPC) environment. In this paper, we present the performance of SciDB using simulated image data. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a cluster running the MIT SuperCloud software stack. A peak performance of 2.2M database inserts per second was achieved on a single node of this system. We also show that SciDB and the D4M toolbox provide more efficient ways to access random sub-volumes of massive datasets compared to the traditional approaches of reading volumetric data from individual files. This work describes the D4M and SciDB tools we developed and presents the initial performance results. This performance was achieved by using parallel inserts, a in-database merging of arrays as well as supercomputing techniques, such as distributed arrays and single-program-multiple-data programming.
READ LESS

Summary

SciDB is a scalable, computational database management system that uses an array model for data storage. The array data model of SciDB makes it ideally suited for storing and managing large amounts of imaging data. SciDB is designed to support advanced analytics in database, thus reducing the need for extracting...

READ MORE

A data-stream classification system for investigating terrorist threats

Published in:
Proc. SPIE 9851, Next-Generation Analyst IV, 98510L (May 12, 2016); doi:10.1117/12.2224104.

Summary

The role of cyber forensics in criminal investigations has greatly increased in recent years due to the wealth of data that is collected and available to investigators. Physical forensics has also experienced a data volume and fidelity revolution due to advances in methods for DNA and trace evidence analysis. Key to extracting insight is the ability to correlate across multi-modal data, which depends critically on identifying a touch-point connecting the separate data streams. Separate data sources may be connected because they refer to the same individual, entity or event. In this paper we present a data source classification system tailored to facilitate the investigation of potential terrorist activity. This taxonomy is structured to illuminate the defining characteristics of a particular terrorist effort and designed to guide reporting to decision makers that is complete, concise, and evidence-based. The classification system has been validated and empirically utilized in the forensic analysis of a simulated terrorist activity. Next-generation analysts can use this schema to label and correlate across existing data streams, assess which critical information may be missing from the data, and identify options for collecting additional data streams to fill information gaps.
READ LESS

Summary

The role of cyber forensics in criminal investigations has greatly increased in recent years due to the wealth of data that is collected and available to investigators. Physical forensics has also experienced a data volume and fidelity revolution due to advances in methods for DNA and trace evidence analysis. Key...

READ MORE

D4M and large array databases for management and analysis of large biomedical imaging data

Summary

Advances in medical imaging technologies have enabled the acquisition of increasingly large datasets. Current state-of-the-art confocal or multi-photon imaging technology can produce biomedical datasets in excess of 1 TB per dataset. Typical approaches for analyzing large datasets rely on downsampling the original datasets or leveraging distributed computing resources where small subsets of images are processed independently. These approaches require significant overhead on the part of the programmer to load the desired sub-volume from an array of image files into memory. Databases are well suited for indexing and retrieving components of very large datasets and show significant promise for the analysis of 3D volumetric images. In particular, array-based databases such as SciDB utilize an architecture that supports massive parallel processing while also providing database services such as data management and fast parallel queries. In this paper, we will present a new set of tools that leverage the D4M (Dynamic Distributed Dimensional Data Model) toolbox for analyzing giga-voxel biomedical datasets. By combining SciDB and the D4M toolbox, we demonstrate that we can access large volumetric data and perform large-scale bioinformatics analytics efficiently and interactively. We show that it is possible to achieve an ingest rate of 2.8 million entries per second for importing large datasets into SciDB. These tools provide more efficient ways to access random sub-volumes of massive datasets and to process the information that typically cannot be loaded into memory. This work describes the D4M and SciDB tools that we developed and presents the initial performance results.
READ LESS

Summary

Advances in medical imaging technologies have enabled the acquisition of increasingly large datasets. Current state-of-the-art confocal or multi-photon imaging technology can produce biomedical datasets in excess of 1 TB per dataset. Typical approaches for analyzing large datasets rely on downsampling the original datasets or leveraging distributed computing resources where small...

READ MORE

Rapid sequence identification of potential pathogens using techniques from sparse linear algebra

Summary

The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing has rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck. Here we present D4RAGenS, a genetic sequence identification algorithm that exhibits the Big Data handling and computational power of the Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear algebra and statistical properties to increase computational performance while retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield speed and precision tradeoffs, with applications in biodefense and medical diagnostics. The D4RAGenS analysis algorithm is tested over several datasets, including three utilized for the Defense Threat Reduction Agency (DTRA) metagenomic algorithm contest.
READ LESS

Summary

The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing has rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck. Here we present D4RAGenS, a genetic sequence identification algorithm that exhibits the Big Data handling...

READ MORE

Using a big data database to identify pathogens in protein data space [e-print]

Summary

Current metagenomic analysis algorithms require significant computing resources, can report excessive false positives (type I errors), may miss organisms (type II errors/false negatives), or scale poorly on large datasets. This paper explores using big data database technologies to characterize very large metagenomic DNA sequences in protein space, with the ultimate goal of rapid pathogen identification in patient samples. Our approach uses the abilities of a big data databases to hold large sparse associative array representations of genetic data to extract statistical patterns about the data that can be used in a variety of ways to improve identification algorithms.
READ LESS

Summary

Current metagenomic analysis algorithms require significant computing resources, can report excessive false positives (type I errors), may miss organisms (type II errors/false negatives), or scale poorly on large datasets. This paper explores using big data database technologies to characterize very large metagenomic DNA sequences in protein space, with the ultimate...

READ MORE

Genetic sequence matching using D4M big data approaches

Published in:
HPEC 2014: IEEE Conf. on High Performance Extreme Computing, 9-11 September 2014.

Summary

Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M) - an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Acculumo database to accelerate computations one-hundred fold over other methods. Comparisons of the D4M method with the current gold-standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST.
READ LESS

Summary

Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method...

READ MORE

Development and use of a comprehensive humanitarian assessment tool in post-earthquake Haiti

Summary

This paper describes a comprehensive humanitarian assessment tool designed and used following the January 2010 Haiti earthquake. The tool was developed under Joint Task Force -- Haiti coordination using indicators of humanitarian needs to support decision making by the United States Government, agencies of the United Nations, and various non-governmental organizations. A set of questions and data collection methodology were developed by a collaborative process involving a broad segment of the Haiti humanitarian relief community and used to conduct surveys in internally displaced person settlements and surrounding communities for a four-month period starting on 15 March 2010. Key considerations in the development of the assessment tool and data collection methodology, representative analysis results, and observations from the operational use of the tool for decision making are reported. The paper concludes with lessons learned and recommendations for design and use of similar tools in the future.
READ LESS

Summary

This paper describes a comprehensive humanitarian assessment tool designed and used following the January 2010 Haiti earthquake. The tool was developed under Joint Task Force -- Haiti coordination using indicators of humanitarian needs to support decision making by the United States Government, agencies of the United Nations, and various non-governmental...

READ MORE

Measurement of aerosol-particle trajectories using a structured laser beam

Summary

What is believed to be a new concept for the measurement of micrometer-sized particle trajectories in an inlet air stream is introduced. The technique uses a light source and a mask to generate a spatial pattern of light within a volume in space. Particles traverse the illumination volume and elastically scatter light to a photodetector where the signal is recorded in time. The detected scattering waveform is decoded to find the particle trajectory. A design is presented for the structured laser beam, and the accuracy of the technique in determining particle position is demonstrated. It is also demonstrated that the structured laser beam can be used to measure and then correct for the spatially dependent instrument-response function of an optical-scattering-based particle-sizing system for aerosols.
READ LESS

Summary

What is believed to be a new concept for the measurement of micrometer-sized particle trajectories in an inlet air stream is introduced. The technique uses a light source and a mask to generate a spatial pattern of light within a volume in space. Particles traverse the illumination volume and elastically...

READ MORE