Publications


Large-scale Bayesian kinship analysis

Summary

Kinship prediction in forensics is limited to first-degree relatives due to the small number of short tandem repeat loci characterized. The Genetic Chain Rule for Probabilistic Kinship Estimation can leverage large panels of single nucleotide polymorphisms (SNPs) or sets of sequence-linked SNPs, called haploblocks, to estimate more distant relationships between individuals. This method uses allele frequencies and Markov Chain Monte Carlo methods to determine kinship probabilities. Allele frequencies are a crucial input to this method. Since these frequencies are estimated from finite populations and many alleles are rare, a Bayesian extension to the algorithm has been developed to determine credible intervals for kinship estimates as a function of the certainty in allele frequency estimates. Generating sufficiently large sample sets to accurately estimate credible intervals can require significant computational resources. In this paper, we leverage hundreds of compute cores to generate large numbers of Dirichlet random samples for Bayesian kinship prediction. We show that it is possible to generate 2,097,152 random samples on 32,768 cores at a rate of 29.68 samples per second. The ability to generate extremely large numbers of samples enables the computation of more statistically significant results from a Bayesian approach to kinship analysis.
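
As a rough illustration of the Dirichlet-sampling step described above, the sketch below draws allele-frequency vectors from a Dirichlet posterior and propagates them through a toy single-locus statistic to obtain a credible interval. The allele counts, the flat prior, and the statistic itself are illustrative assumptions, not the paper's Genetic Chain Rule implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical allele counts observed at one locus, plus a flat Dirichlet prior.
observed_counts = np.array([412, 95, 38, 5])
posterior_alpha = observed_counts + 1

# Draw many allele-frequency vectors from the Dirichlet posterior.
freqs = rng.dirichlet(posterior_alpha, size=100_000)

# Toy single-locus likelihood-ratio-style statistic for two alleles of interest;
# it stands in for the full multi-locus chain-rule product used in the paper.
p, q = freqs[:, 0], freqs[:, 1]
stats = (p + q) / (4.0 * p * q)

# Credible interval induced by the uncertainty in the allele frequencies.
lo, hi = np.percentile(stats, [2.5, 97.5])
print(f"95% credible interval for the statistic: [{lo:.3f}, {hi:.3f}]")
```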

Detecting pathogen exposure during the non-symptomatic incubation period using physiological data

Summary

Early pathogen exposure detection allows better patient care and faster implementation of public health measures (patient isolation, contact tracing). Existing exposure detection most frequently relies on overt clinical symptoms, namely fever, during the infectious prodromal period. We have developed a robust machine learning-based method to better detect asymptomatic states during the incubation period using subtle, sub-clinical physiological markers. Starting with high-resolution physiological waveform data from non-human primate studies of viral (Ebola, Marburg, Lassa, and Nipah viruses) and bacterial (Y. pestis) exposure, we processed the data to reduce short-term variability and normalize diurnal variations, then provided these to a supervised random forest classification algorithm and a post-classifier declaration logic step to reduce false alarms. In most subjects, detection is achieved well before the onset of fever; subject cross-validation across exposure studies (varying viruses, exposure routes, animal species, and target dose) leads to 51h mean early detection (at 0.93 area under the receiver-operating characteristic curve [AUCROC]). Evaluating the algorithm against entirely independent datasets for Lassa, Nipah, and Y. pestis exposures unused in algorithm training and development yields a mean 51h early warning time (at AUCROC=0.95). We discuss which physiological indicators are most informative for early detection and options for extending this capability to limited datasets such as those available from wearable, non-invasive, ECG-based sensors.
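
The sketch below illustrates the classify-then-declare pattern the summary describes, using scikit-learn's RandomForestClassifier on synthetic window features and a simple "k consecutive positive windows" rule as a stand-in for the post-classifier declaration logic. The feature layout, thresholds, and window counts are assumptions, not the study's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Pretend each row is one time window of preprocessed physiological features
# (e.g., heart rate, temperature, ECG-derived measures after normalization).
X_train = rng.normal(size=(2000, 8))
y_train = rng.integers(0, 2, size=2000)      # 0 = baseline, 1 = post-exposure
X_test = rng.normal(size=(300, 8))

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]     # per-window exposure probability

def declare(scores, threshold=0.8, k=3):
    """Raise an alarm only after k consecutive windows exceed the threshold,
    a simple post-classifier rule to suppress isolated false positives."""
    run = 0
    for i, s in enumerate(scores):
        run = run + 1 if s >= threshold else 0
        if run >= k:
            return i                          # index of the declaring window
    return None

print("first declaration window:", declare(scores))
```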

A cloud-based brain connectivity analysis tool

Summary

With advances in high throughput brain imaging at the cellular and sub-cellular level, there is growing demand for platforms that can support high performance, large-scale brain data processing and analysis. In this paper, we present a novel pipeline that combines Accumulo, D4M, geohashing, and parallel programming to manage large-scale neuron connectivity graphs in a cloud environment. Our brain connectivity graph is represented using vertices (fiber start/end nodes), edges (fiber tracks), and the 3D coordinates of the fiber tracks. For optimal performance, we take the hybrid approach of storing vertices and edges in Accumulo and saving the fiber track 3D coordinates in flat files. Accumulo database operations offer low latency on sparse queries while flat files offer high throughput for storing, querying, and analyzing bulk data. We evaluated our pipeline by using 250 gigabytes of mouse neuron connectivity data. Benchmarking experiments on retrieving vertices and edges from Accumulo demonstrate that we can achieve a 1-2 orders of magnitude speedup in retrieval time compared to the same operation on traditional flat files. The implementation of graph analytics such as Breadth First Search using Accumulo and D4M offers consistently good performance regardless of data size and density, and is thus scalable to very large datasets. Indexing of neuron subvolumes is simple and logical with geohashing-based binary tree encoding. This hybrid data management backend is used to drive an interactive web-based 3D graphical user interface, where users can examine the 3D connectivity map in a Google Maps-like viewer. Our pipeline is scalable and extensible to other data modalities.
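
For the graph-analytics piece, the sketch below expresses breadth-first search as repeated sparse matrix-vector products, the formulation D4M encourages; it runs against an in-memory SciPy matrix rather than the Accumulo-backed tables used in the paper, and the toy edge list is an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy adjacency matrix: vertices are fiber start/end nodes, nonzeros are fiber tracks.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 4)]
rows, cols = zip(*edges)
n = 5
A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))
A = A + A.T                                   # treat the graph as undirected

def bfs_levels(A, source):
    """Return the BFS level of every vertex reachable from `source`,
    expanding one frontier per sparse matrix-vector product."""
    frontier = np.zeros(A.shape[0])
    frontier[source] = 1
    visited = frontier.copy()
    levels = {source: 0}
    level = 0
    while frontier.any():
        level += 1
        frontier = (A @ frontier > 0) & (visited == 0)   # expand, drop visited
        for v in np.flatnonzero(frontier):
            levels[v] = level
        visited += frontier
        frontier = frontier.astype(float)
    return levels

print(bfs_levels(A, source=0))
```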

A linear algebra approach to fast DNA mixture analysis using GPUs

Summary

Analysis of DNA samples is an important step in forensics, and the speed of analysis can impact investigations. Comparison of DNA sequences is based on the analysis of short tandem repeats (STRs), which are short DNA sequences of 2-5 base pairs. Current forensics approaches use 20 STR loci for analysis. The use of single nucleotide polymorphisms (SNPs) has utility for analysis of complex DNA mixtures. The use of tens of thousands of SNP loci for analysis poses significant computational challenges because the forensic analysis scales with the product of the number of loci and the number of DNA samples to be analyzed. In this paper, we discuss the implementation of a DNA sequence comparison algorithm by re-casting the algorithm in terms of linear algebra primitives. By developing an overloaded matrix multiplication approach to DNA comparisons, we can leverage advances in GPU hardware and algorithms for Dense Generalized Matrix-Multiply (DGEMM) to speed up DNA sample comparisons. We show that it is possible to compare 2048 unknown DNA samples with 20 million known samples in under 6 seconds using an NVIDIA K80 GPU.
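
The sketch below casts profile comparison as a single matrix multiply, as the summary describes, using NumPy on the CPU; the paper maps the same product onto GPU DGEMM. The 0/1 allele encoding and the panel sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_loci = 1000                    # stand-in for tens of thousands of SNP loci
n_known, n_unknown = 5000, 64

# Encode each profile as a 0/1 vector over SNP alleles.
known = rng.integers(0, 2, size=(n_known, n_loci)).astype(np.float32)
unknown = rng.integers(0, 2, size=(n_unknown, n_loci)).astype(np.float32)

# One dense matrix multiply scores every unknown against every known profile:
# entry (i, j) counts the alleles shared by unknown i and known j.
shared = unknown @ known.T

best_match = shared.argmax(axis=1)
print("best-matching known profile for each unknown:", best_match[:10])
```

Swapping the NumPy arrays for a GPU array library that routes the product through DGEMM is the kind of substitution the paper exploits for speed.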

Detecting virus exposure during the pre-symptomatic incubation period using physiological data

Summary

Early pathogen exposure detection allows better patient care and faster implementation of public health measures (patient isolation, contact tracing). Existing exposure detection most frequently relies on overt clinical symptoms, namely fever, during the infectious prodromal period. We have developed a robust machine learning method to better detect asymptomatic states during the incubation period using subtle, sub-clinical physiological markers. Using high-resolution physiological data from non-human primate studies of Ebola and Marburg viruses, we pre-processed the data to reduce short-term variability and normalize diurnal variations, then provided these to a supervised random forest classification algorithm. In most subjects, detection is achieved well before the onset of fever; subject cross-validation leads to 52±14h mean early detection (at >0.90 area under the receiver-operating characteristic curve). Cross-cohort tests across pathogens and exposure routes also lead to successful early detection (28±16h and 43±22h, respectively). We discuss which physiological indicators are most informative for early detection and options for extending this capability to lower data resolution and wearable, non-invasive sensors.
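
The sketch below illustrates the preprocessing stage the summary mentions: smoothing short-term variability and removing a diurnal baseline before classification. The one-minute sampling, 30-minute rolling median, and hour-of-day baseline are assumptions, not the study's exact pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
idx = pd.date_range("2024-01-01", periods=7 * 24 * 60, freq="min")   # one week at 1-min resolution
diurnal = 10 * np.sin(2 * np.pi * np.asarray(idx.hour) / 24)          # synthetic circadian swing
heart_rate = 80 + diurnal + rng.normal(scale=5, size=len(idx))

series = pd.Series(heart_rate, index=idx)

# 1) Reduce short-term variability with a rolling median.
smoothed = series.rolling("30min").median()

# 2) Normalize diurnal variation: subtract the mean value for each hour of day.
hourly_baseline = smoothed.groupby(smoothed.index.hour).transform("mean")
normalized = smoothed - hourly_baseline

print(normalized.describe())
```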

Benchmarking SciDB data import on HPC systems

Summary

SciDB is a scalable, computational database management system that uses an array model for data storage. The array data model of SciDB makes it ideally suited for storing and managing large amounts of imaging data. SciDB is designed to support advanced in-database analytics, reducing the need to extract data for analysis. It is designed to be massively parallel and can run on commodity hardware in a high performance computing (HPC) environment. In this paper, we present the performance of SciDB using simulated image data. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a cluster running the MIT SuperCloud software stack. A peak performance of 2.2M database inserts per second was achieved on a single node of this system. We also show that SciDB and the D4M toolbox provide more efficient ways to access random sub-volumes of massive datasets compared to the traditional approaches of reading volumetric data from individual files. This work describes the D4M and SciDB tools we developed and presents the initial performance results. This performance was achieved by using parallel inserts, in-database merging of arrays, and supercomputing techniques such as distributed arrays and single-program-multiple-data programming.
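
The sketch below shows the shape of such an ingest benchmark: simulated entries are split into batches, inserted from multiple worker processes, and timed to report an aggregate insert rate. The ingest_batch function is a hypothetical placeholder for the D4M/SciDB insert call, not a real SciDB API, and the batch and worker counts are assumptions.

```python
import time
import numpy as np
from multiprocessing import Pool

def ingest_batch(batch):
    """Placeholder for a database insert of one batch of (row, col, value) entries."""
    time.sleep(0.001)                         # simulate insert latency
    return len(batch)

def make_entries(n):
    rng = np.random.default_rng(4)
    return list(zip(rng.integers(0, 10_000, n),
                    rng.integers(0, 10_000, n),
                    rng.random(n)))

if __name__ == "__main__":
    entries = make_entries(200_000)
    batches = [entries[i:i + 10_000] for i in range(0, len(entries), 10_000)]

    start = time.perf_counter()
    with Pool(processes=8) as pool:           # SPMD-style: same insert code, many workers
        counts = pool.map(ingest_batch, batches)
    elapsed = time.perf_counter() - start

    print(f"inserted {sum(counts):,} entries in {elapsed:.2f} s "
          f"({sum(counts) / elapsed:,.0f} inserts/s)")
```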

D4M and large array databases for management and analysis of large biomedical imaging data

Summary

Advances in medical imaging technologies have enabled the acquisition of increasingly large datasets. Current state-of-the-art confocal or multi-photon imaging technology can produce biomedical datasets in excess of 1 TB per dataset. Typical approaches for analyzing large datasets rely on downsampling the original datasets or leveraging distributed computing resources where small subsets of images are processed independently. These approaches require significant overhead on the part of the programmer to load the desired sub-volume from an array of image files into memory. Databases are well suited for indexing and retrieving components of very large datasets and show significant promise for the analysis of 3D volumetric images. In particular, array-based databases such as SciDB utilize an architecture that supports massive parallel processing while also providing database services such as data management and fast parallel queries. In this paper, we will present a new set of tools that leverage the D4M (Dynamic Distributed Dimensional Data Model) toolbox for analyzing giga-voxel biomedical datasets. By combining SciDB and the D4M toolbox, we demonstrate that we can access large volumetric data and perform large-scale bioinformatics analytics efficiently and interactively. We show that it is possible to achieve an ingest rate of 2.8 million entries per second for importing large datasets into SciDB. These tools provide more efficient ways to access random sub-volumes of massive datasets and to process the information that typically cannot be loaded into memory. This work describes the D4M and SciDB tools that we developed and presents the initial performance results.
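
The sketch below illustrates sub-volume access against a chunk-addressable on-disk array, using a NumPy memmap as a stand-in for the SciDB/D4M stack described in the paper; only the requested region is read into memory, in contrast to loading whole image files. The volume shape, dtype, and file name are assumptions.

```python
import numpy as np

# Hypothetical volumetric dataset on disk (x, y, z); a stand-in for a giga-voxel volume.
shape = (1024, 1024, 256)
volume = np.memmap("volume.dat", dtype=np.uint16, mode="w+", shape=shape)

# Write a small region so the example has data, then flush to disk.
volume[100:110, 200:210, 50:60] = 1000
volume.flush()

# Random sub-volume query: only the requested region is read into memory.
sub = np.array(volume[96:160, 192:256, 48:64])
print(sub.shape, sub.mean())
```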

Rapid sequence identification of potential pathogens using techniques from sparse linear algebra

Summary

The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing have rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck. Here we present D4RAGenS, a genetic sequence identification algorithm that exhibits the Big Data handling and computational power of the Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear algebra and statistical properties to increase computational performance while retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield speed and precision tradeoffs, with applications in biodefense and medical diagnostics. The D4RAGenS analysis algorithm is tested over several datasets, including three utilized for the Defense Threat Reduction Agency (DTRA) metagenomic algorithm contest.
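
The sketch below illustrates the subsample-then-score idea: keep a fraction of the reads and compare them to reference sequences by shared k-mers held in sparse matrices. The k-mer length, subsampling fraction, synthetic sequences, and scoring are assumptions, not the D4RAGenS algorithm itself.

```python
import numpy as np
from scipy.sparse import csr_matrix

def kmers(seq, k=10):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

rng = np.random.default_rng(5)
bases = np.array(list("ACGT"))
refs = ["".join(rng.choice(bases, 500)) for _ in range(3)]       # toy reference sequences
reads = ["".join(rng.choice(bases, 100)) for _ in range(1000)]   # toy sample reads

# "Fast"-style subsampling: keep only a fraction of the reads.
keep = rng.random(len(reads)) < 0.2
sample = [r for r, k_ in zip(reads, keep) if k_]

# Shared k-mer vocabulary, then sparse read-by-kmer and reference-by-kmer matrices.
vocab = {km: j for j, km in enumerate(sorted(set().union(*(kmers(r) for r in refs))))}

def incidence(seqs):
    rows, cols = [], []
    for i, s in enumerate(seqs):
        for km in kmers(s) & vocab.keys():
            rows.append(i)
            cols.append(vocab[km])
    return csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(seqs), len(vocab)))

scores = incidence(sample) @ incidence(refs).T   # shared-k-mer counts per (read, reference)
print("reads hitting each reference:", (scores.toarray() > 0).sum(axis=0))
```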

Using a big data database to identify pathogens in protein data space [e-print]

Summary

Current metagenomic analysis algorithms require significant computing resources, can report excessive false positives (type I errors), may miss organisms (type II errors/false negatives), or scale poorly on large datasets. This paper explores using big data database technologies to characterize very large metagenomic DNA sequences in protein space, with the ultimate goal of rapid pathogen identification in patient samples. Our approach uses the ability of big data databases to hold large sparse associative array representations of genetic data to extract statistical patterns about the data that can be used in a variety of ways to improve identification algorithms.
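
The sketch below shows the sparse associative-array idea in protein space: rows are organisms, columns are peptide k-mers, and a simple column statistic (how many organisms contain each k-mer) is the kind of pattern that can down-weight uninformative k-mers. The toy protein sequences and k = 4 are assumptions; the paper holds the analogous array in a big data database rather than in memory.

```python
from collections import Counter, defaultdict

# Toy protein sequences standing in for translated metagenomic data.
proteins = {
    "orgA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "orgB": "MKTAYIAKQRQISFVKSHFARELGLIEVQAPIL",
    "orgC": "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS",
}

k = 4
assoc = defaultdict(Counter)            # assoc[organism][peptide k-mer] -> count
for name, seq in proteins.items():
    for i in range(len(seq) - k + 1):
        assoc[name][seq[i:i + k]] += 1

# Column statistic: in how many organisms does each peptide k-mer occur?
doc_freq = Counter()
for name in assoc:
    doc_freq.update(assoc[name].keys())

shared = [km for km, n in doc_freq.items() if n > 1]
print(f"{len(shared)} peptide {k}-mers shared by more than one organism:", shared[:5])
```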

Genetic sequence matching using D4M big data approaches

Published in:
HPEC 2014: IEEE Conf. on High Performance Extreme Computing, 9-11 September 2014.

Summary

Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M) - an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Accumulo database to accelerate computations one hundred-fold over other methods. Comparisons of the D4M method with the current gold standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST.
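
The sketch below captures the associative-array flavor of the matching step: reference sequences are indexed by their 10-mers in a dict-of-sets (a stand-in for a D4M associative array backed by Accumulo), and a read is scored by joining on its 10-mers. The toy sequences are assumptions, and this omits the paper's statistical scoring and BLAST comparison.

```python
from collections import Counter, defaultdict

def kmers(seq, k=10):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

references = {
    "refA": "ACGTACGTTTGACCAGTTGACCAGGATTACAGGATTACAT",
    "refB": "TTGACCAGGATTACATCCGGAAATCCGGAAATACGTACGT",
}

# "Associative array": 10-mer -> set of reference IDs containing it.
index = defaultdict(set)
for name, seq in references.items():
    for km in kmers(seq):
        index[km].add(name)

def match(read):
    """Count shared 10-mers per reference -- the join that D4M expresses as a
    sparse matrix product."""
    hits = Counter()
    for km in kmers(read):
        for name in index.get(km, ()):
            hits[name] += 1
    return hits

print(match("GACCAGGATTACATCCGGAAAT"))
```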
