D4M and large array databases for management and analysis of large biomedical imaging data
January 28, 2016
Advances in medical imaging technologies have enabled the acquisition of increasingly large datasets. Current state-of-the-art confocal or multi-photon imaging technology can produce biomedical datasets in excess of 1 TB per dataset. Typical approaches for analyzing large datasets rely on downsampling the original datasets or leveraging distributed computing resources where small subsets of images are processed independently. These approaches require significant overhead on the part of the programmer to load the desired sub-volume from an array of image files into memory. Databases are well suited for indexing and retrieving components of very large datasets and show significant promise for the analysis of 3D volumetric images. In particular, array-based databases such as SciDB utilize an architecture that supports massive parallel processing while also providing database services such as data management and fast parallel queries. In this paper, we will present a new set of tools that leverage the D4M (Dynamic Distributed Dimensional Data Model) toolbox for analyzing giga-voxel biomedical datasets. By combining SciDB and the D4M toolbox, we demonstrate that we can access large volumetric data and perform large-scale bioinformatics analytics efficiently and interactively. We show that it is possible to achieve an ingest rate of 2.8 million entries per second for importing large datasets into SciDB. These tools provide more efficient ways to access random sub-volumes of massive datasets and to process the information that typically cannot be loaded into memory. This work describes the D4M and SciDB tools that we developed and presents the initial performance results.