AI data wrangling with associative arrays [e-print]

January 18, 2020

Abstract

Author:

Jeremy Kepner

…

Published in:

Submitted to Northeast Database Day, NEDB 2020, https://arxiv.org/abs/2001.06731

R&D Area:

Cyber Security and Information Sciences

R&D Group:

Lincoln Laboratory Supercomputing Center

Summary

The AI revolution is data driven. AI "data wrangling" is the process by which unusable data is transformed to support AI algorithm development (training) and deployment (inference). Significant time is devoted to translating among the diverse data representations that support the many query and analysis steps found in an AI pipeline. Rigorous mathematical representations of these data enable data translation and analysis optimization within and across steps. Associative array algebra provides a mathematical foundation that naturally describes the tabular structures and set mathematics that are the basis of databases. Likewise, the matrix operations and corresponding inference/training calculations used by neural networks are also well described by associative arrays. More surprisingly, a general denormalized form of hierarchical formats, such as XML and JSON, can be readily constructed. Finally, pivot tables, which are among the most widely used data analysis tools, naturally emerge from associative array constructors. A common foundation in associative arrays provides interoperability guarantees, proving that their operations are linear systems with rigorous mathematical properties, such as associativity, commutativity, and distributivity, that are critical to reordering optimizations.
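To make the summary concrete, the following is a minimal sketch of an associative array, assuming a sparse mapping from (row key, column key) pairs to numbers under ordinary addition and multiplication. The class name `Assoc`, its methods, and the sample records are hypothetical and chosen for illustration; they are not taken from the paper and are not the API of any existing library. The sketch shows how a constructor that sums colliding keys yields a pivot table, and how element-wise addition and array multiplication can be checked for commutativity and distributivity.

```python
from collections import defaultdict


class Assoc:
    """Toy associative array: maps (row_key, col_key) pairs to nonzero numbers."""

    def __init__(self, triples=()):
        summed = defaultdict(float)
        for row, col, val in triples:
            # Colliding (row, col) keys are summed, as in a pivot table.
            summed[(row, col)] += val
        # Keep only nonzero entries so the representation stays sparse.
        self.data = {key: val for key, val in summed.items() if val != 0}

    def __eq__(self, other):
        return isinstance(other, Assoc) and self.data == other.data

    def __add__(self, other):
        # Element-wise addition over the union of keys.
        entries = list(self.data.items()) + list(other.data.items())
        return Assoc((r, c, v) for (r, c), v in entries)

    def __matmul__(self, other):
        # Array multiplication: match this array's column keys
        # against the other array's row keys.
        by_row = defaultdict(list)
        for (k, c), w in other.data.items():
            by_row[k].append((c, w))
        return Assoc(
            (r, c, v * w)
            for (r, k), v in self.data.items()
            for c, w in by_row.get(k, ())
        )


# Hypothetical purchase records as (customer, item, quantity) triples.
records = [
    ("alice", "apples", 2),
    ("alice", "pears", 1),
    ("bob", "apples", 3),
    ("alice", "apples", 4),  # repeated key: summed by the constructor
]
pivot = Assoc(records)

usd = Assoc([("apples", "usd", 0.5), ("pears", "usd", 0.25)])
eur = Assoc([("apples", "eur", 0.5)])

cost = pivot @ usd
commutes = (usd + eur) == (eur + usd)
distributes = pivot @ (usd + eur) == (pivot @ usd) + (pivot @ eur)
```

Here `pivot` plays the role of a pivot table built directly from the constructor, and `cost` is the product of a customer-by-item array with an item-by-currency array. The two Boolean checks illustrate, for this one example only, the kind of algebraic property that permits reordering operations in a pipeline; they are not a proof.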