Data Wrangling

A key goal of the Sustain effort is to harness the wealth of information available in voluminous, disparate datasets by supporting a rich set of data exploration, discovery, visualization, and model validation operators.

Our data federation schemes reconcile data-induced challenges: heterogeneity, volumes, and storage mechanisms. All operators work across contiguous, disparate, or overlapping spatiotemporal scales. Additionally, a fluent interface allows chaining operators to formulate complex analyses over subsets of data. To ensure timeliness, we manage the speed differential of the memory hierarchy, disperse loads and avoid I/O hotspots, preserve data locality during processing, and avoid disk and CPU contention.
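To illustrate the idea of chaining operators, the following is a minimal sketch of a fluent interface over in-memory observations. The class and method names (Observation, Query, within_bbox, between, where) are hypothetical and chosen only for illustration; they are not the project's actual API, and a real deployment would evaluate such a chain against distributed storage rather than a Python list.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Observation:
    lat: float
    lon: float
    timestamp: int  # epoch seconds
    value: float


Predicate = Callable[[Observation], bool]


class Query:
    """Hypothetical fluent query: every operator returns a new Query."""

    def __init__(self, source: Sequence[Observation],
                 filters: Tuple[Predicate, ...] = ()):
        self._source = source
        self._filters = filters

    def _with(self, predicate: Predicate) -> "Query":
        # Queries are immutable; chaining builds a new one each time.
        return Query(self._source, self._filters + (predicate,))

    def within_bbox(self, south: float, west: float,
                    north: float, east: float) -> "Query":
        return self._with(
            lambda o: south <= o.lat <= north and west <= o.lon <= east)

    def between(self, start: int, end: int) -> "Query":
        # Half-open time window: start <= timestamp < end.
        return self._with(lambda o: start <= o.timestamp < end)

    def where(self, predicate: Predicate) -> "Query":
        return self._with(predicate)

    def run(self) -> List[Observation]:
        # Nothing is evaluated until run() is called.
        return [o for o in self._source
                if all(f(o) for f in self._filters)]

    def mean(self) -> float:
        rows = self.run()
        return sum(o.value for o in rows) / len(rows) if rows else float("nan")


# Made-up sample data, for illustration only.
observations = [
    Observation(40.58, -105.08, 100, 12.0),
    Observation(40.60, -105.10, 200, 18.0),
    Observation(39.74, -104.99, 150, 30.0),
    Observation(40.59, -105.07, 900, 50.0),
]

result = (Query(observations)
          .within_bbox(40.0, -106.0, 41.0, -105.0)
          .between(0, 500)
          .where(lambda o: o.value > 10.0))
selected = result.run()
```

Because each operator only records a predicate, the chain stays cheap to build and the spatial, temporal, and attribute constraints are applied together in a single pass when run() or mean() is called.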

A key feature we support is overlay of datasets. This involves fusion of datasets based on spatial and chronological attributes. An example of such an operation is to overlay topographical information such as roads or natural boundaries on observed phenomena – disease clusters, for instance.
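As a rough sketch of such an overlay, the example below assigns point observations (for instance, reported disease cases) to region polygons and restricts them to a time window. It assumes planar coordinates and uses a plain ray-casting test with a linear scan over regions; the function names and the sample data are hypothetical, and a production system would more likely rely on a spatial index and a geometry library.

```python
from typing import Dict, Iterable, Sequence, Tuple

Point = Tuple[float, float]          # (x, y), e.g. (longitude, latitude)
Case = Tuple[float, float, int]      # (x, y, timestamp)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test; polygon is a vertex list (not explicitly closed)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def overlay(cases: Iterable[Case],
            regions: Dict[str, Sequence[Point]],
            start: int, end: int) -> Dict[str, int]:
    """Count cases per region for start <= timestamp < end."""
    counts = {name: 0 for name in regions}
    for x, y, t in cases:
        if not (start <= t < end):
            continue
        for name, polygon in regions.items():
            if point_in_polygon((x, y), polygon):
                counts[name] += 1
                break  # assign each case to at most one region
    return counts


# Made-up sample data, for illustration only.
regions = {
    "block_a": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "block_b": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
cases = [(1, 1, 10), (0.5, 1.5, 20), (3, 1, 15), (3, 0.5, 99), (5, 5, 12)]

counts = overlay(cases, regions, start=0, end=50)
```

Running the same overlay with two different time windows gives a simple way to contrast a region at different points in time.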

Our data wrangling schemes include reconciling data formats, performing spatiotemporal data alignments, addressing issues of scale via sketching, and indexing the data so that it is amenable to querying and visualization. We provide support for feature class datasets (including the ESRI shapefile format), such as city block polygons, roads, power lines, and rivers. Overlays can also be used to contrast regions at different points in time. We have ingested, or are in the process of ingesting, several spatiotemporal datasets relating to urban systems.

These datasets come in different formats (CSV, tab-delimited text, GeoTIFF, JPEG, JSON, GeoJSON, GRIB, XML, ESRI geodatabases, ESRI shapefiles) and originate from NGOs and state/federal agencies. Furthermore, each dataset includes a large number of features and may be available at different granularities.
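To make format reconciliation concrete, here is a small sketch that normalizes two of these formats, CSV and GeoJSON, into one common record layout. The record layout (lat, lon, time, attrs), the column names, and the sample values are assumptions made for illustration; only Point geometries are handled, and the other listed formats would need their own readers.

```python
import csv
import io
import json
from typing import Any, Dict, Iterator

Record = Dict[str, Any]
CORE_FIELDS = ("lat", "lon", "time")


def from_csv(text: str) -> Iterator[Record]:
    """Assumes columns named lat, lon, and time; other columns become attrs."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "lat": float(row["lat"]),
            "lon": float(row["lon"]),
            "time": row["time"],
            # CSV carries no types, so attribute values stay strings here.
            "attrs": {k: v for k, v in row.items() if k not in CORE_FIELDS},
        }


def from_geojson(text: str) -> Iterator[Record]:
    """Assumes a FeatureCollection of Point features with a 'time' property."""
    for feature in json.loads(text)["features"]:
        # GeoJSON positions are ordered longitude, latitude.
        lon, lat = feature["geometry"]["coordinates"][:2]
        props = dict(feature["properties"])
        time = props.pop("time")
        yield {"lat": lat, "lon": lon, "time": time, "attrs": props}


# Made-up sample inputs, for illustration only.
csv_text = "lat,lon,time,pm25\n40.58,-105.08,2020-01-01T00:00:00Z,8.5\n"
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-104.99, 39.74]},
        "properties": {"time": "2020-01-01T01:00:00Z", "population": 1200},
    }],
})

records = list(from_csv(csv_text)) + list(from_geojson(geojson_text))
```

Once every source is mapped to the same layout, downstream steps such as alignment, indexing, and overlay can be written once rather than per format.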

Our data wrangling schemes are continually being enriched. We now include support for constructs such as RDDs, DataFrames, and Datasets so that the data can be processed in parallel on Spark clusters.