NoDB: Efficient Query Execution on Raw Data Files


Alagiannis, Ioannis, Renata Borovica, Miguel Branco, Stratos Idreos, and Anastasia Ailamaki. NoDB: Efficient Query Execution on Raw Data Files. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 241–52. SIGMOD '12. New York, NY, USA: ACM, 2012.

Summary

This article introduces NoDB, a database system that operates directly on raw data files and therefore does not require data loading. The authors discuss two straightforward approaches to querying raw data files directly:

load the data when the first relevant query arrives
integrate raw data access with query execution (i.e. parse the data on the fly, and only where the query plan requires it).

Method

NoDB minimizes the cost of querying raw data by (a) applying on-the-fly parsing, (b) creating, on demand, data structures that speed up access to raw data files, and (c) using caching techniques that eliminate the need for raw data access altogether.

On-the-fly parsing

selective tokenizing - stop tokenizing a tuple once the requested attributes have been found
selective parsing - delay the conversion of attributes until it is clear that they are part of the result set
selective tuple formation - tuples are only created if they are selected (i.e. part of the result)
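The three techniques above can be illustrated with a minimal Python sketch over delimited text lines. This is not the paper's implementation: the function `select_rows`, its parameters, and the assumption that all requested attributes are integers are hypothetical choices made for illustration only.

```python
def select_rows(lines, where_col, predicate, select_cols, delimiter=","):
    """Yield tuples of `select_cols` for rows whose `where_col` value passes `predicate`.

    Hypothetical helper; assumes the referenced attributes hold integers.
    """
    needed = max([where_col, *select_cols])
    for line in lines:
        # Selective tokenizing: split only up to the last attribute we need;
        # the remainder of the line is left untouched in a single trailing piece.
        fields = line.rstrip("\n").split(delimiter, needed + 1)[: needed + 1]
        # Selective parsing: convert only the attribute used by the predicate.
        if not predicate(int(fields[where_col])):
            continue
        # Build an output tuple (and convert its attributes) only for
        # rows that are part of the result.
        yield tuple(int(fields[c]) for c in select_cols)


rows = ["1,10,100,x\n", "2,20,200,y\n", "3,30,300,z\n"]
result = list(select_rows(rows, where_col=1, predicate=lambda v: v > 15, select_cols=[0, 2]))
```

In this sketch the fourth attribute is never tokenized or converted, and the first row never has its output attributes converted because it fails the predicate.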

Indexing

NoDB uses an adaptive positional map to reduce parsing and tokenization cost.

the map is populated on the fly and stores relative tuple positions in a table structure
positions are indexed in the order in which attributes and tuples are requested by queries
the map is also used to locate nearby attributes: if the position of the 7th attribute is known, the 5th attribute is found by tokenizing backwards from the 7th attribute's position. This (a) eliminates the need to locate line breaks and (b) cuts the tokenization effort from four steps (attributes 1 to 5) to two (attributes 7 to 5).
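The lookup strategy described above can be sketched as follows. This is a simplified illustration, not the data structure from the paper: the class `PositionalMap`, its per-row dictionary layout, and the returned step count are assumptions introduced here to make the saving visible.

```python
class PositionalMap:
    """Toy positional map: remembers attribute start offsets per row (hypothetical layout)."""

    def __init__(self, delimiter=","):
        self.delimiter = delimiter
        self.offsets = {}  # row number -> {attribute index: start offset within the line}

    def locate(self, row, line, attr):
        """Return (offset, steps) for the 0-based attribute `attr` of `line`."""
        # Attribute 0 always starts at offset 0, so it is known for free.
        known = self.offsets.setdefault(row, {0: 0})
        # Start from the closest attribute whose position is already known.
        cur = min(known, key=lambda a: abs(a - attr))
        pos = known[cur]
        steps = 0
        while cur < attr:  # forward tokenization
            pos = line.index(self.delimiter, pos) + 1
            cur += 1
            steps += 1
        while cur > attr:  # backward tokenization
            # pos - 1 is the delimiter in front of attribute `cur`;
            # search for the delimiter before it (or the line start).
            pos = line.rfind(self.delimiter, 0, pos - 1) + 1
            cur -= 1
            steps += 1
        known[attr] = pos  # populate the map on the fly
        return pos, steps


line = "a0,a1,a2,a3,a4,a5,a6,a7"
pmap = PositionalMap()
seventh = pmap.locate(0, line, 6)  # 7th attribute: tokenized forward from the line start
fifth = pmap.locate(0, line, 4)    # 5th attribute: tokenized backward from the 7th
```

Here the second lookup needs two tokenization steps instead of four, mirroring the example in the text.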

Caching and query statistics

PostgresRaw, the authors' NoDB prototype built on PostgreSQL, caches converted binary data as soon as it has been parsed and prioritizes data that is more expensive to convert (e.g. numerical attributes over plain ASCII strings). The system also creates statistics on the fly to optimize access to the data.
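The prioritization idea can be sketched as a small cache that evicts the cheapest-to-reconvert column first. This is an assumption-laden illustration rather than PostgresRaw's actual policy: the class `ColumnCache`, the capacity measured in columns, and the relative weights in `CONVERSION_COST` are all hypothetical.

```python
# Assumed relative conversion costs; the real system's costs are not given in this summary.
CONVERSION_COST = {"str": 1, "int": 5, "float": 10}


class ColumnCache:
    """Toy cache of converted columns that keeps expensive conversions longest."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of cached columns (hypothetical unit)
        self.columns = {}         # column name -> (kind, converted values)

    def put(self, name, kind, values):
        # Cache converted values as soon as they are available.
        self.columns[name] = (kind, values)
        while len(self.columns) > self.capacity:
            # Evict the column that is cheapest to convert again from raw text.
            victim = min(self.columns, key=lambda n: CONVERSION_COST[self.columns[n][0]])
            del self.columns[victim]

    def get(self, name):
        entry = self.columns.get(name)
        return None if entry is None else entry[1]


cache = ColumnCache(capacity=2)
cache.put("name", "str", ["ann", "bob"])
cache.put("price", "float", [1.5, 2.5])
cache.put("qty", "int", [3, 4])  # exceeds capacity: the string column is evicted
```

After the third insertion the string column is dropped while both numeric columns stay cached, so a later query on `name` would have to go back to the raw file.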


Albert Weichselbraun
