NoDB: Efficient Query Execution on Raw Data Files

Alagiannis, Ioannis, Renata Borovica, Miguel Branco, Stratos Idreos, and Anastasia Ailamaki. NoDB: Efficient Query Execution on Raw Data Files. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 241–52. SIGMOD '12. New York, NY, USA: ACM, 2012.

Summary

This article introduces NoDB, a database system that operates directly on raw data files and requires no data loading. The authors discuss two straightforward approaches to querying raw data files directly:

  1. load the data once the first relevant query arrives
  2. integrate raw data access with query execution (i.e. parse the data on the fly, and only when the query plan requires them).

Method

NoDB minimizes the cost of querying raw data by (a) parsing on the fly, (b) building, on demand, data structures that speed up access to raw data files, and (c) using caching techniques that eliminate the need for raw data access altogether.

On-the-fly parsing

  1. selective tokenizing - stop tokenizing a tuple once the requested attributes have been found
  2. selective parsing - delay the conversion of attributes until it is clear that they are part of the result set
  3. selective tuple formation - tuples are only created if they are selected (i.e. part of the result)
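
The three steps above can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: it assumes a comma-delimited file with integer attributes, and the function name `select_rows` and its parameters are hypothetical.

```python
def select_rows(lines, needed, predicate_col, predicate, delimiter=","):
    """Yield tuples of the `needed` columns for rows that pass `predicate`.

    Hypothetical helper; assumes every attribute is an integer.
    """
    last = max(list(needed) + [predicate_col])
    for line in lines:
        # Selective tokenizing: split only up to the last attribute we need;
        # the rest of the line stays one untouched string and is dropped.
        fields = line.rstrip("\n").split(delimiter, last + 1)[: last + 1]
        # Selective parsing: convert only the predicate attribute first.
        if not predicate(int(fields[predicate_col])):
            continue
        # Selective tuple formation: build the output tuple (and convert the
        # remaining attributes) only for rows that qualify.
        yield tuple(int(fields[i]) for i in needed)


rows = ["1,20,300,4000\n", "2,5,600,8000\n"]
result = list(select_rows(rows, needed=[0, 2], predicate_col=1,
                          predicate=lambda v: v > 10))
```

Here the fourth attribute is never tokenized on its own, and the second row is rejected after converting a single attribute.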

Indexing

NoDB uses an adaptive positional map to reduce parsing and tokenization cost.

  1. the map is populated on the fly and stores the relative positions of attributes within tuples in a table structure
  2. entries are added in the order in which attributes and tuples are requested by queries
  3. the stored positions are also used to locate nearby attributes (e.g. if the position of the 7th attribute is known, the map is used to locate the 5th attribute by tokenizing backwards from the 7th attribute's position; this (a) eliminates the need to locate line breaks and (b) cuts the tokenization effort from four attributes (1-5) to two (7-5)).
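
A minimal sketch of the idea, assuming comma-delimited lines that have already been split off (so it shows only the tokenization saving, not the line-break saving). The class `PositionalMap`, its dictionary layout, and its method names are assumptions for illustration; the paper's actual structure differs.

```python
class PositionalMap:
    """Remembers where attributes start within each line (hypothetical sketch)."""

    def __init__(self, delimiter=","):
        self.delimiter = delimiter
        self.offsets = {}  # row number -> {attribute index: start offset in line}

    def locate(self, row, line, attr):
        # Attribute 0 always starts at offset 0.
        known = self.offsets.setdefault(row, {0: 0})
        if attr not in known:
            # Start from the closest attribute whose position is already known.
            idx = min(known, key=lambda a: abs(a - attr))
            pos = known[idx]
            while idx < attr:  # tokenize forwards
                pos = line.index(self.delimiter, pos) + 1
                idx += 1
                known[idx] = pos
            while idx > attr:  # tokenize backwards
                # The delimiter at pos - 1 ends the previous attribute; the
                # one before it (if any) marks where that attribute starts.
                pos = line.rfind(self.delimiter, 0, pos - 1) + 1
                idx -= 1
                known[idx] = pos
        return known[attr]

    def read(self, row, line, attr):
        start = self.locate(row, line, attr)
        end = line.find(self.delimiter, start)
        return line[start:] if end == -1 else line[start:end]


line = "a,bb,ccc,dddd,e,ff,g"
pmap = PositionalMap()
seventh = pmap.read(0, line, 6)         # first query populates the map

seeded = PositionalMap()
seeded.offsets[0] = {0: 0, 6: 19}       # pretend only the 7th attribute is known
fifth = seeded.read(0, line, 4)         # found by stepping back two attributes
```

In the seeded case, the lookup of the 5th attribute starts at the 7th and scans backwards over two delimiters instead of forwards over four.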

Caching and query statistics

PostgresRaw, the NoDB prototype built on PostgreSQL, caches converted binary data immediately and prioritizes data that is more expensive to convert (e.g. numeric attributes over character strings). The system also creates query statistics on the fly to optimize access to the data.
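
The prioritization can be illustrated with a small cache that, when full, drops the column that is cheapest to convert again. This is a simplified sketch under assumptions: the class `ConversionCache`, the per-column capacity, and the numeric cost values are hypothetical and not taken from PostgresRaw.

```python
class ConversionCache:
    """Keeps converted columns, evicting the cheapest-to-reconvert first."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of cached columns (assumed unit)
        self.columns = {}         # column name -> (conversion cost, binary values)

    def put(self, name, cost, values):
        self.columns[name] = (cost, values)
        while len(self.columns) > self.capacity:
            # Evict the column with the lowest conversion cost.
            cheapest = min(self.columns, key=lambda n: self.columns[n][0])
            del self.columns[cheapest]

    def get(self, name):
        entry = self.columns.get(name)
        return None if entry is None else entry[1]


cache = ConversionCache(capacity=2)
cache.put("name", cost=1, values=["alice", "bob"])  # strings: cheap to convert
cache.put("price", cost=5, values=[1.5, 2.25])      # floats: costly to convert
cache.put("qty", cost=3, values=[2, 7])             # forces an eviction
```

After the third insert the string column is gone, while the two numeric columns remain available without touching the raw file.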