NoDB: Efficient Query Execution on Raw Data Files
Summary
This article introduces NoDB, a database system that does not require data loading and instead operates directly on raw data files. The authors discuss two straightforward approaches to querying raw data files directly:
- load the data when the first relevant query arrives
- integrate raw data access with query execution (i.e. parse the data on the fly, only when the query plan requires it).
Method
NoDB minimizes the cost of querying raw data by (a) applying on-the-fly parsing, (b) creating data structures on demand that speed up access to raw data files, and (c) using caching techniques that eliminate the need for raw data access altogether.
On-the-fly parsing
- selective tokenizing - i.e. stop tokenizing a tuple once the requested attributes have been found
- selective parsing - delay the conversion of attributes until it is clear that they are part of the result set
- selective tuple formation - tuples are only created if they are selected (i.e. part of the result)
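The three ideas above can be sketched on a delimiter-separated file of integers. This is a minimal illustration, not the authors' implementation: the function name `select_rows`, its parameters, and the assumption that all requested columns hold integers are hypothetical.

```python
def select_rows(lines, needed, predicate_col, predicate, delimiter=","):
    """Yield tuples of the `needed` columns for rows that pass `predicate`.

    Hypothetical helper: assumes every requested column holds an integer.
    """
    last = max(max(needed), predicate_col)
    for line in lines:
        # Selective tokenizing: split only as far as the last column we need;
        # the rest of the line stays in one untouched trailing chunk.
        fields = line.rstrip("\n").split(delimiter, last + 1)
        # Selective parsing: convert only the predicate column at first.
        if not predicate(int(fields[predicate_col])):
            continue
        # Build a tuple (and convert its attributes) only for selected rows.
        yield tuple(int(fields[i]) for i in needed)


rows = ["1,10,100,x,y\n", "2,20,200,x,y\n", "3,30,300,x,y\n"]
print(list(select_rows(rows, [0, 2], 1, lambda v: v > 15)))  # [(2, 200), (3, 300)]
```

The trailing `x,y` columns are never split or converted, and the first row never has its output columns converted because it fails the predicate.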
Indexing
NoDB uses an adaptive positional map to reduce parsing and tokenization cost.
- the map is populated on the fly and stores relative tuple positions in a table structure
- indexing proceeds in the order in which attributes and tuples are requested
- the information is also exploited to locate nearby attributes: if the position of the 7th attribute is known, the map is used to locate the 5th attribute by tokenizing backwards from the 7th attribute's position. This (a) eliminates the need to locate line breaks and (b) cuts the tokenization effort from four attributes (1-5) to two (7-5).
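The backward-tokenization example above can be sketched as follows. This is a simplified, hypothetical model: the class `PositionalMap`, its per-row dictionary layout, and the `scans` counter are assumptions made for illustration and do not reflect how NoDB stores its map. Columns are 0-indexed, so the 7th attribute is column 6 and the 5th is column 4.

```python
class PositionalMap:
    """Remembers where requested fields start inside each row (hypothetical sketch)."""

    def __init__(self, delimiter=","):
        self.delimiter = delimiter
        self.starts = {}  # row number -> {column index: start offset in the line}
        self.scans = 0    # number of delimiter searches, to show the saving

    def locate(self, row, line, col):
        """Return the start offset of `col`, starting from the nearest known column."""
        known = self.starts.setdefault(row, {0: 0})  # column 0 always starts at 0
        anchor = min(known, key=lambda c: abs(c - col))
        pos = known[anchor]
        while anchor < col:  # forward tokenization
            pos = line.index(self.delimiter, pos) + 1
            anchor += 1
            self.scans += 1
        while anchor > col:  # backward tokenization
            # pos - 1 is the delimiter in front of the current field;
            # search for the delimiter before that one.
            pos = line.rfind(self.delimiter, 0, pos - 1) + 1
            anchor -= 1
            self.scans += 1
        known[col] = pos  # populate the map on the fly, for requested columns only
        return pos

    def field(self, row, line, col):
        start = self.locate(row, line, col)
        end = line.find(self.delimiter, start)
        return line[start:] if end == -1 else line[start:end]


line = "a,bb,ccc,dddd,e,ff,ggg,h"
pmap = PositionalMap()
print(pmap.field(0, line, 6), pmap.scans)  # ggg 6
print(pmap.field(0, line, 4), pmap.scans)  # e 8
```

The second lookup costs two delimiter searches (backwards from column 6) instead of the four it would take from the start of the row.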
Caching and query statistics
PostgresRaw, the authors' NoDB prototype built on PostgreSQL, caches converted binary data as soon as it is produced and prioritizes data that is more expensive to convert (e.g. numerical attributes over ASCII data). The system also collects query statistics on the fly to optimize access to the data.
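One way to picture a cache that favors expensive conversions is a bounded store that evicts the cheapest-to-rebuild column first. The sketch below is an assumption-laden illustration: the class `ConversionCache`, the cost table, and the eviction rule are hypothetical and are not taken from PostgresRaw.

```python
class ConversionCache:
    """Bounded cache of converted columns; evicts cheap-to-rebuild entries first."""

    # Assumed relative conversion costs: illustrative numbers, not measurements.
    COST = {"str": 1, "int": 5, "float": 10}

    def __init__(self, capacity):
        self.capacity = capacity
        self.columns = {}  # column name -> (type name, list of converted values)

    def put(self, name, type_name, values):
        self.columns[name] = (type_name, values)
        while len(self.columns) > self.capacity:
            # Drop the column that would be cheapest to convert again.
            victim = min(self.columns, key=lambda n: self.COST[self.columns[n][0]])
            del self.columns[victim]

    def get(self, name):
        entry = self.columns.get(name)
        return None if entry is None else entry[1]


cache = ConversionCache(capacity=2)
cache.put("name", "str", ["a", "b"])
cache.put("price", "float", [1.5, 2.5])
cache.put("qty", "int", [3, 4])
print(cache.get("name"))   # None
print(cache.get("price"))  # [1.5, 2.5]
```

When the third column arrives, the string column is evicted because re-reading ASCII data is assumed to cost less than re-converting the numeric columns.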