The Pathologies of Big Data


Jacobs, Adam (2009). "The Pathologies of Big Data." Communications of the ACM, 52(8), pp. 36-44.

The article demonstrates the importance of a thorough understanding of data structures and storage techniques when dealing with huge data sets. Jacobs draws the following conclusions:

  • it is easier to get data into a DBMS than out of it
  • DBMSs are optimized for retrieving small amounts of information
  • the largest cardinalities in most databases (i.e., the number of distinct entities about which observations are made) are small compared to the total number of observations
  • the relational data model ignores the ordering of rows; one must be willing to exploit this ordering to achieve acceptable performance (cf. PostgreSQL's CLUSTER command; a sketch follows this list)
  • the previous point is justified by an experiment contrasting random with sequential access speeds for disk, SSD, and memory: random disk access turned out to be about 150,000 times slower than sequential access, random SSD access about 15,000 times slower than sequential SSD access, and even random memory access is about 10 times slower than sequential access (a micro-benchmark sketch also follows the list)
  • network access is nowadays comparable to disk access with regard to latency and speed, which encourages the use of distributed computing
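
To make the row-ordering point concrete, here is a minimal sketch of imposing a physical row order in PostgreSQL via psycopg2; the database name `obsdb`, the table `observations`, its column `observed_at`, and the index name are all hypothetical placeholders, not anything from the paper:

```python
import psycopg2

# Hypothetical connection string; adjust to the actual database.
conn = psycopg2.connect("dbname=obsdb")
conn.autocommit = True
with conn.cursor() as cur:
    # Build an index on the column whose order the queries follow...
    cur.execute(
        "CREATE INDEX IF NOT EXISTS obs_ts_idx ON observations (observed_at);"
    )
    # ...then physically rewrite the table in that index order, so range
    # scans over observed_at become sequential rather than random reads.
    cur.execute("CLUSTER observations USING obs_ts_idx;")
conn.close()
```

Note that CLUSTER is a one-time rewrite: PostgreSQL does not maintain the physical order as new rows arrive, so the command has to be rerun periodically.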
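
The sequential-versus-random gap itself is easy to probe with a small micro-benchmark. The sketch below (file path, block size, and read count are arbitrary choices of mine) times single-block reads at sequential and at random offsets of a large file; on a spinning disk with a cold cache, the random pass should come out orders of magnitude slower:

```python
import os
import random
import time

PATH = "testfile.bin"    # hypothetical large file on the drive under test
BLOCK = 4096             # read 4 KiB per request
N_READS = 10_000         # blocks to read per pass

n_blocks = os.path.getsize(PATH) // BLOCK

def timed_read(offsets):
    """Read one block at each offset; return elapsed seconds."""
    start = time.perf_counter()
    with open(PATH, "rb", buffering=0) as f:  # unbuffered file object
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.perf_counter() - start

sequential = [i * BLOCK for i in range(N_READS)]
scattered = [random.randrange(n_blocks) * BLOCK for _ in range(N_READS)]

t_seq = timed_read(sequential)
t_rand = timed_read(scattered)
print(f"sequential: {t_seq:.3f}s  random: {t_rand:.3f}s  "
      f"slowdown: {t_rand / t_seq:.0f}x")
```

Because the operating system caches file pages aggressively, the measured ratio will understate the raw hardware gap unless the page cache is dropped between runs.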