The Pathologies of Big Data
by Adam Jacobs (2009). Communications of the ACM, 52(8), pp. 36–44.
The article demonstrates that dealing with huge data sets requires a thorough understanding of data structures and storage techniques. Jacobs draws the following conclusions:
- it is easier to get data into a DBMS than out of it
- a DBMS is optimized for retrieving small amounts of information, not for reading large portions of the stored data
- the largest cardinalities in most databases (i.e., the number of distinct entities about which observations are made) are small compared to the total number of observations; a short illustration follows this list
- the relational data model deliberately ignores the ordering of rows; to achieve acceptable performance on large data sets, one must nevertheless be willing to exploit this physical order (cf. PostgreSQL's CLUSTER command)
- Jacobs justifies the previous point with an experiment contrasting random and sequential access speeds for disk, SSD, and main memory: random disk access turned out to be about 150,000 times slower than sequential disk access, random SSD access about 15,000 times slower than sequential SSD access, and even random memory access about 10 times slower than sequential memory access (a miniature version of this measurement is sketched below)
- network access is nowadays comparable to disk access with respect to latency and bandwidth, which encourages the use of distributed computing
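
To illustrate the cardinality point: because the number of distinct entities is tiny compared to the number of observations, each observation can reference its entity through a small integer code. The following Python sketch (all names and values are made up for illustration) shows this dictionary-encoding idea:

```python
# Hypothetical example: millions of observations, but only a handful of
# distinct entities (here: weather stations) are ever observed.
stations = ["KJFK", "KLAX", "KORD", "KSEA"]            # low cardinality
code_of = {name: i for i, name in enumerate(stations)}

# Each observation stores a small integer code instead of repeating the
# station name in every row; the dictionary is paid for only once.
observations = [(code_of["KJFK"], 21.5), (code_of["KLAX"], 25.0)] * 1_000_000

print(len(stations), "distinct entities,", len(observations), "observations")

# Decoding an observation is a single array lookup.
code, temp = observations[0]
print(stations[code], temp)   # -> KJFK 21.5
```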
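
The access-speed experiment can be reproduced in miniature with a sketch like the one below. It times a sequential scan of a scratch file against reads of the same kind at random offsets; the file name, size, and block size are arbitrary assumptions, and for realistic disk numbers the file must exceed the OS page cache (i.e., RAM).

```python
import os
import random
import time

PATH = "scratch.bin"   # hypothetical scratch file
SIZE = 1 << 30         # 1 GiB; use a file larger than RAM for honest disk numbers
BLOCK = 4096           # granularity of the random reads
N_RANDOM = 10_000

# Create the test file once, with real (non-sparse) contents.
if not os.path.exists(PATH):
    with open(PATH, "wb") as f:
        for _ in range(SIZE // (1 << 20)):
            f.write(os.urandom(1 << 20))

def sequential_scan():
    """Read the whole file front to back in 1 MiB chunks."""
    t0 = time.perf_counter()
    with open(PATH, "rb") as f:
        while f.read(1 << 20):
            pass
    return time.perf_counter() - t0

def random_reads(n=N_RANDOM):
    """Read n blocks of BLOCK bytes at random offsets."""
    t0 = time.perf_counter()
    with open(PATH, "rb") as f:
        for _ in range(n):
            f.seek(random.randrange(0, SIZE - BLOCK))
            f.read(BLOCK)
    return time.perf_counter() - t0

seq = sequential_scan()
rnd = random_reads()
print(f"sequential: {SIZE / seq / 1e6:,.0f} MB/s")
print(f"random:     {N_RANDOM * BLOCK / rnd / 1e6:,.1f} MB/s")
```

On a spinning disk the gap between the two throughput figures is dominated by seek latency, which is exactly the effect Jacobs uses to argue for exploiting physical row order.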