Data sketching

Albert Weichselbraun is a Professor of Information Science at the University of Applied Sciences of the Grisons.

This article introduces three popular data structures that efficiently handle and summarize large data sets.

A Bloom filter is a probabilistic set that answers the question of whether an item is a member (i.e. has been seen by the filter) with either (i) “the item is definitely not part of the set” or (ii) “the item might be part of the set”.
The Count-Min Sketch is a probabilistic structure for counting how often items of a given type have been observed. When queried, it returns an estimate that is an upper bound on the true count.
HyperLogLog estimates the number of distinct items seen in a large collection without keeping track of every individual item.
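
To make the three descriptions above more concrete, here is a minimal Python sketch of each structure. It is an illustration under stated assumptions, not a reference implementation: the class names, the default sizes (`num_bits`, `width`, `depth`, `p`) and the salted `blake2b` helper `_hash` are all hypothetical choices made for this example, and the HyperLogLog estimator uses only the commonly cited bias constant and small-range correction.

```python
import hashlib
import math


def _hash(item: str, seed: int) -> int:
    """64-bit hash of `item`, salted with `seed` (an illustrative choice)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        return [_hash(item, seed) % self.num_bits
                for seed in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False: definitely not in the set. True: might be in the set.
        return all(self.bits[pos] for pos in self._positions(item))


class CountMinSketch:
    def __init__(self, width: int = 256, depth: int = 4):
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item: str, count: int = 1) -> None:
        for row, counters in enumerate(self.table):
            counters[_hash(item, row) % self.width] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum never undercounts.
        return min(counters[_hash(item, row) % self.width]
                   for row, counters in enumerate(self.table))


class HyperLogLog:
    def __init__(self, p: int = 10):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = _hash(item, 0)
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw
```

In this sketch, `BloomFilter.might_contain` can return a false positive but never a false negative, `CountMinSketch.estimate` can overcount but never undercount, and `HyperLogLog` stores only `2**p` small registers regardless of how many items are added. Larger values of `num_bits`, `width` or `p` trade memory for accuracy.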
