Large-scale Incremental Processing Using Distributed Transactions and Notifications

1 minute read

by Peng and Dabek

This paper introduces Percolator and the corresponding processing pipeline called Caffeine, which are systems for incrementally processing updates to large data sets.

Motivation

Google maintains a Web index containing sets of invariants on the index

content available under multiple URLs is assigned to the URL with the highest page rank
anchor text from incoming links is attached to the page
MapReduce limits the parallelism of computations: all documents finish one processing step before entering the next one, which makes small incremental updates of indexes inefficient
Databases use transactions to maintain invariants but do not provide the scalability required

Method

Percolator works on top of share-nothing parallel databases by introducing an additional layer to the BigTable database that (i) achieves high throughput due to the use of a massive amount of threads for performing update, and (ii) provides ACID compliant transactions.
The system attaches observers (triggers) to database columns that are invoked once a column change. These observers process the changed data, update columns and may trigger other observers. Cascades of such small update operations incrementally update the index to reflect the changes introduced by the new documents.
Message collapsing ensures that multiple changes concerning one column are pooled and, therefore, only trigger one observer.
Multiple versions of data items are stored using BigTable's timestamp dimension. This approach is also used for implementing a locking mechanism.

Costs and benefits

Percolator reduced the average document processing latency by a factor of 100 at the cost of the number of database calls required per document (MapReduce: 1 call for 100s of Web pages versus 50 calls for one page using Percolator) and a higher resource consumption (about twice as many as MapReduce).
Only applicable where computations can be broken down into small updates (otherwise MapReduce is more efficient)
Percolator is essentially immune to stragglers which are a serious problem in batch processing.

Misc

Implementation of databases: "get close to the iron" - use hardware as directly as possible since operation system structures like disk caches and schedulers make it hard to implement efficient databases.

Share on

Twitter Facebook LinkedIn

Albert Weichselbraun

Large-scale Incremental Processing Using Distributed Transactions and Notifications

Motivation

Method

Costs and benefits

Misc

Share on

You may also enjoy

Big, Linked Geospatial Data and Its Application in Earth Observation

Employment relations: a data driven analysis of job markets using online job boards and online professional networks

Suffix array

Dynamic feature scaling for online learning of binary classifiers