by Peng and Dabek
This paper introduces Percolator, a system for incrementally processing updates to large data sets, and Caffeine, the new Google web-indexing pipeline built on top of it.
Motivation
- Google maintains a Web index and must preserve a set of invariants on that index:
- content available under multiple URLs is assigned to the URL with the highest PageRank
- anchor text from incoming links is attached to the page
- MapReduce limits the parallelism of the computation: all documents must finish one processing step before any document enters the next, which makes small incremental index updates inefficient
- Databases use transactions to maintain invariants but do not provide the scalability required
- Percolator works in the style of shared-nothing parallel databases by adding a layer on top of the BigTable storage system that (i) achieves high throughput by running a massive number of threads that perform updates, and (ii) provides ACID-compliant transactions.
- The system attaches observers (similar to database triggers) to database columns; an observer is invoked whenever its column changes. Observers process the changed data, update other columns, and may in turn trigger further observers. Cascades of such small update operations incrementally bring the index up to date with newly crawled documents.
- Message collapsing ensures that multiple changes to the same column are coalesced and therefore trigger only one observer invocation.
- Multiple versions of data items are stored using BigTable's timestamp dimension. This approach is also used for implementing a locking mechanism.
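The observer and message-collapsing mechanism above can be sketched as a toy multi-version column store. This is an illustrative model only, not Percolator's real interface: the `Table` class, its method names, and the single shared dirty set are assumptions made for the example.

```python
import collections

class Table:
    """Toy multi-version column store: each cell holds {timestamp: value},
    loosely modeling BigTable's timestamp dimension."""

    def __init__(self):
        self.cells = collections.defaultdict(dict)      # (row, col) -> {ts: value}
        self.observers = collections.defaultdict(list)  # col -> [callback]
        self.dirty = set()                              # cells with a pending notification

    def register_observer(self, column, callback):
        self.observers[column].append(callback)

    def write(self, row, column, timestamp, value):
        self.cells[(row, column)][timestamp] = value
        # Repeated writes to the same cell collapse into one pending
        # notification ("message collapsing").
        self.dirty.add((row, column))

    def read_latest(self, row, column):
        versions = self.cells.get((row, column), {})
        return versions[max(versions)] if versions else None

    def run_observers(self):
        # Deliver at most one notification per dirty cell; an observer's own
        # writes re-enter the dirty set, producing the cascade of updates.
        while self.dirty:
            row, column = self.dirty.pop()
            for callback in self.observers[column]:
                callback(self, row, column)

# Usage: an observer that derives an uppercased column from raw content.
t = Table()
t.register_observer("raw", lambda tbl, row, col:
    tbl.write(row, "derived", 1, tbl.read_latest(row, col).upper()))
t.write("urlA", "raw", 1, "hello")
t.write("urlA", "raw", 2, "hello world")  # collapses with the write above
t.run_observers()                         # observer fires once, sees the latest value
```

The two writes to `("urlA", "raw")` produce a single observer invocation, which sees the newest version and populates the derived column.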
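The use of the timestamp dimension for locking can also be sketched. The model below is a heavily simplified, single-process stand-in for Percolator's lock/write/data column scheme: `prewrite` aborts on a conflicting committed write or a live lock, and `commit` makes the value visible at a commit timestamp. Class and method names are assumptions for illustration.

```python
class TxnStore:
    """Toy sketch of timestamp-based locking in the spirit of Percolator's
    lock/write/data columns; illustrative only, no distribution or recovery."""

    def __init__(self):
        self.data = {}    # (row, col) -> {start_ts: value}
        self.write = {}   # (row, col) -> {commit_ts: start_ts}
        self.lock = {}    # (row, col) -> start_ts of the lock holder

    def prewrite(self, row, col, start_ts, value):
        cell = (row, col)
        # Abort on a write committed after our start timestamp, or a live lock.
        if any(ts > start_ts for ts in self.write.get(cell, {})) or cell in self.lock:
            return False
        self.data.setdefault(cell, {})[start_ts] = value
        self.lock[cell] = start_ts
        return True

    def commit(self, row, col, start_ts, commit_ts):
        cell = (row, col)
        if self.lock.get(cell) != start_ts:
            return False  # our lock is gone; abort
        self.write.setdefault(cell, {})[commit_ts] = start_ts
        del self.lock[cell]
        return True

    def get(self, row, col, read_ts):
        cell = (row, col)
        visible = [ts for ts in self.write.get(cell, {}) if ts <= read_ts]
        if not visible:
            return None
        return self.data[cell][self.write[cell][max(visible)]]

# Usage: commit one version, then observe a write-write conflict.
s = TxnStore()
s.prewrite("r", "c", 1, "v1")
s.commit("r", "c", 1, 2)
conflict = s.prewrite("r", "c", 1, "v2")  # False: a write committed after ts 1
```

A transaction that started before a concurrent commit fails at `prewrite`, which is how the versioned write column doubles as a conflict-detection mechanism.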
Costs and benefits
- Percolator reduced the average document processing latency by a factor of 100, at the cost of many more database calls per document (MapReduce: one call for hundreds of Web pages versus roughly 50 calls for a single page in Percolator) and about twice the resource consumption of MapReduce.
- Only applicable where computations can be broken down into small updates (otherwise MapReduce is more efficient)
- Percolator is essentially immune to stragglers which are a serious problem in batch processing.
- Lesson on implementing databases: "get close to the iron", i.e., use the hardware as directly as possible, since operating-system structures such as disk caches and schedulers make it hard to implement an efficient database.