Kumar, Arun, Feng Niu, and Christopher Ré. "Hazy: Making It Easier to Build and Maintain Big-data Analytics." Communications of the ACM 56, no. 3 (March 2013): 40-49. doi:10.1145/2428556.2428570.

This article introduced the Hazy project, an approach that identifies common patterns in implementing algorithms for big-data analytics. Hazy distinguishes two kinds of abstractions: (i) programming abstractions, which decouple applications from the underlying algorithms, and (ii) infrastructure abstractions, which provide a common infrastructure for implementing those algorithms.
Programming Abstractions

Hazy provides programming abstractions through a combination of the relational data model and a probabilistic rule-based language. The project uses the Markov logic language: a Markov logic program (also called a Markov logic network, MLN) consists of weighted rules, and the weights govern the probability of the results derived from those rules.
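To make the weighted-rules idea concrete, here is a minimal sketch of MLN-style inference by enumeration. The rule, its weight, and the constants are hypothetical (in the style of the classic "smokers" example, not taken from the article): a world's unnormalized probability is the exponential of the summed weights of its satisfied ground rules.

```python
import math
from itertools import product

people = ["anna", "bob"]
# One weighted first-order rule (hypothetical): Smokes(x) => Cancer(x), weight 1.5.
RULE_WEIGHT = 1.5

def world_score(smokes, cancer):
    """exp(sum of weights of satisfied groundings of Smokes(x) => Cancer(x))."""
    total = 0.0
    for x in people:
        if (not smokes[x]) or cancer[x]:   # the implication holds for x
            total += RULE_WEIGHT
    return math.exp(total)

# Enumerate all worlds: every truth assignment to Smokes/Cancer per person.
worlds = []
for bits in product([False, True], repeat=2 * len(people)):
    smokes = dict(zip(people, bits[:len(people)]))
    cancer = dict(zip(people, bits[len(people):]))
    worlds.append((smokes, cancer))

Z = sum(world_score(s, c) for s, c in worlds)          # partition function
p_cancer_anna = sum(world_score(s, c) for s, c in worlds if c["anna"]) / Z
print(f"P(Cancer(anna)) = {p_cancer_anna:.3f}")
```

Brute-force enumeration is exponential in the number of ground atoms; real MLN systems use sampling or lifted inference instead, but the probability being computed is the same.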
Infrastructure Abstractions

Bismarck is a unified architecture for in-DBMS analytics that aims to provide a DBMS-based infrastructure abstraction, decoupling algorithms from implementation details.
- The key insight in Bismarck is that incremental gradient descent (IGD) - a method for solving convex programming problems such as logistic regression, support vector machines (SVMs), and conditional random fields (CRFs) - can be implemented using user-defined aggregates (UDAs).
- Bismarck exposes standard UDA functions for implementing algorithms, among them:
- Transition(state, data), which is invoked automatically on each tuple of the selected relation.
- HogWild! parallelizes IGD (running updates concurrently without locking shared state) and therefore benefits from multi-core machines.
- Jellyfish chunks the data matrix using Latin-square patterns, enabling it to run the factorization in parallel on multiple cores.
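The IGD-as-UDA insight can be sketched as follows. Only Transition(state, data) is named above; the initialize/terminate companions and the logistic-regression loss are assumptions chosen for illustration, not the article's exact interface.

```python
import math
import random

def initialize(dim, step_size=0.1):
    """Create the aggregation state: the current model plus a step size."""
    return {"w": [0.0] * dim, "eta": step_size}

def transition(state, data):
    """One IGD step on a single tuple (x, y) under the logistic loss."""
    x, y = data                        # label y in {-1, +1}
    w, eta = state["w"], state["eta"]
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    g = -y / (1.0 + math.exp(margin))  # derivative of log(1 + exp(-y w.x)) w.r.t. w.x
    state["w"] = [wi - eta * g * xi for wi, xi in zip(w, x)]
    return state

def terminate(state):
    """Return the trained model once every tuple has been consumed."""
    return state["w"]

# Simulate the DBMS driving the aggregate over a relation: one pass per
# epoch, rerun until convergence. Tuples are [bias, x] with label sign(x).
random.seed(0)
data = [([1.0, x], 1 if x > 0 else -1)
        for x in (random.uniform(-2, 2) for _ in range(200))]

state = initialize(dim=2)
for epoch in range(5):
    random.shuffle(data)               # IGD benefits from shuffled tuple order
    for tup in data:
        state = transition(state, tup)
w = terminate(state)

accuracy = sum(
    1 for x, y in data
    if (1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) == y
) / len(data)
print(f"training accuracy: {accuracy:.2f}")
```

In a HogWild!-style setup, multiple workers would call transition concurrently on a shared state without any locking; the updates may race, but for sparse problems the result still converges.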
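Why a Latin square makes the factorization parallelizable can be shown with a toy schedule. The concrete block assignment below (worker i takes block (i, (i + k) mod b) in round k) is an assumed illustration of the pattern, not Jellyfish's exact scheme: within each round, no two workers touch the same row-block or column-block, so their updates to the row and column factors cannot conflict.

```python
b = 4  # number of workers = grid dimension

def schedule(b):
    """Per round, the (row_block, col_block) assigned to each worker."""
    return [[(i, (i + k) % b) for i in range(b)] for k in range(b)]

rounds = schedule(b)
for blocks in rounds:
    rows = {r for r, _ in blocks}
    cols = {c for _, c in blocks}
    # Within a round each row block and column block appears exactly once,
    # so the workers' factor updates are disjoint.
    assert len(rows) == b and len(cols) == b

# Across all rounds, every block of the b x b grid is visited exactly once.
visited = {blk for blocks in rounds for blk in blocks}
assert len(visited) == b * b
print("conflict-free schedule over", b * b, "blocks")
```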
Outlook

The authors mention the following research directions and challenges:
- Feature extraction - supporting the extraction of relevant features (or signals) for subsequent machine-learning algorithms; this is particularly important because more signals tend to beat more sophisticated models.
- Assisted development - lowering the deep understanding of data and algorithms currently required to build such systems.
- Providing support for new data platforms, including the Hadoop ecosystem.