Hazy: Making It Easier to Build and Maintain Big-Data Analytics

Kumar, Arun, Feng Niu, and Christopher Ré. "Hazy: Making It Easier to Build and Maintain Big-Data Analytics." Communications of the ACM 56, no. 3 (March 2013): 40–49. doi:10.1145/2428556.2428570.

This article introduced the Hazy project, an approach that identifies common patterns for implementing big-data analytics algorithms. Hazy distinguishes between two kinds of abstractions: (i) programming abstractions, which decouple applications from the underlying algorithms, and (ii) infrastructure abstractions, which provide a common infrastructure for implementing those algorithms.

Programming Abstractions

Hazy provides programming abstractions through a combination of the relational data model and a probabilistic rule-based language. The project uses Markov logic, written as a Markov logic program (also called a Markov logic network, MLN): a set of first-order rules, each annotated with a weight that expresses how strongly the rule is expected to hold.
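
To make the flavor of such a program concrete, here is a small Python sketch of weighted rules and the scoring they induce. The rule syntax, predicates, and weights are invented for illustration; this is not Hazy's actual input format.

```python
import math

# Hypothetical Markov logic program: a list of weighted first-order rules.
# Larger weights make worlds that satisfy a rule's groundings more probable,
# but no rule is a hard constraint.
rules = [
    (1.5, "Smokes(x) => Cancer(x)"),
    (1.1, "Friends(x, y) ^ Smokes(x) => Smokes(y)"),
]

def unnormalized_probability(satisfied_counts):
    """Markov logic scores a possible world as exp(sum_i w_i * n_i), where
    n_i is the number of satisfied groundings of rule i in that world."""
    return math.exp(sum(w * n for (w, _), n in zip(rules, satisfied_counts)))
```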

Infrastructure Abstractions

Bismarck is a unified architecture for in-DBMS analytics that aims to provide a DBMS-based infrastructure abstraction, decoupling algorithms from implementation details.

  1. The key insight in Bismarck is that incremental gradient descent (IGD), a method for solving convex programming problems such as logistic regression, support vector machines (SVMs), and conditional random fields (CRFs), can be implemented using user-defined aggregates (UDAs).
  2. Bismarck exposes the following standard functions for implementing algorithms (a sketch of how they fit together follows this list):
    • Initialize(state)
    • Transition(state, data), which is invoked automatically for each tuple in the selected relation.
    • Finalize(state)

  3. Bismarck draws on the infrastructure already available in a DBMS and therefore benefits from the maturity and scalability of these platforms. Parallel and distributed databases such as Greenplum are supported by adding a Merge(state, state) step that combines partial states computed on different segments of the data.
  4. The following Bismarck sub-projects further improve scalability and performance:
    • HogWild! runs IGD updates in parallel without locking and thus benefits from multi-core machines.
    • Jellyfish uses Latin-square patterns to chunk the data matrix, enabling it to run the factorization in parallel on multiple cores (see the scheduling sketch after this list).
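
To make the aggregate interface concrete, the following Python sketch casts IGD for logistic regression into the Initialize/Transition/Finalize (plus Merge) shape from the list above. The dense feature layout, fixed learning rate, and averaging merge are assumptions for illustration; a real Bismarck UDA is registered with the DBMS rather than driven by a Python loop.

```python
import math

LEARNING_RATE = 0.1  # assumed constant step size for illustration

def initialize(num_features):
    """Initialize(state): start from a zero model."""
    return [0.0] * num_features

def transition(state, example):
    """Transition(state, data): one IGD step on a single tuple.
    `example` is (features, label) with label in {-1, +1}; the loss is
    the logistic loss log(1 + exp(-label * <state, features>))."""
    features, label = example
    margin = label * sum(w * x for w, x in zip(state, features))
    scale = -label / (1.0 + math.exp(margin))  # gradient = scale * features
    return [w - LEARNING_RATE * scale * x for w, x in zip(state, features)]

def merge(state_a, state_b):
    """Merge(state, state): combine partial models from parallel scans;
    simple averaging is one possible (assumed) strategy."""
    return [(a + b) / 2.0 for a, b in zip(state_a, state_b)]

def finalize(state):
    """Finalize(state): emit the trained model."""
    return state

# One pass over an in-memory stand-in for the selected relation.
relation = [([1.0, 2.0], +1), ([2.0, 0.5], -1)]
model = initialize(num_features=2)
for row in relation:
    model = transition(model, row)
print(finalize(model))
```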
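The Latin-square scheduling behind Jellyfish can also be sketched independently of the factorization itself: partition the matrix into p x p blocks and, in each of p rounds, hand out p blocks that share no row or column blocks, so the corresponding factor updates never conflict. This schedule is a plausible reconstruction, not Jellyfish's actual code.

```python
def latin_square_rounds(p):
    """Yield p rounds of (row_block, col_block) pairs for a p x p blocking;
    within a round no two blocks share a dimension, so p workers can apply
    factorization updates without locks."""
    for k in range(p):
        yield [(i, (i + k) % p) for i in range(p)]

for round_no, blocks in enumerate(latin_square_rounds(3)):
    print(round_no, blocks)
# 0 [(0, 0), (1, 1), (2, 2)]
# 1 [(0, 1), (1, 2), (2, 0)]
# 2 [(0, 2), (1, 0), (2, 1)]
```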

Outlook

The authors mention the following research directions and challenges:

  1. Feature extraction - support the extraction of relevant features (or signals) for subsequent machine-learning algorithms; this is particularly important because more signals tend to beat more sophisticated models.
  2. Assisted development - reduce the deep understanding of data and algorithms that developers currently need to build such systems.
  3. New data platforms - provide support for additional platforms, including the Hadoop ecosystem.