Data Warehousing and Analytics Infrastructure at Facebook

less than 1 minute read

Thusoo, A. et al., 2010. Data warehousing and analytics infrastructure at facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. SIGMOD ™10. New York, NY, USA: ACM, pp. 1013—1020. Available at: http://doi.acm.org/10.1145/1807167.1807278.

This article describes some of the technologies and design decisions that help Facebook handle data loads of 60 TB of new data per day and a total data volume of 15 PB using the following technologies:

  1. Apache Hive, a hadoop-based data warehousing system facilitating easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
  2. Scribe for collecting and aggregating log file in real time.
  3. Microstrategy, a commercial business intelligence platform for dimensional analysis.
  4. Different data clusters for
    • regular, in-production jobs that usually hold a month's worth of data
    • ad-hock jobs, holding historical data for verifying models and hypothesis.