Data Warehousing and Analytics Infrastructure at Facebook
Thusoo, A. et al., 2010. Data warehousing and analytics infrastructure at Facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. SIGMOD '10. New York, NY, USA: ACM, pp. 1013-1020. Available at: http://doi.acm.org/10.1145/1807167.1807278.
This article describes some of the technologies and design decisions that help Facebook handle 60 TB of new data per day and a total data volume of 15 PB. The main components are:
- Apache Hive, a Hadoop-based data warehousing system that facilitates data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
- Scribe, for collecting and aggregating log files in real time.
- MicroStrategy, a commercial business intelligence platform for dimensional analysis.
- Separate data clusters for
  - regular production jobs, on a cluster that usually holds a month's worth of data
  - ad hoc jobs, on a cluster that holds historical data for verifying models and hypotheses.
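
To put the two volume figures quoted above in relation to each other, here is a rough back-of-the-envelope sketch. It is not taken from the paper: the function name, the decimal unit conversion, and the assumption of a constant ingest rate with no compression, replication, or deletion are all illustrative simplifications.

```python
# Assumption: decimal units (1 PB = 1000 TB). The summary does not say
# which convention the quoted figures use.
TB_PER_PB = 1000


def days_to_accumulate(total_pb: float, daily_tb: float) -> float:
    """Days of ingest at a constant daily rate needed to reach total_pb.

    Hypothetical helper: ignores compression, replication, and data expiry.
    """
    if daily_tb <= 0:
        raise ValueError("daily_tb must be positive")
    return total_pb * TB_PER_PB / daily_tb


# Figures quoted in the summary: 15 PB total, 60 TB of new data per day.
days = days_to_accumulate(total_pb=15, daily_tb=60)
print(f"{days:.0f} days")  # prints "250 days"
```

Under these simplifications, the total corresponds to roughly 250 days of ingest at the quoted daily rate. A real warehouse would not accumulate data this evenly, so the number is only an order-of-magnitude check.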