Thusoo, A. et al., 2010. Data warehousing and analytics infrastructure at Facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. SIGMOD '10. New York, NY, USA: ACM, pp. 1013–1020. Available at: http://doi.acm.org/10.1145/1807167.1807278.
This article describes some of the technologies and design decisions that help Facebook handle 60 TB of new data per day and a total data volume of 15 PB. The main components are:
- Apache Hive, a Hadoop-based data warehousing system that supports data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
- Scribe, for collecting and aggregating log files in real time.
- MicroStrategy, a commercial business intelligence platform for dimensional analysis.
- Separate clusters for different workloads:
  - a cluster for regular production jobs, which usually holds a month's worth of data
  - a cluster for ad hoc jobs, which holds historical data for verifying models and hypotheses.