Big Data and Its Technical Challenges
Summary
- data acquisition
- information extraction and cleaning
- data integration, aggregation and representation, where the cost of full integration is often prohibitive; techniques that provide on-demand integration are therefore very attractive (e.g. only analyze relevant tweets, do on-demand focused crawls to complement data, ...)
- modelling and analysis, which is often challenging due to the data's noisy, dynamic, heterogeneous, inter-related and untrustworthy nature.
- interpretation, which requires decision makers to make use of the data. The financial crisis underscored how assumptions influence the outcome of such analyses. Therefore, big data tools must give users the ability both to (a) interpret the results and (b) perform analyses under different assumptions and parameters to consider different scenarios and outcomes.
- Heterogeneity
- Inconsistency and incompleteness
- Scale (i.e., the amount of data)
- Timeliness (i.e., the ability to obtain relevant information before the data becomes irrelevant); for example, credit card fraud should ideally be detected before a suspicious transaction has been completed.
- Privacy and data ownership
- The human perspective (visualization and collaboration)
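The on-demand integration idea mentioned in the list above (analyze only the relevant tweets rather than integrating everything) can be sketched as a lazy pipeline: records are filtered for relevance first, and only the survivors reach the expensive integration step. This is a minimal illustration, not the paper's method; the record layout, the keyword test in `is_relevant`, and the `integrate` placeholder are all hypothetical.

```python
from typing import Iterable, Iterator


def is_relevant(tweet: dict, keywords: Iterable[str]) -> bool:
    """Hypothetical relevance test: does the text contain any keyword?"""
    words = set(tweet["text"].lower().split())
    return not words.isdisjoint(k.lower() for k in keywords)


def relevant_only(tweets: Iterable[dict], keywords: Iterable[str]) -> Iterator[dict]:
    """Lazily yield only the relevant tweets; nothing is materialized up front."""
    keywords = list(keywords)
    for tweet in tweets:
        if is_relevant(tweet, keywords):
            yield tweet


def integrate(tweet: dict) -> dict:
    """Placeholder for a costly integration step (assumed, not from the paper)."""
    return {"id": tweet["id"], "tokens": tweet["text"].lower().split()}


# Made-up sample records, for illustration only.
sample = [
    {"id": 1, "text": "Heavy traffic on the freeway"},
    {"id": 2, "text": "Lunch was great"},
    {"id": 3, "text": "Traffic cleared downtown"},
]

# Only tweets that pass the filter are ever integrated.
integrated = [integrate(t) for t in relevant_only(sample, ["traffic"])]
print([record["id"] for record in integrated])
```

Because `relevant_only` is a generator, the same pattern would apply to an unbounded stream: the integration cost is paid per relevant record, not per record received.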
Case Study
The paper also includes a case study of the Los Angeles Metropolitan Transportation Authority (LA-Metro), which collects transportation data from the LA County road network. The data arrives at 46 MB/min and over 15 TB have been collected so far. The data is analyzed for traffic patterns and to obtain temporal models for road segments.
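To put these figures in perspective, the following back-of-envelope calculation converts the arrival rate into a daily volume and estimates how long it would take to accumulate 15 TB. It assumes decimal units (1 TB = 1,000,000 MB) and a constant arrival rate; neither assumption is stated in the source, so the result is only a rough indication.

```python
# Figures from the case study.
RATE_MB_PER_MIN = 46
COLLECTED_TB = 15

# Assumption: decimal units (1 GB = 1,000 MB, 1 TB = 1,000,000 MB).
MB_PER_GB = 1_000
MB_PER_TB = 1_000_000
MINUTES_PER_DAY = 60 * 24


def daily_volume_gb(rate_mb_per_min: float) -> float:
    """Data volume per day in GB at a constant arrival rate."""
    return rate_mb_per_min * MINUTES_PER_DAY / MB_PER_GB


def days_to_collect(total_tb: float, rate_mb_per_min: float) -> float:
    """Days needed to accumulate total_tb at a constant arrival rate."""
    minutes = total_tb * MB_PER_TB / rate_mb_per_min
    return minutes / MINUTES_PER_DAY


print(f"Daily volume: {daily_volume_gb(RATE_MB_PER_MIN):.2f} GB")
print(f"Days to reach {COLLECTED_TB} TB: {days_to_collect(COLLECTED_TB, RATE_MB_PER_MIN):.0f}")
```

Under these assumptions the stream amounts to about 66 GB per day, so 15 TB corresponds to roughly 226 days of continuous collection.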