COMP9313 Big Data Management Week 1a
Introduction
Lecturer: Yifang Sun
Considering that some students are in different time zones, the lecturer uploaded his pre-recorded lecture videos to Google Drive and Baidu Netdisk…hmmmmm…although I am not satisfied with pre-recorded lectures, the fact that he uploaded a video to Baidu Netdisk means he does care about Chinese students.
The first concept of this course is Big Data. However, our professor said there is no standard definition, which is quite uncommon for a course.
Major Characteristics of Big Data
7 V’s
1. Volume (Scale; massive amounts of data)
- Definition: Volume is how much data we have – what used to be measured in Gigabytes is now measured in Zettabytes (ZB) or even Yottabytes (YB); a quick sketch of that unit ladder follows below. The IoT (Internet of Things) is creating exponential growth in data.
Source: The Digitization of the World From Edge to Core
- Challenging: Time complexity
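To get a feel for the jump from Gigabytes to Zettabytes, here is a minimal Python sketch of the decimal unit ladder. The 175 ZB figure in the comment is the projection commonly quoted from the IDC report cited above, not a number stated in these notes.

```python
# Decimal (SI) data-size units: each step up multiplies by 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value: float, unit: str) -> float:
    """Convert a value in the given unit to raw bytes."""
    return value * 1000 ** UNITS.index(unit)

print(f"1 GB = {to_bytes(1, 'GB'):.0e} bytes")   # 1e+09
print(f"1 ZB = {to_bytes(1, 'ZB'):.0e} bytes")   # 1e+21
# So 1 ZB is a trillion gigabytes; the IDC report projects a global
# datasphere of roughly 175 ZB by 2025.
print(f"ZB per GB: {to_bytes(1, 'ZB') / to_bytes(1, 'GB'):.0e}")  # 1e+12
```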
2. Variety (Diversity)
1) Definition: Variety describes one of the biggest challenges of big data. It can be unstructured, and it can include many different types of data, from XML to video to SMS. Organizing the data in a meaningful way is no simple task, especially when the data itself changes rapidly. In other words: different sources + different types.
e.g. 1: I want to know the rating of a movie. I could get the data from IMDb and Rotten Tomatoes.
e.g. 2: I want to collect my friends' resumes; some friends send me their resumes as PDF files, others send theirs as Word documents.
2) Challenging:
- Data integration
  - Heterogeneous (data of all kinds): traditional data integration relies on schema mapping. As we learned in COMP9331/3331, we need to find the primary key that tells us how to link two entities.
  - Record linkage in varied data (links between different kinds of data): different records from different sources might relate to the same entities. How can we detect that? (A toy sketch follows this list.)
- Data curation
  - Data curation includes all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data.
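Since record linkage is where Variety gets concrete, here is a minimal, hypothetical sketch of the naive approach: compare records pairwise with a string-similarity score and flag likely matches. The field names, sample records, and threshold are all illustrative assumptions; real linkers use blocking, stronger similarity measures, or learned matchers.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1] (difflib's ratio; real systems
    use edit distance, token Jaccard, or trained models)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_records(source_a, source_b, threshold=0.8):
    """Return pairs of records believed to describe the same entity."""
    matches = []
    for ra in source_a:
        for rb in source_b:  # O(n*m) pairwise scan; real systems prune
            if similarity(ra["title"], rb["name"]) >= threshold:
                matches.append((ra, rb))
    return matches

# The two sources use different schemas ("title" vs "name") -- the
# heterogeneity that schema mapping has to resolve first.
imdb_like = [{"title": "The Shawshank Redemption", "rating": 9.3}]
rt_like   = [{"name": "Shawshank Redemption, The", "score": "91%"}]
print(link_records(imdb_like, rt_like))
```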
3. Velocity (rapid data flow and dynamic data systems)
- Definition: Velocity is the speed at which data is accessible. I remember the days of nightly batches; now, if it's not real-time, it's usually not fast enough.
- Challenging:
  - Batch processing: process the data quickly
  - Real-time processing: many types of data have a limited shelf life, where their value can erode with time, in some cases very quickly (a toy batch-vs-streaming sketch follows this list)
  - Transmission
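To make the batch vs. real-time distinction above concrete, here is a toy Python sketch (the event format and click counts are made up for illustration): batch waits for the whole data set, while streaming keeps a running answer so the data's value is used before it erodes.

```python
# Five toy click events; in a real pipeline these would arrive over time.
events = [{"clicks": c} for c in [3, 1, 4, 1, 5]]

def batch_process(all_events):
    """Batch: run once over the accumulated data set (e.g. nightly)."""
    return sum(e["clicks"] for e in all_events)

def stream_process(event, running_total):
    """Streaming: update the answer incrementally per event, so a result
    is available while the data still has value."""
    return running_total + event["clicks"]

# Batch answer, only available after everything has been collected:
print("batch total:", batch_process(events))

# Streaming answer, maintained continuously as events arrive:
total = 0
for e in events:
    total = stream_process(e, total)
    print("running total:", total)
```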