BIG DATA - Hong Kong University of Science and Technology


COMP 4332 / RMBI 4330
Big Data Mining (Spring 2017)
Lei Chen
Hong Kong University of Science and Technology
[email protected]
http://www.cse.ust.hk/~leichen
Topics
• Review of Basics
• Practical Data Mining
– Imbalanced Data
– Data Extraction
– Data Integration
– Data Cleaning/Repairing
– Text and Web Mining
– Recommendation
• Hands on: Major Projects
• Student Presentations
2017/3/21
Course Introduction
2
Outcome and Objective
• Students will know the current state of the art in Data Mining
• Students will be able to implement a practical data mining project
• Students will be able to present their ideas well
• Students will be prepared for PG study, internships, etc.
Projects: based on KDDCUPs
• Data Portal
– Movie Portal
– The movie data portal lets people search basic information about
movies, including movie name, cast, director, country, language,
duration, release date and type of movie. More importantly, the
portal integrates users’ comments (ratings) with box office trends,
which helps users distinguish whether positive (or negative)
comments come from real users or paid posters. With abundant data,
the system could even classify whether a movie hired paid posters
(e.g., a movie with a low box office but a large total number of
comments, or a large number of comments submitted in a short
period). That is meaningful when people view a movie’s ratings and
comments before deciding whether to watch it.
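The paid-poster rule of thumb described above (low box office combined with many comments, or with a burst of comments in a short period) could be sketched roughly as follows. This is only an illustrative heuristic: the `Movie` record, the function names and every threshold are hypothetical assumptions, not part of the course material, and a real project would tune or learn them from data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Movie:
    # Hypothetical record layout, assumed for this sketch only.
    title: str
    box_office: float              # gross revenue in any consistent unit
    comment_times: List[datetime]  # submission time of each user comment


def max_comments_in_window(times: List[datetime], window: timedelta) -> int:
    """Largest number of comments falling inside any sliding time window."""
    ordered = sorted(times)
    best = 0
    start = 0
    for end in range(len(ordered)):
        # Advance the left edge until the span fits inside `window`.
        while ordered[end] - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


def looks_promoted(
    movie: Movie,
    max_box_office: float = 1_000_000,            # assumed threshold
    min_comments: int = 500,                      # assumed threshold
    burst_window: timedelta = timedelta(hours=1),  # assumed threshold
    burst_size: int = 100,                        # assumed threshold
) -> bool:
    """Flag a low-grossing movie with many comments or a comment burst."""
    if movie.box_office >= max_box_office:
        return False
    many_comments = len(movie.comment_times) >= min_comments
    burst = max_comments_in_window(movie.comment_times, burst_window) >= burst_size
    return many_comments or burst
```

A flag from `looks_promoted` is a signal for further inspection, not proof that paid posters were hired.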
Important Sites
• Course Web Site
– http://www.cse.ust.hk/~leichen/comp4332
• TAs:
– Xuelin Ling, [email protected] (RMBI)
– Anand INASU CHITTILAPPILLY, [email protected] (CSE or others)
Prerequisites
• Statistics and Probability would help
– But they will be reviewed in class
• Machine Learning/Pattern Recognition would help
– We will review some of the most important algorithms
• One programming language
– We will teach new languages in the tutorial
Grading
• Midterm Exam: 30%
• Course Project: 30%
• Presentation: 10%
• Final Exam: 30%
Project
In this project, you are required to build a data portal that
integrates at least four data sources from the Web, and to conduct
data mining tasks on your portal. This is a group project; each group
can have at most 3 members.
• Feb 13th: Project abstract due (one page). You should state details
about data portal usage, data sources and team members.
• May 9th: Final Project Report due.
• Project Presentation Dates: April 19th and May 5th
Possible Project Topics
• Scholar and Publications
– Scholar, co-authors, paper title, publication venue, place, date
• Movies and Actors
– Actors, directors, budget, Oscar nominations …
• Musicians and Bands
– First name, last name, birth date, birth place, bands, albums …
• Books and Authors
– Title, author(s), number of pages, language, publisher, translator, …
• Geographic Data
– Countries, regions, cities, population, area, leader, GDP, ...
• Doctors and Specialty
– First name, last name, gender, medical degree, specialty, address, phone …
• Lawyers and Specialty
– First name, last name, gender, degree, specialty, address, phone …
Any other interesting usages proposed by your group are very welcome.
Phases of your Project
Phase I: Data Collection (Feb 4th, 2017 - Feb 20th, 2017)
Phase II: Entity Resolution (Feb 21st, 2017 - March 6th, 2017)
Phase III: Data Fusion (March 7th, 2017 - March 21st, 2017)
Phase IV: Mining on your Portal (March 22nd, 2017 - April 13th, 2017)
More info
• Textbooks:
– Listed on Course Website
– Buy them online if you wish
Internet Pictures Clips Maps News Shop Email more
BIG DATA
Challenges & Opportunities
Lei Chen
Outline
Background
“Big data” is a term acknowledging the exponential growth,
availability and use of …
Challenges
“Big data” poses grand challenges for data capture, storage,
analysis …
Opportunities
Many applications can benefit from “Big data” …
Background
We are capturing more data
Super-exponential growth in data volume
Satellite imagery, mobile stations, distributed sensor networks,
geographical plotting …
Copyright belongs to “Data Analysis Challenges”, JSR-08-142, Dec
Background
We are using more data
Intelligent transportation
Digital health care
Background
We need quick processing of the data
Volcano monitoring
Hurricane path prediction
Background
We are exploring the unknowns with
different means of data measurements
Exploring the universe
Ocean science
Background
We are discovering new rules from data
The well-formed.eigenfactor project visualizes information flow in
science. This diagram shows the citation links of the journal Nature.
Copyright belongs to http://wellformed.eigenfactor.org
Background
Defining Big Data
Wiki: Big data are datasets that grow so large that they become
awkward to work with using on-hand database management tools.
Difficulties include capture, storage, search, sharing, analytics and
visualization.
Gartner (2011): Big data is a popular term used to acknowledge the
exponential growth, availability and use of information in the
data-rich landscape of tomorrow.
Background
Features of Big Data
3V: Variety, Velocity and Volume
Challenges
Layers of the big data stack (diagram):
• Applications
• Data Processing (processing language, optimization, visualization)
• Data Model (interpretation, representation)
– <key,vals>, Object, E-R, Hierarchical
• Storage (reliability, scalability, availability)
• Data Extraction (acquisition, integration, representation)
• Network Topology
Challenges
Data model challenges
• Volume
– Scale up, scale out, and scale in
• Velocity
– “Interactive” properties to facilitate processing
• Variety
– Simple but unified to adapt to heterogeneity
Existing data models are not satisfactory (functionality vs. simplicity):
– <key,vals>
– Object
– E-R
– Hierarchical
Challenges
Storage challenges
Storage concerns:
• Reliability: data is safe and trustworthy
• Availability: data is accessible
• Scalability: data operation performance does not degrade as data
size grows
However, the CAP theorem is the bottleneck. No one-size-fits-all
solution exists.
Challenges
Storage challenges
CAP Theorem
• Consistency
• Availability
• Partition tolerance
Challenges
Storage challenges
ACID vs. BASE
• RDBMS (ACID): Atomic, Consistent, Isolated, Durable
• NoSQL (BASE): Basically Available, Soft-state, Eventually consistent
Systems placed on the CAP triangle (corners C, A, P):
• RDBMS
• BigTable, HyperTable, HBase, MongoDB, Redis, Scalaris etc.
• Dynamo, CouchDB, Cassandra, SimpleDB, Tokyo Cabinet, Riak, Voldemort etc.
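The BASE idea of "soft state" and "eventual consistency" can be illustrated with a toy, single-process sketch. Everything here is an assumption made for illustration: the `Replica` class, its per-key version counter and the `merge_from` sync step are hypothetical and do not describe how any of the systems listed above work internally. The sketch also assumes only one replica accepts writes for a given key, so version numbers never tie.

```python
class Replica:
    """Toy key-value replica; each value carries a version counter."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value):
        # Accept the write locally and bump the key's version.
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def read(self, key):
        # Reads are always answered locally, so they may be stale.
        entry = self.data.get(key)
        return entry[1] if entry else None

    def merge_from(self, other):
        # Background sync: keep whichever copy has the higher version.
        for key, (version, value) in other.data.items():
            if key not in self.data or self.data[key][0] < version:
                self.data[key] = (version, value)


a, b = Replica("a"), Replica("b")
a.write("x", 1)
stale = b.read("x")   # b has not synced yet, so it has no value for "x"
b.merge_from(a)
fresh = b.read("x")   # after syncing, b agrees with a
```

Both replicas stay available for reads and writes at all times; the price is that a read between the write and the sync can return old data.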
Challenges
Management challenges
“Solving 'Big Data' Challenge Involves More Than Just Managing
Volumes of Data” (Gartner, 2011)
Big data management
• Functionality: indexing & partitioning
• Flexibility: adaptation to new requirements and new components
Challenges
Management challenges
E.g., Indexing over big data
• Volume
– Large volume of data captured every time unit
– Requires a distributed adaptive index
– Leads to significant cost on metadata exchange
• Variety
– Data captured from different sources
– Requires a distributed adaptive index
– Leads to ambiguity in indexing the same object
Challenges
Challenges on processing
• New query language (algebra)
Desired feature: sacrifices & overhead
– Flexibility: complexity in data modeling
– “Relational” support: poor scalability
– “Uncertain” support: poor scalability and significant computing overhead
– Scalability: less functionality
– Efficiency & effectiveness: poor scalability
Challenges
Challenges on processing
• New computing paradigm for processing
Distributed computing paradigm: limitations
– Message Passing: poor scalability and fault tolerance
– Unified Access: invalidated efficiency over large computing nodes
– MapReduce: poor functionality
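For readers new to MapReduce, the programming model can be sketched as a single-process word count. This is only a simulation of the map, shuffle and reduce steps in plain Python; the function names are hypothetical and it does not use the API of Hadoop or any other real framework. It also hints at the "poor functionality" point: every job has to be phrased as key-value pairs flowing through these fixed stages.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster the map and reduce calls would run on many nodes;
    # here they simply run one after another in the same process.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big challenges"])
```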
Challenges
Challenges on processing
• New optimization methodology
Load Balance
Data Locality
High Parallelism
Merging Cost
Less Network I/O
Replicated Computing
Opportunities
• We are empowered to learn knowledge and process information more
accurately, effectively and efficiently.
Why “Big Data”?
Big Data for:
• Natural Science Study
• Fundamental Scientific Research
• Social Civilization
• Daily Life
Opportunities
Big Data for natural science study
• E.g., natural disaster forecasting and management
– Disasters: flood, earthquake, extreme weather
– Data: meteorological data; geographic data; population,
transportation and urban design data; economic data
– Tasks: forecasting, management
Opportunities
Big Data for fundamental scientific research
• E.g., bioinformatics and medicine
Gene technology and clinical medicine mutually promote each other
Opportunities
Big Data for social civilization
• Light-speed information spreading & enormous knowledge
• Quick event detection
• Easy collaboration
Wondering where to get a really good cup of coffee?
JUST tweet your question!!
Opportunities
Big Data for daily life
• Our life can be much easier with more data… E.g., trip planning
– Request: travel to Beijing, 3-day stay, budget < $1000
– Predefined: Forbidden City, 10am meeting every day
– Real-world incidents: traffic jam, luggage delay, bad weather
– Adaptive agenda: updated as incidents occur
Opportunities
Opportunity highlights
• Volume
o Capturing, storing and analyzing data helps us better
understand the world
• Velocity
o Guaranteed effective & efficient data processing
• Variety
o Handling heterogeneous sources of data
Considering all the challenges and constraints, perhaps
there is no one-size-fits-all solution.
However, application-dependent “Big Data” solutions are
promising.
Opportunities
Applications
Heterogeneous data management
• Search doctors
• Search universities (ongoing)
Search Doctors (diagram):
• Data sources
– Web pages on the Internet
– Hospital databases
– Search results from general-purpose search engines
– News / rumors
– …
• Data Extraction
• Data Integration
• Integrated Database: ~500,000 doctors & ~30,000 hospitals from
50+GB source
• OLAP Query Processing
Example of Data Portal
• Smart City Project
http://lccpu6.cse.ust.hk/index.php/smartcity/