Course Info, General Introduction - Computer and Information Science
Dejing Dou
Week 1, March 31, 2014 @ CIS 607 Seminar
About this Seminar
Introduction Lectures
Week 1: General Introduction on Big Data and Deep Learning
Week 2: Surveys
Paper Presentations and Discussions
Week 3 to Week 10: Research Paper Presentations and Discussions:
State of the art of Big Data and Deep Learning approaches; potential
overlapping research when Big Data meets Deep Learning.
Final week: Review and Discussions
Evaluation
Attendance: 20%
However, 2 absences or 4 instances of lateness without excuse result in a Fail
Paper Reading and Discussion: 30%
Summary and Question Preparation (homework)
Asking Questions of the Paper Presenter or Instructor
Paper Presentation: 30%
35-40 Minute Presentation
10-15 Minutes of Question Answering
Final Report: 20%
Survey paper or a small implementation
Topics
Big Data
Real-life data are so large and so complex that it is difficult
to process them using traditional database and data
mining tools. The challenges include capture, curation,
storage, search, sharing, transfer, integration, analysis, and
visualization.
Deep Learning
A family of machine learning methods based on learning
representations of the data. Some representations make it
easier to learn tasks of interest from data examples.
When big data meets deep learning
Data structure, complexity, and representations.
Big Data Everywhere!
Lots of data is being collected and warehoused:
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks
How much data?
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN's Large Hadron Collider (LHC) generates 15 PB a year
"640K ought to be enough for anybody."
[LHC photo: Maximilien Brice, © CERN]
Type of Data
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can only scan the data once
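To make the single-scan constraint concrete, here is a minimal Python sketch of one-pass processing over a stream; the generator and the statistics kept are assumptions for illustration, not from the slides:

```python
def record_stream():
    """Stands in for an unbounded stream that can only be scanned once."""
    for value in [4.0, 7.5, 1.2, 9.9, 3.3]:
        yield value

# Keep only constant-size running statistics; no record is stored or revisited
count, total, maximum = 0, 0.0, float("-inf")
for v in record_stream():
    count += 1
    total += v
    maximum = max(maximum, v)

print(count, total / count, maximum)  # n, running mean, running max
```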
What to do with these data?
Aggregation and Statistics
Data warehouse and OLAP
Indexing, Searching, and Querying
Keyword based search
Pattern matching (XML/RDF)
Knowledge discovery
Data Mining
Statistical Modeling
OLAP and Data Mining
Warehouse Architecture
[Diagram: clients send queries and analysis requests to the warehouse (with its metadata); an integration layer feeds the warehouse from multiple data sources]
Aggregates
Operators: sum, count, max, min, median, avg
"Having" clause
Using the dimension hierarchy:
average by region (within store)
maximum by month (within date)
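As an illustration of the aggregate operators and the HAVING-style filtering above, a minimal pandas sketch over a made-up sales table (the column names and the threshold are assumptions):

```python
import pandas as pd

# Hypothetical sales records with a region level of the dimension hierarchy
sales = pd.DataFrame({
    "region": ["west", "west", "east", "east", "east"],
    "amount": [100.0, 150.0, 80.0, 120.0, 90.0],
})

# Aggregate operators applied per region
by_region = sales.groupby("region")["amount"].agg(
    ["sum", "count", "max", "min", "median", "mean"])

# "Having"-style clause: keep only regions whose total exceeds 260
print(by_region[by_region["sum"] > 260])
```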
What is Data Mining?
Discovery of useful, possibly unexpected, patterns in
data
Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
Exploration & analysis, by automatic or
semi-automatic means, of large quantities of data in
order to discover meaningful patterns
Data Mining Tasks
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Collaborative Filtering [Predictive]
Association Rule Mining
Sales records (market-basket data):
tran1   cust33   p2, p5, p8
tran2   cust45   p5, p8, p11
tran3   cust12   p1, p9
tran4   cust40   p5, p8, p11
tran5   cust12   p2, p9
tran6   cust12   p9
• Trend: Products p5, p8 often bought together
• Trend: Customer 12 likes product p9
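A minimal Python sketch of how support and confidence could be computed for a candidate rule such as {p5} -> {p8} over the toy transactions above (the choice of rule is just for illustration):

```python
# Toy market-basket data from the slide (each transaction is a set of products)
transactions = [
    {"p2", "p5", "p8"},
    {"p5", "p8", "p11"},
    {"p1", "p9"},
    {"p5", "p8", "p11"},
    {"p2", "p9"},
    {"p9"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {p5} -> {p8}: p5 and p8 appear together in 3 of 6 baskets,
# and every basket containing p5 also contains p8
print(support({"p5", "p8"}))       # 0.5
print(confidence({"p5"}, {"p8"}))  # 1.0
```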
Association Rule Discovery
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, … } --> {Potato Chips}
Potato Chips as consequent => Can be used to determine
what should be done to boost its sales.
Bagels in the antecedent => can be used to see which
products would be affected if the store discontinues
selling bagels.
Bagels in antecedent and Potato chips in consequent =>
Can be used to see what products should be sold with
Bagels to promote sale of Potato chips!
Supermarket shelf management.
Inventory Management
Other Types of Mining
Text mining: application of data mining to textual
documents
cluster Web pages to find related pages
cluster pages a user has visited to organize their visit
history
classify Web pages automatically into a Web directory
Graph Mining:
Deals with graph data
Mining Time-Series and Sequence Data
Time-series database
Consists of sequences of values or events changing with time
Data is recorded at regular intervals
Characteristic time-series components
Trend, cycle, seasonal, irregular
Applications
Financial: stock price, inflation
Biomedical: blood pressure
Meteorological: precipitation
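As a small illustration of separating the trend component from the seasonal and irregular components in a regularly sampled series, a numpy sketch with synthetic data (the series and the 12-point window are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(48)
# Synthetic monthly series: linear trend + yearly seasonal cycle + irregular noise
series = 0.5 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)

# A centered 12-point moving average smooths out the seasonal and irregular
# parts, leaving an estimate of the trend component
window = np.ones(12) / 12
trend = np.convolve(series, window, mode="valid")
print(trend[:5])
```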
Mining Time-Series and Sequence Data
[Figure: example time-series plot]
Similar time series analysis
VanEck International Fund
Fidelity Selective Precious Metal and Mineral Fund
Two similar mutual funds in different fund groups
Spatial Cluster Analysis
Mining clusters: k-means, k-medoids, hierarchical, density-based, etc.
Analysis of distinct features of the clusters
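A minimal numpy sketch of k-means clustering on 2-D spatial points (the data, k=2, and the fixed iteration count are assumptions; k-medoids, hierarchical, and density-based methods would replace the loop below):

```python
import numpy as np

# Hypothetical 2-D spatial points forming two loose groups
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

def kmeans(X, k, iters=20):
    # Start from k distinct data points as initial centers
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

centers, labels = kmeans(points, k=2)
print(centers)
```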
Similarity Search in Multimedia Data
Multimedia data: audio, image, video, sequence, and hypertext data. We focus on images here.
Description-based retrieval systems
Build indices and perform object retrieval based on image
descriptions, such as keywords, captions, size, and time of
creation
Labor-intensive if performed manually
Results are typically of poor quality if automated
Content-based retrieval systems
Support retrieval based on the image content, such as color
histogram, texture, shape, objects, and wavelet transforms
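A minimal numpy sketch of the content-based idea above: describe each image by a normalized color histogram and rank a small database by histogram distance (the random arrays stand in for real images):

```python
import numpy as np

rng = np.random.default_rng(0)

def color_histogram(image, bins=8):
    """Normalized per-channel color histogram, concatenated into one vector."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

# Random RGB arrays stand in for a query image and a small image database
query = rng.integers(0, 256, (64, 64, 3))
database = [rng.integers(0, 256, (64, 64, 3)) for _ in range(5)]

# Rank database images by L1 distance between histograms (smaller = more similar)
q = color_histogram(query)
dists = [np.abs(q - color_histogram(img)).sum() for img in database]
print(np.argsort(dists))
```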
Mining Multimedia Databases in
Why Graph Mining?
Graphs are ubiquitous (everywhere)
Chemical compounds (Cheminformatics)
Protein structures, biological pathways/networks (Bioinformatics)
Program control flow, traffic flow, and workflow analysis
XML databases, Web, and social network analysis
Graph is a general model
Trees, lattices, sequences, and items are degenerate graphs
Diversity of graphs
Directed vs. undirected, labeled vs. unlabeled (edges &
vertices), weighted, with angles & geometry (topological vs.
2-D/3-D)
Graph, Graph, Everywhere
Aspirin
Internet
Yeast protein interaction network
Co-author network
[Figure source: H. Jeong et al., Nature 411, 41 (2001)]
Communication networks
The Earth is developing an electronic nervous system, a network with diverse nodes and links:
- computers
- phone lines
- routers
- TV cables
- satellites
- EM waves
Communication networks: many non-identical components with diverse connections between them.
Society
Nodes: individuals
Links: social relationship
(family/work/friendship/etc.)
Social networks: Many individuals with
diverse social interactions between them.
[Figure: Hollywood co-star network rankings - #1 Rod Steiger, #2 Donald Pleasence, #3 Martin Sheen, ..., #876 Kevin Bacon]
Case 2: New York State Power Grid
Vertices: generators and substations
Edges: high-voltage power transmission lines and transformers
Line thickness and color indicate the voltage level
Red 765 kV, 500 kV; brown 345 kV; green 230 kV; grey 138 kV
World Wide Web
Nodes: WWW documents
Links: URL links
800 million documents
(S. Lawrence, 1999)
ROBOT: collects all URLs found in a document and follows them recursively
R. Albert, H. Jeong, A.-L. Barabási, Nature 401, 130 (1999)
Information on the Social Network
Heterogeneous, multi-relational data represented as a graph
or network
Nodes are objects
May have different kinds of objects
Objects have attributes
Objects may have labels or classes
Edges are links
May have different kinds of links
Links may have attributes
Links may be directed and are not required to be binary
Links represent relationships and interactions between objects - rich content for mining
Deep Learning
Why Deep Learning
Deep Belief Networks
Overview
Train neural networks with many layers (vs. “shallow”
nets with just one or at most two layers)
Multiple layers work to build an improved feature space
First layer learns 1st order features (e.g. edges in images)
2nd layer learns higher order features (combinations of first
layer features, combinations of edges, corners, etc.)
Early layers usually learn in an unsupervised mode and
discover general features of the input space – serving multiple
tasks related to the unsupervised instances (image recognition,
etc.)
Then final layer features are fed into supervised layer(s)
And the entire network is often subsequently fine-tuned using supervised training of the entire net, starting from the initial weightings learned in the unsupervised phase
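As a structural sketch only (random, untrained weights), the layered feature construction described above might look like this in numpy; the layer sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0, z)

x = rng.random(64)                       # raw input (e.g. a small image patch)

# Each layer re-represents the output of the layer below it
W1 = rng.normal(0, 0.1, (32, 64))        # 1st layer: low-order features
W2 = rng.normal(0, 0.1, (16, 32))        # 2nd layer: combinations of those
h1 = relu(W1 @ x)
h2 = relu(W2 @ h1)

# Final-layer features feed a supervised output layer (here a 10-way softmax)
W_out = rng.normal(0, 0.1, (10, 16))
scores = W_out @ h2
probs = np.exp(scores) / np.exp(scores).sum()
print(h1.shape, h2.shape, probs.shape)   # (32,) (16,) (10,)
```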
Deep Learning Tasks
Usually best when input space is locally structured – spatial
or temporal: images, language, etc. vs arbitrary input features
Image example: early vision layer
Successful Applications
Object Recognition, e.g., in Image Processing
Speech Recognition and Signal Processing
Natural Language Processing
Multi-Task and Transfer Learning
Why Deep Learning
Biological Plausibility – e.g., Visual Cortex
Håstad proof - problems which can be represented with a polynomial number of nodes with k layers may require an exponential number of nodes with k-1 layers (e.g., parity)
Highly varying functions can be efficiently represented with deep architectures
Fewer weights/parameters to update than a less efficient shallow representation (tradeoff)
Early Work
Fukushima (1980) – Neo-Cognitron
LeCun (1998) – Convolutional Neural Networks
Similarities to Neo-Cognitron
Many-layer perceptrons with backpropagation
Tried early but without much success
Very slow
Diffusion of gradient
Very recent work has shown significant accuracy improvements
by "patiently" training deeper MLPs with BP using fast
machines (e.g., GPUs)
Convolutional Neural Networks
Convolution Example
Each layer combines (merges, smoothes) patches from
previous layers
Typically tries to compress large data (images) into a
smaller set of robust features
Basic convolution can still create many features
Pooling – Pooling Example
This step compresses and smoothes the data
Usually takes the average or max value across disjoint
patches
Often convolution filters and pooling are hand crafted –
not learned, though tuning can occur
After this unsupervised convolution and pooling step, the new set of features is used to train a supervised model
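A minimal numpy sketch of the two steps described above on a small grayscale image: one hand-crafted 3x3 convolution filter followed by 2x2 max pooling over disjoint patches (the filter and image are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))          # small grayscale "image"

# Hand-crafted 3x3 vertical-edge filter (an assumed example)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# Valid convolution: slide the filter over the image and take dot products
kh, kw = kernel.shape
conv = np.array([[np.sum(image[i:i + kh, j:j + kw] * kernel)
                  for j in range(image.shape[1] - kw + 1)]
                 for i in range(image.shape[0] - kh + 1)])

# 2x2 max pooling over disjoint patches compresses and smooths the feature map
h, w = conv.shape
pooled = conv[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(conv.shape, pooled.shape)     # (6, 6) -> (3, 3)
```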
Convolutional Neural Network Examples
C layers are convolutions, S layers pool/sample
Training Deep Networks
Build a feature space
Note that this is what we do with trained hidden layers in
BP, but now we will build the feature space using deep
architectures
Unsupervised training between layers can decompose the
problem into distributed sub-problems (with higher levels
of abstraction) to be further decomposed at subsequent
layers
Greedy Layer-Wise Training
1. Train first layer using unlabeled data
Could do supervised or semi-supervised but usually just
use the larger amount of unlabeled data.
2. Then freeze the first layer parameters and start
training the second layer using the output of the
first layer as the unsupervised input to the second
layer
3. Repeat this for as many layers as desired
This builds our set of robust features
4. Use the outputs of the final layer as inputs to a supervised layer/model and train the last supervised layer(s), leaving the early weights frozen (a Python sketch of this procedure follows below)
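A minimal numpy sketch of the greedy layer-wise procedure, using a tiny tied-weight autoencoder as a stand-in for whatever unsupervised learner each layer uses (the DBN slides that follow use an RBM per layer instead); layer sizes and training settings are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder_layer(X, n_hidden, lr=0.1, epochs=200):
    """Train one unsupervised layer (a tiny tied-weight autoencoder) and
    return a function mapping inputs to the learned hidden representation."""
    n_in = X.shape[1]
    W = rng.normal(0, 0.1, (n_in, n_hidden))
    for _ in range(epochs):
        H = sigmoid(X @ W)            # encode
        R = H @ W.T                   # decode (tied weights, linear output)
        err = R - X                   # reconstruction error
        # Gradient of squared error w.r.t. W (encoder and decoder roles)
        dW = X.T @ (err @ W * H * (1 - H)) + err.T @ H
        W -= lr * dW / len(X)
    return lambda A: sigmoid(A @ W)

# Unlabeled data (assumed): 200 samples with 20 features
X = rng.random((200, 20))

# Greedy layer-wise: train each layer on the frozen output of the layer below
encode1 = train_autoencoder_layer(X, n_hidden=10)
H1 = encode1(X)
encode2 = train_autoencoder_layer(H1, n_hidden=5)
H2 = encode2(H1)

# H2 is the learned feature space; feed it to a supervised layer/model next
print(H2.shape)   # (200, 5)
```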
Deep Belief Networks (DBN)
Geoff Hinton (2006)
Uses Greedy layer-wise training but each layer is an RBM
(Restricted Boltzmann Machine)
RBM is a constrained Boltzmann machine with:
No lateral connections within the hidden (h) or visible (x) layers
Symmetric weights
Does not use annealing/temperature, but that is all right since each RBM is not seeking a global minimum, but rather an incremental transformation of the feature space
Typically uses probabilistic logistic nodes, but other activations are possible
RBM Sampling and Training
Initial state typically set to a training example x (can be real valued)
Sampling is an iterative back and forth process
P(h_i = 1 | x) = sigmoid(W_i x + c_i) = 1 / (1 + e^{-net(h_i)})   // c_i is the hidden node bias
P(x_i = 1 | h) = sigmoid(W'_i h + b_i) = 1 / (1 + e^{-net(x_i)})   // b_i is the visible node bias
Contrastive Divergence (CD-k): How much contrast (in the statistical
distribution) is there in the divergence from the original training
example to the relaxed version after k relaxation steps
Then update weights based on difference just like in Boltzmann
Typically just do CD-1 (Good empirical results)
Since small learning rate, doing many of these is similar to doing fewer versions
of CD-k with k > 1
Note CD-1 just needs to get the gradient direction right, which it usually does, and then change weights in that direction according to the learning rate
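A minimal numpy sketch of a single CD-1 update for one training example, following the sampling equations above; the sizes and learning rate are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small RBM: 6 visible units, 4 hidden units (assumed sizes)
n_vis, n_hid = 6, 4
W = rng.normal(0, 0.1, (n_hid, n_vis))   # symmetric weights
b = np.zeros(n_vis)                       # visible biases
c = np.zeros(n_hid)                       # hidden biases
lr = 0.1                                  # learning rate (assumed)

x0 = rng.integers(0, 2, n_vis).astype(float)  # one training example

# Positive phase: sample hidden units given the data
p_h0 = sigmoid(W @ x0 + c)
h0 = (rng.random(n_hid) < p_h0).astype(float)

# Negative phase: one reconstruction step, i.e. CD-1
p_x1 = sigmoid(W.T @ h0 + b)
x1 = (rng.random(n_vis) < p_x1).astype(float)
p_h1 = sigmoid(W @ x1 + c)

# Update toward the data statistics and away from the reconstruction's
W += lr * (np.outer(p_h0, x0) - np.outer(p_h1, x1))
b += lr * (x0 - x1)
c += lr * (p_h0 - p_h1)
```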
Deep Belief Network Training
Same Greedy Layer-wise approach
First train the lowest RBM (h0 - h1) using the RBM update algorithm
Freeze weights and train subsequent RBM layers
Then connect final outputs to a supervised model and train that model
Finally, unfreeze all weights and train the full DBN with the supervised model to fine-tune the weights
Conclusion
Much recent excitement, still much to be
discovered
Biological Plausibility
Potential for significant improvements
Good in structured/Markovian spaces
Important research question: To what extent can we
use Deep Learning in more arbitrary feature spaces?
Recent deep training of MLPs with BP shows
potential in this area