Course Info, General Introduction - Computer and Information Science

Dejing Dou
Week 1, March 31, 2014 @ CIS 607 Seminar
1
About this Seminar
 Introduction Lectures
 Week 1: General Introduction to Big Data and Deep Learning
 Week 2: Surveys
 Paper Presentations and Discussions
 Week 3 to Week 10: Research Paper Presentations and Discussions:
state of the art of Big Data and Deep Learning approaches; potential
overlapping research where Big Data meets Deep Learning.
 Final week: Review and Discussions
2
Evaluation
 Attendance: 20%
 However, 2 absences or 4 instances of lateness without an excuse → Fail
 Paper Reading and Discussion: 30%
 Summary and Question Preparations (homework)
 Asking Questions to Paper Presenter or Instructor
 Paper Presentation: 30%
 35-40 Minutes Presentation
 10-15 Minutes Question Answering
 Final Report: 20%
 Survey paper or a small implementation
3
Topics
 Big Data
 Real life data are so large and so complex that it is difficult
to process them using traditional database and data
mining tools. The challenges include capture, curation,
storage, search, sharing, transfer, integration, analysis, and
visualization
 Deep Learning
 A branch of machine learning methods based on
learning representations of data. Some representations make it
easier to learn tasks of interest from data examples.
 When big data meets deep learning
 Data structure, complexity, and representations.
4
Big Data Everywhere!
 Lots of data is being collected
and warehoused
 Web data, e-commerce
 purchases at department/
grocery stores
 Bank/Credit Card
transactions
 Social Network
How much data?
 Google processes 20 PB a day (2008)
 Wayback Machine has 3 PB + 100 TB/month
(3/2009)
 Facebook has 2.5 PB of user data + 15 TB/day
(4/2009)
 eBay has 6.5 PB of user data + 50 TB/day (5/2009)
 CERN’s Large Hadron Collider (LHC) generates 15 PB
a year
“640K ought to be enough for anybody.”
[Photo: Maximilien Brice, © CERN]
Types of Data
 Relational Data (Tables/Transaction/Legacy Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
 Social Network, Semantic Web (RDF), …
 Streaming Data
 You can only scan the data once
What to do with these data?
 Aggregation and Statistics
 Data warehouse and OLAP
 Indexing, Searching, and Querying
 Keyword based search
 Pattern matching (XML/RDF)
 Knowledge discovery
 Data Mining
 Statistical Modeling
OLAP and Data Mining
Warehouse Architecture
[Diagram: clients issue queries & analysis against the warehouse and its metadata; an integration layer feeds the warehouse from multiple sources]
11
Aggregates
 Operators: sum, count, max, min, median, avg
 “Having” clause
 Using the dimension hierarchy (see the sketch below)
 average by region (within store)
 maximum by month (within date)
12
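The dimension-hierarchy aggregates above can be sketched in a few lines of pandas; the region/store/month/sales table and its column names below are invented for illustration, not data from the course.

```python
# Minimal sketch of OLAP-style aggregates with pandas.
# The table and column names (region, store, month, sales) are illustrative only.
import pandas as pd

sales = pd.DataFrame({
    "region": ["West", "West", "East", "East"],
    "store":  ["S1",   "S2",   "S3",   "S3"],
    "month":  ["Jan",  "Jan",  "Jan",  "Feb"],
    "sales":  [100,    150,    80,     120],
})

# average by region (rolling the store level up to its parent in the hierarchy)
avg_by_region = sales.groupby("region")["sales"].mean()

# maximum by month (rolling individual dates up to the month level)
max_by_month = sales.groupby("month")["sales"].max()

# SQL-style HAVING: keep only groups whose aggregate passes a filter
big_regions = sales.groupby("region")["sales"].sum()
big_regions = big_regions[big_regions > 200]

print(avg_by_region, max_by_month, big_regions, sep="\n")
```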
What is Data Mining?
 Discovery of useful, possibly unexpected, patterns in
data
 Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
 Exploration & analysis, by automatic or
semi-automatic means, of large quantities of data in
order to discover meaningful patterns
Data Mining Tasks
 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Sequential Pattern Discovery [Descriptive]
 Regression [Predictive]
 Deviation Detection [Predictive]
 Collaborative Filtering [Predictive]
Association Rule Mining
sales records (market-basket data):

transaction  customer  products
tran1        cust33    p2, p5, p8
tran2        cust45    p5, p8, p11
tran3        cust12    p1, p9
tran4        cust40    p5, p8, p11
tran5        cust12    p2, p9
tran6        cust12    p9
• Trend: Products p5, p8 are often bought together
• Trend: Customer 12 likes product p9
(both trends are recomputed in the sketch below)
15
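A minimal, library-free Python sketch that recomputes the two trends above from the reconstructed market-basket table by counting support and confidence; the helper names are ours, not a standard API.

```python
# Support/confidence over the market-basket data shown above.
transactions = {
    "tran1": ("cust33", {"p2", "p5", "p8"}),
    "tran2": ("cust45", {"p5", "p8", "p11"}),
    "tran3": ("cust12", {"p1", "p9"}),
    "tran4": ("cust40", {"p5", "p8", "p11"}),
    "tran5": ("cust12", {"p2", "p9"}),
    "tran6": ("cust12", {"p9"}),
}

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for _, items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Trend 1: p5 and p8 are often bought together
print(support({"p5", "p8"}))           # 0.5
print(confidence({"p5"}, {"p8"}))      # 1.0

# Trend 2: customer 12 likes product p9
cust12 = [items for cust, items in transactions.values() if cust == "cust12"]
print(sum("p9" in items for items in cust12) / len(cust12))   # 1.0
```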
Association Rule Discovery
 Marketing and Sales Promotion:
 Let the rule discovered be
{Bagels, … } --> {Potato Chips}
 Potato Chips as consequent => Can be used to determine
what should be done to boost its sales.
 Bagels in the antecedent => can be used to see which
products would be affected if the store discontinues
selling bagels.
 Bagels in antecedent and Potato chips in consequent =>
Can be used to see what products should be sold with
Bagels to promote sale of Potato chips!
 Supermarket shelf management.
 Inventory Management
Other Types of Mining
 Text mining: application of data mining to textual
documents
 cluster Web pages to find related pages
 cluster pages a user has visited to organize their visit
history
 classify Web pages automatically into a Web directory
 Graph Mining:
 Deal with graph data
17
Mining Time-Series and Sequence Data
 Time-series database
 Consists of sequences of values or events changing with time
 Data is recorded at regular intervals
 Characteristic time-series components: trend, cycle, seasonal, irregular
 Applications
 Financial: stock price, inflation
 Biomedical: blood pressure
 Meteorological: precipitation
18
Mining Time-Series and Sequence Data
Time-series plot
19
Similar time series analysis
VanEck International Fund
Fidelity Selective Precious Metal and Mineral Fund
Two similar mutual funds in different fund groups (see the sketch below)
20
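One common way to quantify "similar time series" is to z-normalize each series and compare shapes; the sketch below uses synthetic numpy series standing in for the fund data in the figure, so the numbers are illustrative only.

```python
# Similar time-series analysis sketch: z-normalize, then compare shapes.
# The two series below are synthetic stand-ins, not the fund data in the figure.
import numpy as np

t = np.arange(250)
fund_a = 10 + 0.02 * t + np.sin(t / 10.0)        # trend + cycle
fund_b = 55 + 0.10 * t + 5.0 * np.sin(t / 10.0)  # same shape, different scale

def znorm(x):
    return (x - x.mean()) / x.std()

# Small distance after normalization => similar shape despite different price levels
distance = np.linalg.norm(znorm(fund_a) - znorm(fund_b))
correlation = np.corrcoef(fund_a, fund_b)[0, 1]
print(distance, correlation)
```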
Spatial Cluster Analysis
 Mining clusters: k-means, k-medoids, hierarchical, density-based, etc. (a sketch follows below)
 Analysis of the distinct features of the clusters
21
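A minimal spatial-clustering sketch, assuming scikit-learn's KMeans and made-up 2-D point coordinates; it is not the analysis behind the slide's figure.

```python
# Spatial cluster analysis sketch: k-means on synthetic 2-D locations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Three made-up "hot spots" of points on a plane
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)      # distinct features: the cluster centroids
print(np.bincount(kmeans.labels_))  # cluster sizes
```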
Similarity Search in Multimedia Data
 Multimedia data: audio, image, video, sequence, and
hypertext data. Here we focus on images.
 Description-based retrieval systems
 Build indices and perform object retrieval based on image
descriptions, such as keywords, captions, size, and time of
creation
 Labor-intensive if performed manually
 Results are typically of poor quality if automated
 Content-based retrieval systems
 Support retrieval based on the image content, such as color
histogram, texture, shape, objects, and wavelet transforms (a sketch follows below)
22
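A sketch of the simplest content-based descriptor named above, a color histogram, compared with histogram intersection; the "images" are random arrays standing in for real photos, and the helper functions are illustrative, not a library API.

```python
# Content-based retrieval sketch: compare images by color histogram.
import numpy as np

def color_histogram(img, bins=8):
    """Per-channel histogram, normalized so the bins sum to 1."""
    hist = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    hist = np.concatenate(hist).astype(float)
    return hist / hist.sum()

def intersection(h1, h2):
    """Histogram intersection similarity: 1.0 means identical distributions."""
    return np.minimum(h1, h2).sum()

rng = np.random.RandomState(0)
query = rng.randint(0, 256, size=(64, 64, 3))
database = [rng.randint(0, 256, size=(64, 64, 3)) for _ in range(5)]

scores = [intersection(color_histogram(query), color_histogram(img)) for img in database]
best = int(np.argmax(scores))
print(best, scores[best])
```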
Mining Multimedia Databases
23
Why Graph Mining?
 Graphs are ubiquitous (everywhere)
 Chemical compounds (Cheminformatics)
 Protein structures, biological pathways/networks
(Bioinformatics)
 Program control flow, traffic flow, and workflow analysis
 XML databases, Web, and social network analysis
 Graph is a general model
 Trees, lattices, sequences, and items are degenerate graphs
 Diversity of graphs
 Directed vs. undirected, labeled vs. unlabeled (edges &
vertices), weighted, with angles & geometry (topological vs.
2-D/3-D)
24
Graph, Graph, Everywhere
[Figures: an aspirin molecule, the Internet, a yeast protein interaction network, and a co-author network; from H. Jeong et al., Nature 411, 41 (2001)]
25
Communication networks
The Earth is developing an electronic nervous system,
a network with diverse nodes and links:
- computers
- phone lines
- routers
- TV cables
- satellites
- EM waves
Communication networks: many non-identical components with diverse connections between them.
26
Society
Nodes: individuals
Links: social relationships
(family/work/friendship/etc.)
Social networks: Many individuals with
diverse social interactions between them.
27
[Figure: Hollywood actor network rankings, including #1 Rod Steiger, #2 Donald Pleasence, #3 Martin Sheen, and #876 Kevin Bacon]
28
Case 2: New York State Power Grid
 Vertices: generators and substations
 Edges: high-voltage power transmission lines and transformers
 Line thickness and color indicate the voltage level
 Red 765 kV, 500 kV; brown 345 kV; green 230 kV; grey 138 kV
29
World Wide Web
Nodes: WWW documents
Links: URL links
800 million documents
(S. Lawrence, 1999)
ROBOT: collects all URLs found in a document and follows them recursively
R. Albert, H. Jeong, A.-L. Barabási, Nature 401, 130 (1999)
30
Information on the Social Network
 Heterogeneous, multi-relational data represented as a graph
or network
 Nodes are objects
 May have different kinds of objects
 Objects have attributes
 Objects may have labels or classes
 Edges are links
 May have different kinds of links
 Links may have attributes
 Links may be directed, and are not required to be binary
 Links represent relationships and interactions between
objects - rich content for mining (a small sketch follows below)
31
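One way to sketch such a heterogeneous, multi-relational graph is with networkx node and edge attributes; the people, paper, and relation names below are invented for illustration.

```python
# Sketch of a heterogeneous, multi-relational graph with node and edge attributes.
# Node names and relation types are invented for illustration.
import networkx as nx

G = nx.DiGraph()   # directed links

# Nodes are objects of different kinds, with attributes and class labels
G.add_node("alice", kind="person", age=30, label="researcher")
G.add_node("bob", kind="person", age=35, label="engineer")
G.add_node("kdd_paper", kind="paper", year=2013)

# Edges are links of different kinds, with their own attributes
G.add_edge("alice", "bob", relation="friend", since=2009)
G.add_edge("alice", "kdd_paper", relation="authored")
G.add_edge("bob", "kdd_paper", relation="cited", weight=1)

# Mining-style queries: who authored what, and what do we know about a node
authors = [u for u, v, d in G.edges(data=True) if d["relation"] == "authored"]
print(authors)
print(G.nodes["alice"])
```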
Deep Learning
 Why Deep Learning
 Deep Belief Networks
32
Overview
 Train neural networks with many layers (vs. “shallow”
nets with just one or at most two layers)
 Multiple layers work to build an improved feature space
 First layer learns 1st order features (e.g. edges in images)
 2nd layer learns higher order features (combinations of first
layer features, combinations of edges, corners, etc.)
 Early layers usually learn in an unsupervised mode and
discover general features of the input space – serving multiple
tasks related to the unsupervised instances (image recognition,
etc.)
 Then the final layer's features are fed into supervised layer(s)
 And the entire network is often subsequently fine-tuned using
supervised training of the whole net, starting from the initial
weights learned in the unsupervised phase
33
Deep Learning Tasks
 Usually best when input space is locally structured – spatial
or temporal: images, language, etc. vs arbitrary input features
 Images Example: early vision layer
34
Successful Applications
 Object Recognition, e.g., in Image Processing
 Speech Recognition and Signal Processing
 Natural Language Processing
 Multi-Task and Transfer Learning
35
36
Why Deep Learning
 Biological Plausibility – e.g., Visual Cortex
 Håstad's proof - problems that can be represented with a
polynomial number of nodes using k layers may require
an exponential number of nodes with k-1 layers (e.g.,
parity)
 Highly varying functions can be efficiently represented
with deep architectures
 Fewer weights/parameters to update than in a less efficient
shallow representation (tradeoff)
37
Early Work
 Fukushima (1980) – Neo-Cognitron
 LeCun (1998) – Convolutional Neural Networks
 Similarities to Neo-Cognitron
 Many-layer perceptrons with backpropagation
 Tried early but without much success:
 Very slow
 Diffusion of gradient
 Very recent work has shown significant accuracy improvements
by "patiently" training deeper MLPs with BP using fast
machines (e.g., GPUs)
38
Convolutional Neural Networks
 Convolution Example
 Each layer combines (merges, smoothes) patches from
previous layers
 Typically tries to compress large data (images) into a
smaller set of robust features
 Basic convolution can still create many features
 Pooling – Pooling Example
 This step compresses and smoothes the data
 Usually takes the average or max value across disjoint
patches
 Often convolution filters and pooling are hand-crafted -
not learned, though tuning can occur
 After this unsupervised convolution step, the new set of
features is used to train a supervised model (a sketch of convolution and pooling follows below)
39
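A minimal numpy sketch of the two operations above: a valid 2-D convolution with a hand-crafted edge filter, followed by max pooling over disjoint 2x2 patches. It is not a full LeNet-style network and learns nothing.

```python
# Convolution + pooling sketch in plain numpy (hand-crafted filter, no learning).
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (really cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Max over disjoint size x size patches (crops any ragged border)."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    patches = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return patches.max(axis=(1, 3))

rng = np.random.RandomState(0)
image = rng.rand(8, 8)
edge_filter = np.array([[1.0, 0.0, -1.0],
                        [2.0, 0.0, -2.0],
                        [1.0, 0.0, -1.0]])   # hand-crafted Sobel-like filter

features = max_pool(conv2d(image, edge_filter))
print(features.shape)   # (3, 3): 8x8 -> 6x6 after conv -> 3x3 after pooling
```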
Convolutional Neural Network Examples
C layers are convolutions, S layers pool/sample
40
Training Deep Networks
 Build a feature space
 Note that this is what we do with trained hidden layers in
BP, but now we will build the feature space using deep
architectures
 Unsupervised training between layers can decompose the
problem into distributed sub-problems (with higher levels
of abstraction) to be further decomposed at subsequent
layers
41
Greedy Layer-Wise Training
1. Train the first layer using unlabeled data
 Could do supervised or semi-supervised, but usually just
use the larger amount of unlabeled data
2. Then freeze the first layer parameters and start
training the second layer, using the output of the
first layer as the unsupervised input to the second
layer
3. Repeat this for as many layers as desired
 This builds our set of robust features
4. Use the outputs of the final layer as inputs to a
supervised layer/model and train the last
supervised layer(s), leaving the early weights frozen
(a sketch of this recipe follows below)
42
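One way to sketch this recipe is with scikit-learn: a Pipeline fits each BernoulliRBM on the output of the already-trained layer below it, then trains a logistic-regression layer on top. The digits dataset and all hyperparameters are placeholders, and there is no joint fine-tuning step here.

```python
# Greedy layer-wise sketch: two RBM feature layers, then a supervised classifier.
# Dataset and hyperparameters are placeholders, not the seminar's own setup.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X = X / 16.0   # scale pixel values to [0, 1] for the RBMs

model = Pipeline([
    # Layer 1: unsupervised features of the raw input
    ("rbm1", BernoulliRBM(n_components=128, learning_rate=0.06, n_iter=10, random_state=0)),
    # Layer 2: unsupervised features of layer 1's output (layer 1 is already fit)
    ("rbm2", BernoulliRBM(n_components=64, learning_rate=0.06, n_iter=10, random_state=0)),
    # Final supervised layer trained on the learned feature space
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X, y)            # Pipeline fits each stage in order: greedy, layer by layer
print(model.score(X, y))   # training accuracy of the stacked model
```

The Pipeline never revisits earlier layers, so this corresponds to steps 1-4 only; the later fine-tuning step described for DBNs would require unfreezing all weights and training the whole stack with backpropagation.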
Deep Belief Networks (DBN)
 Geoff Hinton (2006)
 Uses Greedy layer-wise training but each layer is an RBM
(Restricted Boltzmann Machine)
 RBM is a constrained
Boltzmann machine with
 No lateral connections between
hidden (h) and visible (x) nodes
 Symmetric weights
 Does not use annealing/temperature, but that is all right
since each RBM is not seeking a global minimum, but rather
an incremental transformation of the feature space
 Typically uses probabilistic logistic node, but other
activations possible
43
RBM Sampling and Training
 Initial state typically set to a training example x (can be real valued)
 Sampling is an iterative back-and-forth process
 P(h_i = 1 | x) = sigmoid(W_i x + c_i) = 1 / (1 + e^(-net(h_i)))   // c_i is the hidden node bias
 P(x_i = 1 | h) = sigmoid(W'_i h + b_i) = 1 / (1 + e^(-net(x_i)))   // b_i is the visible node bias
 Contrastive Divergence (CD-k): how much contrast (in the statistical
distribution) there is in the divergence from the original training
example to the relaxed version after k relaxation steps
 Then update weights based on the difference, just like in a Boltzmann machine
 Typically just do CD-1 (good empirical results)
 Since the learning rate is small, doing many CD-1 updates is similar to doing
fewer CD-k updates with k > 1
 Note CD-1 just needs to get the gradient direction right, which it usually does,
and then change the weights in that direction according to the learning rate
(a sketch of one CD-1 update follows below)
44
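A compact numpy sketch of a single CD-1 update following the sampling equations above; the layer sizes, learning rate, and random "training example" are arbitrary.

```python
# One CD-1 weight update for an RBM, following the sampling equations above.
# Sizes, learning rate, and the fake training example are arbitrary.
import numpy as np

rng = np.random.RandomState(0)
n_visible, n_hidden, lr = 6, 3, 0.1

W = 0.01 * rng.randn(n_hidden, n_visible)   # symmetric weights (used as W and W')
b = np.zeros(n_visible)                     # visible biases
c = np.zeros(n_hidden)                      # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x0 = rng.randint(0, 2, size=n_visible).astype(float)   # initial state = a training example

# Positive phase: sample h0 ~ P(h = 1 | x0) = sigmoid(W x0 + c)
p_h0 = sigmoid(W @ x0 + c)
h0 = (rng.rand(n_hidden) < p_h0).astype(float)

# Negative phase (one relaxation step): reconstruct x1, then recompute hidden probabilities
p_x1 = sigmoid(W.T @ h0 + b)        # P(x = 1 | h) = sigmoid(W' h + b)
x1 = (rng.rand(n_visible) < p_x1).astype(float)
p_h1 = sigmoid(W @ x1 + c)

# CD-1 update: difference between data-driven and reconstruction-driven statistics
W += lr * (np.outer(p_h0, x0) - np.outer(p_h1, x1))
b += lr * (x0 - x1)
c += lr * (p_h0 - p_h1)
print(W.round(3))
```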
45
Deep Belief Network Training
 Same Greedy Layer-wise approach
 First train lowest RBM (h0 – h1) using
RBM update algorithm
 Freeze weights and train subsequent
RBM layers
 Then connect final outputs to a
supervised model and train that model
 Finally, unfreeze all weights, and train the
full DBN with the supervised model to fine-tune the weights
46
47
Conclusion
 Much recent excitement, still much to be
discovered
 Biological Plausibility
 Potential for significant improvements
 Good in structured/Markovian spaces
 Important research question: To what extent can we
use Deep Learning in more arbitrary feature spaces?
 Recent deep training of MLPs with BP shows
potential in this area
48