
Welcome to RPI CS!
Petros Drineas (Assistant Professor)
Theory Group
Professors: Mark Goldberg
Associate Professors: Daniel Freedman, Mukkai Krishnamoorthy, Malik Magdon-Ismail, Bulent Yener.
Assistant Professors: Elliot Anshelevich, Sanmay Das.
Overview
• (My) Research Agenda
  • Massive Datasets
  • A glimpse of my work
  • Future directions
• (Theory Group?) Teaching Agenda
Shameless advertising
We offer all the courses that you need in order to get a solid background in TCS.
We use really cool (and sometimes hard) math in our courses!
Massive Datasets appear everywhere
Just a few examples:
• The World Wide Web. Too much junk, but Google usually saves the day!
• Document Databases. IBM estimated that its databases contained more than 3·10^8 documents.
• AT&T call-detail records. These include calling number, called number, time of day, and length of call for 260 million calls/day.
Queries on Massive Datasets
What kind of queries do we want to perform on Massive Datasets?
• World Wide Web: search for “Jaguar”, but separate the web pages that refer to the animal from the ones that refer to the car.
• Document Databases: find documents that are similar to a given document.
• AT&T call-detail records: distinguish business accounts from residential accounts.
MANY, MANY, MANY MORE EXAMPLES/QUERIES! Much more research is definitely needed.
A quote from Christos Papadimitriou:
“CS might not have produced an implant chip that increases our intelligence, but we
produced Google, which increases our IQ just as efficiently.”
Why is it difficult?
Imagine sorting 1 trillion numbers, stored in secondary memory (no random access
there!). Thus, no QuickSort, MergeSort, etc. How do you do it fast?
Not surprisingly, it is not easy! The Massive Dataset paradigm motivates looking
for new algorithms, new complexity classes, new lower bounds, new proof
techniques, etc.
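One classic approach that respects these constraints is external merge sort: it touches the data only through sequential reads and writes. A minimal Python sketch, assuming one newline-terminated number per line and that a chunk of about a million lines fits in RAM (the function name and chunk size are illustrative):

```python
import heapq
import tempfile
from itertools import islice

def external_sort(input_path, output_path, chunk_size=1_000_000):
    """Sort a file of numbers (one per line, each line newline-terminated)
    that is far too large to fit in RAM.

    Pass 1: read the input sequentially in chunks that do fit in memory,
    sort each chunk, and write it out as a sorted "run" on disk.
    Pass 2: k-way merge all runs with a heap, again reading every run
    strictly sequentially; no random access to the data is ever needed.
    """
    runs = []
    with open(input_path) as f:
        while True:
            chunk = list(islice(f, chunk_size))
            if not chunk:
                break
            chunk.sort(key=float)
            run = tempfile.TemporaryFile("w+")   # one sorted run on disk
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*runs, key=float))
```

Both passes stream over the data, which is exactly the kind of access pattern the secondary-memory setting allows.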
5 things to remember about Massive Datasets:
• They are not readily accessible.
• They are stored in secondary memory, and we only have sequential access to the data.
• We do not want to read the data too many times! Equivalently, we want to make a small number of passes over the data.
• Sometimes, we are only allowed to make one pass over the data (the streaming model; a sketch of a one-pass algorithm follows this list).
• Even worse, in certain applications we are not even allowed to read the full data once (sublinear algorithms).
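To make the one-pass (streaming) constraint concrete, here is a minimal sketch of reservoir sampling, a classic streaming primitive that maintains a uniform random sample while reading each item exactly once (the function name and the toy stream of "records" are illustrative):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream,
    touching each item exactly once (a single sequential pass)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)      # uniform slot in {0, ..., i}
            if j < k:
                reservoir[j] = item       # keep item with probability k/(i+1)
    return reservoir

# Example: a uniform sample of 5 records out of a stream of 10^6 records.
print(reservoir_sample(range(10**6), 5))
```

The memory used is proportional to k, not to the length of the stream, which is the whole point in the massive-dataset setting.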
My work
In many applications, data appear as large matrices.
• We can make a few “passes” through the matrices.
• We can create and store a small “sketch” of the matrices in RAM.
• Computing the “sketch” should be a very fast process.
• Finally, discard the original matrix and work with the “sketch”.
1. A “sketch” consisting of a few rows/columns of the matrix is adequate for efficient approximations.
2. We draw the rows/columns randomly, using adaptive sampling; e.g., we pick rows/columns with larger elements with higher probability (a sketch of this sampling step follows below).
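A minimal NumPy sketch of this sampling step (an illustration of the general recipe above, not the exact algorithm from any particular paper): columns are drawn with probability proportional to their squared Euclidean norms, one common way to favor rows/columns with larger elements, and each sampled column is rescaled so that the sketch behaves like the original matrix in expectation. The function name, the seed, and the norm-squared probabilities are assumptions made for the example.

```python
import numpy as np

def sample_columns(A, s, seed=0):
    """Form an m-by-s sketch S from s randomly sampled columns of A.

    Column j is picked with probability p_j proportional to its squared
    Euclidean norm, so columns with larger entries are picked more often.
    Each sampled column is rescaled by 1/sqrt(s * p_j), which makes
    S @ S.T an unbiased estimate of A @ A.T.
    """
    rng = np.random.default_rng(seed)
    col_norms_sq = np.sum(A**2, axis=0)
    p = col_norms_sq / col_norms_sq.sum()      # sampling probabilities
    idx = rng.choice(A.shape[1], size=s, replace=True, p=p)
    return A[:, idx] / np.sqrt(s * p[idx])     # rescaled sampled columns
```

The sketch needs only a couple of passes over A: one to compute the column norms and probabilities, and one more to pull out the sampled columns, which can then be kept in RAM even when A itself cannot.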
Applications: Data Mining
We are given m (> 10^6) documents and n (> 10^5) terms describing the documents.
Database
An m-by-n matrix A (Aij shows the frequency of term j in document i).
Every row of A represents a document.
Queries
Given a new document x, find similar documents in the database (nearest
neighbors).
Applications (cont’d)
Two documents are “close” if the angle between their corresponding vectors is small. So,
xᵀ·d = cos(θ(x,d))
is high when the two objects are close. (We assume that the vectors are normalized.)
[Figure: documents d and x drawn as vectors over the axes term 1 and term 2, with the angle θ(x,d) between them.]
A·x computes all the angles and answers the query.
Key observation: The exact value xᵀ·d might not be necessary.
1. The feature values in the vectors are set by coarse heuristics.
2. It is in general enough to see if xᵀ·d > Threshold (a sketch of the query follows below).
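A minimal NumPy sketch of this query (the matrix sizes, the threshold, and the variable names are illustrative, and the document-term matrix here is just random toy data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document-term matrix A: each row is a document, each column a term.
A = rng.random((1000, 50))
A = A / np.linalg.norm(A, axis=1, keepdims=True)   # normalize every document vector

# New document x, also normalized, so dot products are cosines of angles.
x = rng.random(50)
x = x / np.linalg.norm(x)

# One matrix-vector product computes the cosine between x and every document.
cosines = A @ x

# We rarely need the exact values: thresholding is enough to report
# the (approximate) nearest neighbors.
threshold = 0.9
neighbors = np.nonzero(cosines > threshold)[0]
print(f"{neighbors.size} documents with cosine above {threshold}")
```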
Approximating Matrix Multiplication
Goal
Proximity problem: identify all document–document matches in the database. Given an m-by-n matrix A, approximate the product A·Aᵀ.
Idea
1. Pick s columns of A to form an m-by-s matrix S.
2. (Discard A.) Approximate A·Aᵀ by S·Sᵀ.
The algorithm
• It only requires storage of a few columns of A.
• It is faster than full matrix multiplication.
• We can prove error bounds for this algorithm! (A small numerical check follows below.)
• We can use similar ideas to tackle various other matrix operations.
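Putting the pieces together, here is a small self-contained numerical check of the idea (illustrative only: the matrix is random and the sketch size s is arbitrary). It repeats the norm-squared column sampling from the earlier sketch and compares S·Sᵀ with A·Aᵀ in Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, s = 200, 5000, 500             # 200 "documents", 5000 "terms", sketch size s

A = rng.random((m, n))

# Norm-squared column sampling with rescaling (same recipe as before).
p = np.sum(A**2, axis=0)
p = p / p.sum()
idx = rng.choice(n, size=s, replace=True, p=p)
S = A[:, idx] / np.sqrt(s * p[idx])

exact = A @ A.T                      # the full m-by-m product
approx = S @ S.T                     # the sketched product, from only s columns
rel_err = np.linalg.norm(exact - approx, "fro") / np.linalg.norm(exact, "fro")
print(f"relative Frobenius error using {s} of {n} columns: {rel_err:.3f}")
```

Since S has s columns instead of n, forming S·Sᵀ costs O(m²·s) instead of O(m²·n) operations and needs only the s sampled columns in memory.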
Future Directions
• Extract structure from data in the form of tensors (multi-dimensional
arrays) instead of matrices (two-dimensional arrays).
• Think of the evolution of the email graph (vertices are email users and
edges denote the exchange of an email message) over time. This can be
modelled as a three-dimensional array. Can we find communities of users in
this time-evolving graph?
• Extract non-linear structure from data.
• A subtle point is that most of the existing techniques only extract linear structure from the data. Can we devise efficient algorithms to extract non-linear structure from massive datasets?
Teaching Agenda
Graduate Courses offered by Theory Group members in the last few years …
• Topics in Computational Geometry (Freedman)
• Graph Theory (Goldberg)
• Computability and Complexity (Goldberg)
• Machine and Computational Learning (Magdon-Ismail)
• Introduction to Computational Finance (Magdon-Ismail)
• Random Graphs and The WEB-graph (Goldberg, Moorthy, Yener)
• Network Security (Yener)
• Network Flows and Linear Programming (Drineas, Yener)
• Randomized Algorithms (Drineas)
• Machine Learning (Das)
• Advanced Algorithm Design (Anshelevich)
Enjoy your (theory) classes!