Gautam Das - CSE, IIT Bombay

Transcript Gautam Das - CSE, IIT Bombay

Data Analytics from
Hidden Databases:
The Good and the Bad
Gautam Das
University of Texas at Arlington
COMAD 2008
Mumbai, India
1
Hidden Databases on the WEB

The Web contains numerous Hidden
Databases




They are only accessible via proprietary form-based
interfaces
They return results based on user queries specified
via these forms
Often a limited (Top-k) number of results are returned
Examples



Google Base databases (e.g., vehicles)
Realtor.com
Airlines flight information websites
2
Hidden Databases on the WEB
Web Form Interface
Inventory
Database
Top-k Results
3
Data Analytics over Hidden
Databases:
Two Complimentary Problems
PROBLEM 1: How can external users leverage the search
interface to obtain aggregate/analytical information about
the hidden database?
MOTIVATION

The Good: Data present behind hidden forms needs to be extracted
and analyzed and can be used to power third-party applications, e.g.


Web mash-up applications, meta-search engines
The Bad: Adversaries can use this info to their advantage, e.g.,


Auto dealer websites can know about inventory about their
competitors
Terrorists can determine the likelihood that flights are relatively
empty on certain routes
4
Problem 2
Privacy Preserving Search: How can
hidden DB owners allow search access
to bona fide users, yet prevent leakage of
aggregate information?

This is the opposite of the traditional
privacy preserving studies –

Here we wish to reveal individual tuples, but wish to hide
aggregates
5
Challenges for Problem 1


One approach for data analytics is to draw an
random sample of the hidden DB and use
that for model building, aggregations,
analytics, etc
Random sampling from hidden DB is difficult
due to the limitations of the search interface


Top-k interface with proprietary ranking function
Querying limits imposed by data provider
6
Challenges for Problem 2


Because tuples have to returned to bona fide
search users, traditional privacy preservation
techniques such as data perturbation
cannot be used
Query auditing techniques also cannot be
used because databases with public
interfaces cannot track query history of
specific users
7
Our Results in this Area

Random sampling from hidden databases using random walks
over the query space [SIGMOD 2007]

Two phase sampling approach, where Sigmod 2007 technique is
first used for first obtaining small sample (hence learning small
model of data), and then this data model is used to guide the
second phase [ICDE 2009]

Dummy tuples insertion approach for privacy preserving search
[submitted for publication]

Demo video of the system at:
http://129.107.18.234/~anirban/final_demo/
8
Collaborators




Nan Zhang
Arjun Dasgupta
Anirban Maiti
Heikki Mannila
9
CONCLUSION

Introduced the problem of sampling from
hidden databases and the HIDDEN-DBSAMPLER algorithm

Theoretical analysis of performance and
quality tradeoffs

Extensive experimentation on varying
datasets
10
THANK YOU
11

Gautam Das - CSE, IIT Bombay

Transcript Gautam Das - CSE, IIT Bombay

Directory