Database Technologies for E-Commerce

Rakesh Agrawal
IBM Almaden Research Center
Outline
- Overview of the Pangea B2B marketplace
- Technology dives:
  - Catalog integration
  - Searching specification documents
  - Interactive parametric search
  - Storage & retrieval of e-commerce data
- Research opportunities
Pangea Architecture
[Figure: architecture diagram. Components: Product Selection; Payment and Delivery; Brand and Loyalty Creation; Preference System (user provided / system mined); Data Acquisition (wrappers, cleaning, data transfer); User Profiling and Market Research; Registration/Accounts; all built on a web infrastructure of WebSphere/DB2.]
Product Selection
- Simple products:
  - Search: parametric, description, category, substitute products
  - Browse: category tree, neighborhood
- Composite products:
  - Product lists with aggregate constraints (e.g. BOM)
  - Configuration of products with compatibility constraints
- Comparisons: price, simple attribute-value comparison, detailed description (e.g. data sheets), reviews, experiences
- Bookmarking (shopping basket)
- Preference system: specification/specialization/customization, multicriteria, auctions
Brand and Loyalty Creation
- Personal Database: tools for extracting, organizing, and maintaining information
- Resource Center: forums, AskExpert
- Career Center: Resume Bank, Job Bank, Matchmaker (query, browse), headhunters, salary surveys, job trends
- Micro communities: technology, products, businesses, chats, events
- News Center: current, archive
- Preference System: specification/specialization/customization
Features (1)
- Eureka:
  - Interactive parametric search
  - Searches based on example products
  - Similarity search for product substitution
- Dynamic content generation/presentation (on the fly from the database):
  - Catalog hierarchy pages with product counts
  - Side-by-side product comparison
  - Product listings/details
  - Shopper groups through the preference layer
Features (2)
- Server-side searching:
  - Functional description search using classification
  - Keyword & part-number search
  - Restriction of the search to sub-trees of the category hierarchy (pushed down to the database)
- Real-time price and availability engine:
  - Ability to crawl, interrogate, and extract price & availability data from various suppliers/distributors in real time and present it in side-by-side comparison format
  - Easily add new distributors/suppliers
  - Completely client-side implementation to prevent blocking
Features (3)
- Design Worksheet (generalized shopping cart):
  - List of items: an item can be a specific part, an alternative set of parts, specifications, or other design worksheets (nesting)
  - Completely integrated with all relevant components (search, content, price, etc.)
  - Aggregate constraints (e.g. total price)
  - Multiple projects saved on the server
  - Share projects with other (authorized) users
  - Mining for suggestions
Features (4)
- Design data warehouse creation and maintenance:
  - Crawling technology for extracting part information from websites of various manufacturers/distributors/suppliers
  - Data extraction from manufacturer spec sheets (PDF files)
  - Classification technology to build, merge, manage, and populate category hierarchies
Features (5)
- Soft content creation and maintenance:
  - Crawling to acquire articles/news/postings from various web news sources, Usenet newsgroups, etc.
  - Agents to cluster, classify, extract concepts, and identify associations
  - Personalized news/data presentation (based on user-defined profiles, channels, etc.)
  - Complete interleaving and integration with part content
  - Automatic generation of summary pages for manufacturers (e.g. related news articles, URLs, product categories)
Prototype
- Data for nearly 2 million parts, in 2000 categories
- Focused crawling of more than 175 manufacturers for datasheets/application notes/manuals (160,000 documents)
- 1/4 terabyte database
Catalog Integration
R. Agrawal, R. Srikant: WWW-10
Catalog Integration Problem
- Integrate products from a new catalog into the master catalog.
[Figure: Master catalog — ICs with subcategories DSP, Mem., and Logic containing products a–f; New catalog — categories Cat1 (x, y) and Cat2 (z).]
The Problem (cont.)
- After integration:
[Figure: the master catalog's ICs tree, with subcategories DSP, Mem., and Logic now holding the new catalog's products x, y, z alongside a–f.]
Desired Solution
- Automatically integrate products:
  - little or no effort on the part of the user
  - domain independent
- Problem size:
  - A million products
  - Thousands of categories
Model
- Product descriptions consist of words
- Products live in the leaf-level categories
Basic Algorithm
- Build a classification model using product descriptions in the master catalog.
- Use the classification model to predict categories for products in the new catalog.
[Figure: the classifier predicts a category for product x, e.g. 95% vs. 5% between DSP and Logic.]
National Semiconductor Files
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Part_Id: DS14185
Manufacturer: national
Title: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Description: The DS14185 is a three driver, five receiver device which conforms to the EIA/TIA-232-E standard. The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.

Part: LM3940 1A Low Dropout Regulator
Part: Wide Adjustable Range PNP Voltage Regulator
Part: LM2940/LM2940C 1A Low Dropout Regulator
...
National Semiconductor Files with Categories
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Pangea Category:
  Choice 1: Transceiver
  Choice 2: Line Receiver
  Choice 3: Line Driver
  Choice 4: General-Purpose Silicon Rectifier
  Choice 5: Tapped Delay Line

Part: LM3940 1A Low Dropout Regulator
Pangea Category:
  Choice 1: Positive Fixed Voltage Regulator
  Choice 2: Voltage-Feedback Operational Amplifier
  Choice 3: Voltage Reference
  Choice 4: Voltage-Mode SMPS Controller
  Choice 5: Positive Adjustable Voltage Regulator
...
Accuracy on Pangea Data
- B2B portal for electronic components: 1200 categories, 40K training documents.
- 500 categories with < 5 documents.
- Accuracy:
  - 72% for the top choice
  - 99.7% for the top 5 choices
Enhanced Algorithm: Intuition
- Use affinity information in the catalog to be integrated (the new catalog): products in the same category are similar.
- Bias the classifier to incorporate this information.
- The accuracy boost depends on the quality of the new catalog: use a tuning set to determine the amount of bias.
Algorithm
- Extension of the Naive-Bayes classifier to incorporate affinity information
Naive Bayes Classifier
- Pr(Ci|d) = Pr(Ci) Pr(d|Ci) / Pr(d)   // Bayes' rule
- Pr(d): same for all categories (ignore)
- Pr(Ci) = (#docs in Ci) / (#total docs)
- Pr(d|Ci) = ∏(w∈d) Pr(w|Ci)
  - Words occur independently (unigram model)
- Pr(w|Ci) = (n(Ci,w) + α) / (n(Ci) + α|V|)
  - Maximum-likelihood estimate smoothed with Lidstone's law of succession
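The basic classifier above can be sketched in a few lines. This is a minimal illustration with made-up toy training data (the category names and words are hypothetical), using log probabilities and the Lidstone-smoothed word estimates from the slide.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (category, product-description words).
TRAIN = [
    ("DSP", ["digital", "signal", "processor", "16-bit"]),
    ("DSP", ["dsp", "fixed-point", "processor"]),
    ("Logic", ["nand", "gate", "logic", "ttl"]),
    ("Logic", ["flip-flop", "logic", "cmos"]),
]
ALPHA = 0.1  # Lidstone smoothing parameter (the α above)

docs_per_cat = Counter(cat for cat, _ in TRAIN)
word_counts = defaultdict(Counter)          # n(Ci, w)
for cat, words in TRAIN:
    word_counts[cat].update(words)
vocab = {w for _, ws in TRAIN for w in ws}  # V

def log_pr(cat, doc):
    """log Pr(Ci) + Σ_w log Pr(w|Ci), with Lidstone smoothing."""
    n_ci = sum(word_counts[cat].values())   # n(Ci)
    lp = math.log(docs_per_cat[cat] / len(TRAIN))
    for w in doc:
        lp += math.log((word_counts[cat][w] + ALPHA) /
                       (n_ci + ALPHA * len(vocab)))
    return lp

def classify(doc):
    return max(docs_per_cat, key=lambda c: log_pr(c, doc))

print(classify(["fixed-point", "digital", "processor"]))  # → DSP
```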
Enhanced Algorithm
Pr(Ci|d,S)                                 // d existed in category S
  = Pr(Ci,d,S) / Pr(d,S)                   // Pr(Ci,d,S) = Pr(d,S) Pr(Ci|d,S)
  = Pr(Ci) Pr(S,d|Ci) / Pr(d,S)
  = Pr(Ci) Pr(S|Ci) Pr(d|Ci) / Pr(S,d)     // assuming d, S independent given Ci
  = Pr(S) Pr(Ci|S) Pr(d|Ci) / Pr(S,d)      // Pr(S|Ci) Pr(Ci) = Pr(Ci|S) Pr(S)
  = Pr(Ci|S) Pr(d|Ci) / Pr(d|S)            // Pr(S,d) = Pr(S) Pr(d|S)

- Same as NB except Pr(Ci|S) instead of Pr(Ci)
  - Ignore Pr(d|S) as it is the same for all classes
Computing Pr(Ci|S)
- Pr(Ci|S) = |Ci| (#docs in S predicted to be in Ci)^w / Σ(j∈[1,n]) |Cj| (#docs in S predicted to be in Cj)^w
- |Ci| = #docs in Ci in the master catalog
- w determines the weight of the new catalog
  - Use a tune set of documents in the new catalog for which the correct categorization in the master catalog is known
  - Choose one weight for the entire new catalog, or different weights for different sections
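The biased prior can be computed directly from the base classifier's predictions. A minimal sketch of the formula above, with illustrative category names and counts (not real data):

```python
# Enhanced prior Pr(Ci|S): bias the master-catalog prior |Ci| by how many
# documents of new-catalog category S the basic classifier already placed
# in each master category Ci, raised to the tuned weight w.

def enhanced_priors(master_sizes, predicted_in_S, w):
    """master_sizes[Ci] = |Ci| in the master catalog;
    predicted_in_S[Ci] = #docs of S the base classifier put in Ci;
    w = weight of the new catalog (chosen with a tune set)."""
    scores = {c: master_sizes[c] * (predicted_in_S.get(c, 0) ** w)
              for c in master_sizes}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Example: most of S's documents were predicted into "DSP".
priors = enhanced_priors({"DSP": 100, "Logic": 300}, {"DSP": 9, "Logic": 1}, w=1)
print(priors["DSP"])  # → 0.75  (900 / 1200)
```

Raising w pushes the priors further toward the categories where S's documents cluster; w = 0 recovers the basic algorithm's priors.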
Superiority of the Enhanced Algorithm
- Theorem: The highest possible accuracy achievable with the enhanced algorithm is no worse than what can be achieved with the basic algorithm.
- Catch: The optimum value of the weight for which the enhanced algorithm achieves the highest accuracy is data dependent.
- The tune-set method attempts to select a good value for the weight, but there is no guarantee of success.
Empirical Evaluation
- Start with a real catalog M
- Remove n products from M to form the new catalog N
- In the new catalog N:
  - Assign f*n products to the same category as in M
  - Assign the rest to other categories as per some distribution (but remember their true category)
- Accuracy: fraction of products in N assigned to their true categories
Improvement in Accuracy (Pangea)
[Figure: accuracy (65–100%) vs. weight (1–200) on Pangea, for Perfect, 90-10, 80-20, GaussianA, and GaussianB new catalogs against the Base accuracy; all curves improve over Base as the weight grows.]
Improvement in Accuracy (Reuters)
[Figure: accuracy (82–100%) vs. weight (1–200) on Reuters, for the same Perfect, 90-10, 80-20, GaussianA, and GaussianB catalogs against the Base accuracy.]
Improvement in Accuracy (Google.Outdoors)
[Figure: accuracy (50–100%) vs. weight (1–1000) on Google.Outdoors, for Perfect, 90-10, 80-20, GaussianA, and GaussianB catalogs against the Base accuracy.]
Tune Set Size (Pangea)
[Figure: accuracy (70–95%) vs. tune-set size (0–50) on Pangea, for Perfect, 90-10, 80-20, GaussianA, and GaussianB catalogs against the Base accuracy.]
Similar results for Reuters and Google.
Empirical Results
[Figure: % errors (0–20) for the Standard vs. Enhanced algorithm at three purity levels (number of classes & their distribution): 71-22-6, 79-21, and 100.]
Summary
- Classification accuracy can be improved by factoring in the affinity information implicit in the data to be categorized.
- How to apply these ideas to other types of classifiers?
Searching Specification Documents
R. Agrawal, R. Srikant: WWW-2002
Specification Documents
- Consist of <attribute name, value> pairs embedded in text
- Examples:
  - Data sheets for electronic parts
  - Classifieds
Sources of Problems
- Synonyms for attribute names and units:
  - "lb" and "pounds", but no "lbs" or "pound".
- Attribute names are often missing:
  - No "Speed", just "MHz Pentium III"
  - No "Memory", just "MB SDRAM"
- Accurate data extraction is hard, e.g. this partial datasheet for the Cypress CY7C225A PROM:
    High Speed
      18 ns address set-up
      12 ns clock to output
    Low Power
      495 mW (commercial)
      660 mW (military)
An End Run!
- Use a simple regular-expression extractor to get numbers
- Do simple data extraction to get hints, e.g.:
  - Hint for unit: the word following the number.
  - Hint for attribute name: the k words following the number.
- Example: "256 MB SDRAM memory" → unit hint: MB; attribute hints: SDRAM, Memory
- Use only numbers in the queries
  - Treat any attribute name in the query also as a hint
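The "end run" extractor above needs nothing more than a number pattern and a token window. A minimal sketch, where the whitespace tokenization and the k=2 hint window are my own simplifying assumptions, not the paper's exact rules:

```python
import re

NUM = re.compile(r"\d+(?:\.\d+)?")  # simple regular-expression number matcher

def extract(text, k=2):
    """Return (number, hints) pairs: the word after each number is the unit
    hint; the k words after it serve as attribute-name hints."""
    tokens = text.split()
    pairs = []
    for i, tok in enumerate(tokens):
        if NUM.fullmatch(tok):
            hints = tokens[i + 1:i + 1 + k]  # first hint doubles as the unit
            pairs.append((float(tok), hints))
    return pairs

print(extract("256 MB SDRAM memory"))
# → [(256.0, ['MB', 'SDRAM'])]
```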
Documents and Queries
- With hints:
  - Document D = {<ni, Hi> | ni ∈ N, Hi ⊆ A, 1 ≤ i ≤ m}
  - Query Q = {<qi, Ai> | qi ∈ N, Ai ⊆ A, 1 ≤ i ≤ k}
  - Hi and Ai are hints
  - Example: D = {<256, {MB, SDRAM, Memory}>, <700, {MHz, CPU}>}, Q = <200 MB, 750 MHz>
- Without hints:
  - Document D = {ni | ni ∈ N, 1 ≤ i ≤ m}
  - Query Q = {qi | qi ∈ N, 1 ≤ i ≤ k}
  - Example: D = {256, 700}, Q = {200, 750}
Can it work?
- Yes!
  - Provided the data is non-reflective
  - Reflectivity can be computed a priori for a given data set, and provides an estimate of the expected accuracy.
Demo
Reflectivity
- Non-reflective: the presence of <x=i, y=j> implies the absence of <x=j, y=i>
[Figure: four scatter plots over [0,50] × [0,50] illustrating a non-reflective data set, two low-reflectivity data sets, and a high-reflectivity data set.]
Non-reflectivity in real life
- Non-overlapping attributes:
  - Memory: 64–512 MB, Disk: 10–40 GB
- Correlations:
  - Memory: 64–512 MB, Disk: 10–100 GB is still fine.
- Clusters
Basic Idea
- Consider the coordinates of a point
- If there is no data point at the permutations of its coordinates, this point is non-reflective
  - Only a few data points at the permutations of its coordinates => the point is largely non-reflective
- Compute reflectivity as this ratio summed over all the points
  - Consider the neighborhood of a point in the above calculation
Reflectivity
- D: set of m-dimensional points
- n: coordinates of point x
- h(n): number of points within distance r of n
- Reflections(x): permutations of n
- q(n): number of points in D that have at least one reflection within distance r of n
- Reflectivity(m,r) = 1 - (1/|D|) Σ(x∈D) h(n)/q(n)
- Non-reflectivity = 1 - Reflectivity
- See the paper for the reflectivity of D over k-dimensional subspaces
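The definition above transcribes almost directly into code. A brute-force sketch for small data sets (O(|D|² · m!) as written, so illustration only; the toy points mimic non-overlapping attributes such as memory vs. disk):

```python
from itertools import permutations
import math

def reflectivity(D, r):
    """Direct transcription of Reflectivity(m, r) = 1 - (1/|D|) Σ h(n)/q(n)."""
    total = 0.0
    for x in D:
        # h(n): points within distance r of x's coordinates
        h = sum(1 for y in D if math.dist(x, y) <= r)
        # q(n): points with at least one coordinate permutation within r of x
        q = sum(1 for y in D
                if any(math.dist(x, p) <= r for p in permutations(y)))
        total += h / q                      # q >= h >= 1 (identity permutation)
    return 1.0 - total / len(D)

# Non-overlapping coordinate ranges: no point lies near a permutation of
# another, so the data is perfectly non-reflective.
D = [(100.0, 1.0), (200.0, 2.0), (300.0, 3.0)]
print(reflectivity(D, r=5.0))  # → 0.0
```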
Algorithms
- How to compute the match score (rank) of a document for a given query?
- How to limit the number of documents for which the match score is computed?
Match Score of a Document
- Select k numbers from D yielding the minimum distance between Q and D:
  - F(Q,D) = (Σ(i=1..k) w(qi, nji)^p)^(1/p)
- Map the problem to bipartite matching in a graph G:
  - k source nodes: corresponding to query numbers
  - m target nodes: corresponding to document numbers
  - An edge from each source to its k nearest targets, with weight w(qi, nj)^p on edge (qi, nj).
Bipartite Matching
- The optimum solution to the minimum-weight bipartite graph matching problem matches each number in Q with a distinct number in D such that the distance score F(Q,D) is minimized.
- The minimum score gives the rank of the document D for the query Q.
- Assuming F to be L1 and w(qi, nj) := |qi - nj| / |qi + e|:
[Example: Doc = {10, 25, 75}, Query = {20, 60}; edge weights w(20,10) = .5, w(20,25) = .25, w(60,25) = .58, w(60,75) = .25.]
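For small queries the minimum-weight assignment can be found by brute force. A sketch that scores a document under L1 with w(q, n) = |q - n| / q (dropping the small e term); the paper uses a proper bipartite-matching algorithm instead of enumeration:

```python
from itertools import permutations

def score(query, doc):
    """F(Q, D) under L1: cheapest assignment of query numbers to distinct
    document numbers (brute force, fine for small k)."""
    best = float("inf")
    for assign in permutations(doc, len(query)):
        cost = sum(abs(q - n) / q for q, n in zip(query, assign))
        best = min(best, cost)
    return best

# Doc numbers {10, 25, 75}, query {20, 60}: the minimum-weight matching
# pairs 20 with 25 (.25) and 60 with 75 (.25).
print(round(score([20, 60], [10, 25, 75]), 2))  # → 0.5
```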
Limiting the Set of Documents
- Similar to the score-aggregation problem [Fagin, PODS 96]
- The proposed algorithm is an adaptation of the TA algorithm in [Fagin-Lotem-Naor, PODS 01]
Limiting the Set of Documents
- Query: {20, 60}
[Example sorted lists of number/distance entries with their documents — for 20: 25/.25 (D1 D5 D7 D9), 10/.5 (D2 D3), 35/.75 (D4 D6 D8); for 60: 66/.1 (D2 D7), 75/.25 (D1 D3 D4 D5), 25/.58 (D6 D8 D9).]
- Make k conceptual sorted lists, one for each query term [use: documents = index(number)]
- Do a round-robin access to the lists. For each document found, compute its distance F(D,Q)
- Let ni := the number last looked at for query term qi, and let t := (Σ(i=1..k) w(qi, ni)^p)^(1/p)
- Halt when enough documents have been found whose distance ≤ t
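The round-robin threshold idea can be sketched as follows. This simplification assumes one document per list entry and an oracle `exact_score` for the full distance; the list contents below are illustrative, not the slide's exact posting lists:

```python
def threshold_search(sorted_lists, exact_score, top_n):
    """sorted_lists[i]: (weight, doc) pairs ascending for query term i.
    exact_score(doc): the full distance F(Q, D). Returns the top_n docs."""
    frontier = [0.0] * len(sorted_lists)
    seen = {}
    for pos in range(max(len(l) for l in sorted_lists)):
        for i, lst in enumerate(sorted_lists):       # round-robin access
            if pos < len(lst):
                w, doc = lst[pos]
                frontier[i] = w                      # last weight seen in list i
                if doc not in seen:
                    seen[doc] = exact_score(doc)
        t = sum(frontier)                            # any unseen doc scores >= t
        if sum(1 for s in seen.values() if s <= t) >= top_n:
            break                                    # safe to stop early
    return sorted(seen.items(), key=lambda kv: kv[1])[:top_n]

scores = {"D1": 0.5, "D2": 0.6, "D4": 1.5, "D6": 1.3}
lists = [[(0.25, "D1"), (0.5, "D2"), (0.75, "D4")],   # list for one query term
         [(0.1, "D2"), (0.25, "D1"), (0.58, "D6")]]   # list for the other
print(threshold_search(lists, scores.get, top_n=1))   # → [('D1', 0.5)]
```

Because every document not yet seen must have per-term weight at least the current list frontiers, its total score is at least t, so the search can stop as soon as top_n found documents score below t.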
Evaluation Metric
- Benchmark: the set of answers when attribute names are precisely known in the document and query
- What fraction of the top 10 "true" answers are present in the top 10 answers when attribute names are unknown in both document and query?
Accuracy Results

  Dataset   Records   Fields
  DRAM        3,800       10
  LCD         1,700       12
  Proc        1,100       12
  Trans      22,273       24
  Auto          205       16
  Credit        666        6
  Glass         214       10
  Housing       506       14
  Wine          179       14

[Figure: accuracy (0–100%) vs. query size (1–5) for the nine data sets above.]
Reflectivity estimates accuracy
- Non-reflectivity closely tracked accuracy for all nine data sets
- Non-reflectivity arises due to clustering and correlations in real data (randomized non-reflectivity: the value obtained after permuting values in the columns)
[Figure: accuracy, non-reflectivity, and randomized non-reflectivity (0–100%) vs. query size (1–7).]
Incorporating Hints
- L1 distance = Σi [ w(qi, ni) + B v(Ai, Hi) ]
  - v(Ai, Hi): distance between the attribute name (or unit) for qi and the set of hints for ni
  - B: relative importance of number match vs. name/unit match
Balance between Number Match & Hint Match
- The weight given to hints should depend on the accuracy of the hints
- Use a tune set to determine B on a per-dataset basis
[Figure: accuracy on the Wine data set vs. weight of hint match (0.01–10), for hint accuracies 0.99, 0.9, 0.8, 0.6, and no hints.]
Effectiveness of Hints
- The improvement in accuracy depends on how good the hints are
[Figure: accuracy on the Wine data set vs. query size (1–10), for hint accuracies 0.99, 0.9, 0.8, 0.6, and no hints.]
Effectiveness of Indexing
- 1 million docs:
  - 1 sec for query size = 5
  - .03 sec for query size = 1
[Figure: query time (ms, log scale) vs. query size (1–7) on DRAM, for scan vs. indexed execution.]
Summary
- Allows querying using only numbers, or numbers + hints.
- An end run around data extraction.
- Use a simple extractor to generate hints.
- Can ascertain a priori when the technique will be effective.
Future Work
- Integration with classic IR (keyword search)
  - e.g. "PROM speed 20 ns power 500 mW"
- Extension to non-numeric values
Parametric Search of E-Commerce Data
J. Shafer, R. Agrawal: WWW-9
Conventional Wisdom
• User enters search criteria into an HTML form
• User's query is received by the web server and submitted to a server-side database (usually as SQL)
• Result set is returned to the user as an HTML page (or a series of pages)
Search Problem in E-Commerce
• Users often don't know exactly what they are looking for
  – Search is a process, not a single operation
  – An individual query is simply a means to an end
• Too many/few results: try a different query
• No universal ranking function
Essential Ideas of Eureka
• Incrementally fetch a superset of the tuples of interest into the client (e.g. all products in the product category of interest)
• User interface and fast indexing structures for interactive exploration of the cached tuples:
  – Restriction by example
  – Ranking
  – Fuzzy restrictions
Architecture Overview
[Figure: Eureka architecture. On the client, a ListRenderer sits on a DataPump over a DataGroup of DataColumns #1…#N; the client talks HTTP to a servlet on the server, which accesses the database via JDBC.]
Comments
• Used Eureka in several situations with very
positive feedback
• Used Eureka on datasets with 100K records
with no visible deterioration in performance
• Performance is excellent, even in Java
Storage and Retrieval of E-Commerce Data
R. Agrawal, A. Somani, Y. Xu: VLDB-2001
Typical E-Commerce Data Characteristics
An experimental e-marketplace for computer components:
- Nearly 2 million components
- More than 2000 leaf-level categories
- Large number of attributes (5000)
- Constantly evolving schema
- Sparsely populated data (about 50–100 attributes per component)
Conventional horizontal representation (n-ary relation)

  Name         Monitor  Height  Recharge  Output   Playback      Smooth Scan  Progressive Scan
  PAN DVD-L75  7 inch   -       Built-in  Digital  -             -            -
  KLH DVD221   -        3.75    -         S-Video  -             -            No
  SONY S-7000  -        -       -         -        -             -            -
  SONY S-560D  -        -       -         -        Cinema Sound  Yes          -
  ...          ...      ...     ...       ...      ...           ...          ...

- DB catalogs do not support thousands of columns (DB2/Oracle limit: 1012 columns)
- Storage overhead of NULL values: nulls increase the index size, and they sort high in a DB2 B+-tree index
- Hard to load/update
- Schema evolution is expensive
- Querying is straightforward
Binary Representation (N 2-ary relations)
One (Name, Val) relation per attribute:

  Monitor:  (PAN DVD-L75, 7 inch)
  Height:   (KLH DVD221, 3.75)
  Output:   (PAN DVD-L75, Digital), (KLH DVD221, S-Video)

- Dense representation
- Manageability is hard because of the large number of tables
- Schema evolution is expensive
- Related work: Decomposition Storage Model [Copeland et al. SIGMOD 85], [Khoshafian et al. ICDE 87]; Monet's binary attribute tables [Boncz et al. VLDB Journal 99]; the attribute approach for storing XML data [Florescu et al. INRIA Tech Report 99]
Vertical representation (one 3-ary relation)
Oid (object identifier), Key (attribute name), Val (attribute value):

  Oid  Key                 Val
  0    'Name'              'PAN DVD-L75'
  0    'Monitor'           '7 inch'
  0    'Recharge'          'Built-in'
  0    'Output'            'Digital'
  1    'Name'              'KLH DVD221'
  1    'Height'            '3.75'
  1    'Output'            'S-Video'
  1    'Progressive Scan'  'No'
  2    'Name'              'SONY S-7000'
  ...  ...                 ...

- Objects can have a large number of attributes
- Handles sparseness well
- Schema evolution is easy
- Related work: implementation of SchemaSQL [LSS 99]; the edge approach for storing XML data [FK 99]
Querying over the Vertical Representation
A simple query on a horizontal scheme:

  SELECT Monitor FROM H WHERE Output = 'Digital'

becomes quite complex:

  SELECT v1.Val
  FROM   vtable v1, vtable v2
  WHERE  v1.Key = 'Monitor'
  AND    v2.Key = 'Output'
  AND    v2.Val = 'Digital'
  AND    v1.Oid = v2.Oid

Writing applications becomes much harder. What can we do?
Solution: Query Mapping
- A translation layer maps relational algebraic operations on H to operations on V
[Figure: queries against the horizontal view H (Oid, Attr1 … Attrk) pass through a query-mapping layer onto the vertical table V (Oid, Key, Val).]
Key Issues
- Can we define a translation algebra?
  - The standard database view mechanism does not work: attribute values become column names (need higher-order views)
- Can the vertical representation support fast querying?
- What are the new requirements on the database engines?
Transformation Algebra
- Defined an algebra for transforming expressions over horizontal views into expressions over the vertical representation.
- Two key operators:
  - v2h operation — convert from vertical to horizontal:
    v2h_k(V) = π_Oid(V) outer-joined with π_{Oid,Val}(σ_{Key='Ai'}(V)), for i = 1..k
  - h2v operation — convert from horizontal to vertical:
    h2v_k(H) = ∪(i=1..k) π_{Oid,'Ai',Ai}(σ_{Ai ≠ null}(H)), plus a term for objects whose attributes are all null
  - Similar to Unfold/Fold [LSS 99] and Gather/Scatter [STA 98]
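The two operators are easy to mimic on in-memory tuples. A toy stand-in for the relational-algebra definitions (pivot triples into per-object rows, then flatten back, dropping nulls), not the paper's SQL translation:

```python
def v2h(V, attrs):
    """Pivot (oid, key, val) triples into oid -> {attr: val-or-None} rows."""
    rows = {}
    for oid, key, val in V:
        rows.setdefault(oid, {a: None for a in attrs})
        if key in attrs:
            rows[oid][key] = val
    return rows

def h2v(H):
    """Flatten horizontal rows back into triples, dropping null attributes."""
    return [(oid, k, v) for oid, row in H.items()
            for k, v in row.items() if v is not None]

V = [(0, "Name", "PAN DVD-L75"), (0, "Output", "Digital"),
     (1, "Name", "KLH DVD221"), (1, "Height", "3.75")]
H = v2h(V, ["Name", "Output", "Height"])
print(H[0]["Output"])               # → Digital
print(sorted(h2v(H)) == sorted(V))  # → True  (round trip)
```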
Implementation Strategies
- VerticalSQL: uses only SQL-92 level capabilities
- VerticalUDF: exploits user-defined functions and table functions
- Binary: 2-ary representation with one relation per attribute (using only SQL-92 transforms)
- SchemaSQL: addresses a more general problem; performed 2–3x worse than the vertical representation because of temporary tables
Performance: Experimental Setup
- 600 MHz dual-processor Intel Pentium machine, 512 MB RAM, two 30 GB IDE drives
- Windows NT 4.0; IBM DB2 UDB 7.1; buffer pool size 50 MB
- All numbers reported are cold numbers
- Synthetic data: 200x100K & r=10% means 200 columns, 100K rows, and non-null density = 10%
Clustering by Key outperformed clustering by Oid
[Figure: join execution time (seconds) at join selectivities 0.1%, 1%, and 5% (density 10%, 1000 cols x 20K rows), comparing VerticalSQL_oid and VerticalSQL_key.]
VerticalSQL comparable to Binary and outperforms Horizontal
[Figure: execution time (seconds) for a projection of 10 columns at density 10%, over tables 200x100K, 400x50K, 800x25K, and 1000x20K (#cols x #rows), comparing HorizontalSQL, VerticalSQL, and Binary.]
VerticalUDF is the best approach
[Figure: execution time (seconds) for a projection of 10 columns at density 10%, over tables 200x100K to 1000x20K, comparing VerticalSQL, Binary, and VerticalUDF.]
Wish List from Database Engines
The VerticalUDF approach showed the need for:
- Enhanced table functions
- First-class treatment of table functions
- Native support for v2h and h2v operations
- Partial indices
Summary

                 Horizontal  Vertical (w/ Mapping)  Binary (w/ Mapping)
  Manageability  +           +                      -
  Flexibility    -           +                      -
  Querying       +           +                      +
  Performance    -           +                      +
Opportunities

Semiautomatic Catalog Creation
Technology for quickly assembling a catalog from semi-structured data.
[Figure: a web browser reaches supplier web sites (Supplier 1 … Supplier n) through a proxy that logs the target pages; data mining over the logs produces fetch rules and extract rules in a rule repository, which drive semiautomatic catalog extraction into the WCS catalog.]
Architecture
[Figure: a web crawler and other sources feed a data store; PDF-to-text conversion and OCR feed an extractor, categorizer, and indexer (IM for Text).]
Improving the Catalog Hierarchy
- Identify sets of nodes (or a single "kitchen sink" node) for re-organization.
- Cluster the set of nodes, and compare the resulting hierarchy with the original.
- Identify nodes that fit better elsewhere in the hierarchy.
- Metrics: classification accuracy, tightness of clusters.
Integration of Real-Time Operational Data
- Fetch target page(s), query, and extract the response in real time.
[Figure: request/response flow (steps 1–6) between the client, the server, and remote sites 1..n.]
Privacy Preserving Data Mining
- Insight: preserve privacy at the individual level, while still building accurate data mining models at the aggregate level.
- Add random noise to individual values to protect privacy (e.g. 30 becomes 65 = 30 + 35).
- Can dramatically change the distribution of values.
- EM algorithm to estimate the original distribution of values given the randomized values + the randomization function.
- The estimate is only accurate over thousands of values => preserves privacy.
- Algorithms for building classification models and discovering association rules on top of privacy-preserved data, with only a small loss of accuracy.
[Figure: Alice's age/salary and Bob's age (30 | 70K | ..., 50 | 40K | ...) pass through randomizers (becoming 65 | 20K | ..., 25 | 60K | ...); the distributions of Age and Salary are reconstructed and fed to data mining algorithms to produce the data mining model.]
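The randomize-then-reconstruct idea above can be sketched end to end. This is a toy illustration in the spirit of the EM-style procedure: the uniform noise range, the 10-year age bins, and the crude 0/1 bin likelihood are all my simplifying choices, not the actual method.

```python
import random

random.seed(0)

BINS = list(range(0, 100, 10))   # age bins [0,10), [10,20), ...
NOISE = 30                       # uniform noise in [-NOISE, NOISE]

def perturb(values):
    """Randomize: each person reports value + noise, never the true value."""
    return [v + random.uniform(-NOISE, NOISE) for v in values]

def lik(y, b):
    """Crude Pr(randomized value y | true value in bin b): 1 if the bin
    center is within the noise range of y, else 0."""
    return 1.0 if abs(y - (b + 5)) <= NOISE else 0.0

def reconstruct(ys, iters=50):
    """Iteratively re-estimate the original histogram from randomized ys."""
    f = {b: 1 / len(BINS) for b in BINS}          # start from uniform
    for _ in range(iters):
        new = dict.fromkeys(BINS, 0.0)
        for y in ys:
            z = sum(lik(y, b) * f[b] for b in BINS) or 1.0
            for b in BINS:
                new[b] += lik(y, b) * f[b] / z    # posterior mass for y
        f = {b: new[b] / len(ys) for b in BINS}
    return f

ages = [random.gauss(35, 8) for _ in range(1000)]
est = reconstruct(perturb(ages))
mean_est = sum((b + 5) * p for b, p in est.items())
print(round(mean_est))  # recovers a mean close to the true mean of ~35
```

No individual value survives the randomizer, yet the aggregate histogram (and hence its mean) is recoverable, which is exactly the property the mining algorithms rely on.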
Seems to work well!
[Figure: number of people (0–1200) vs. age; the reconstructed distribution closely tracks the original, while the randomized distribution is much flatter.]
Accuracy vs. Privacy
[Figure: classification accuracy (40–100%) vs. randomization level (10–200) for the Original, Randomized, and Reconstructed data (Fn 3); accuracy on reconstructed data stays close to the original.]
Summary
- E-Commerce is a rich application domain for database technology
- Research opportunities abound