Graph Indexing A Frequent Structure Based Approach
Will Data Mining Change the Functions of DBMS?
Jiawei Han
DAIS (Data And Information Systems) Lab
University of Illinois at Urbana-Champaign
Will DM Be Integrated with DB Functions?
DM: Already a functional component of DBMS
Microsoft/SQLServer: Analysis Manager
IBM/DB2 & IntelligentMiner
Oracle: Data Mining Package
But will DM “intrude” into the DBMS, i.e., be integrated with essential DBMS functions?
Indexing
Data integration
Data cleaning
Query processing
Indexing by Data Mining
Indexing graphs? The number of subgraphs is exponential!
Chemical Informatics/bioinformatics …
Discriminative frequent graph patterns (SIGMOD’04)
Indexing subsequences?
Shopping sequence, DNA/protein sequence (SDM’05)
When is discriminative frequent pattern indexing useful?
Complex objects, big (object) queries
[Figure: a sample database of graphs (a), (b), (c), and a query graph]
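The idea of indexing by mined patterns can be sketched in a few lines. This is a minimal illustration, not the SIGMOD’04 method itself: it assumes each graph is just a list of labeled edges, uses single labeled edges as a stand-in for mined frequent subgraphs, and omits the discriminative-pattern selection and the final subgraph-isomorphism verification. The names edge_features, build_index, candidates and min_support are hypothetical.

```python
from collections import defaultdict

def edge_features(graph):
    """Return the labeled-edge features of a graph.

    A graph is a list of (label_u, label_v) edges. Each feature is an
    unordered pair of vertex labels (assumption: a stand-in for a mined
    frequent subgraph pattern).
    """
    return {tuple(sorted(edge)) for edge in graph}

def build_index(database, min_support=2):
    """Map each frequent feature to the ids of the graphs containing it."""
    postings = defaultdict(set)
    for gid, graph in database.items():
        for feat in edge_features(graph):
            postings[feat].add(gid)
    # Keep only features that occur in at least min_support graphs.
    return {f: ids for f, ids in postings.items() if len(ids) >= min_support}

def candidates(index, database, query):
    """Return ids of graphs containing every indexed feature of the query.

    The result is a superset of the true answers; each candidate would
    still need a subgraph-isomorphism check, which is not shown here.
    """
    result = set(database)
    for feat in edge_features(query):
        if feat in index:  # unindexed features cannot be used for pruning
            result &= index[feat]
    return result

# Toy database of three "molecules" given as labeled edges.
database = {
    "a": [("C", "C"), ("C", "O"), ("C", "N")],
    "b": [("C", "C"), ("C", "O")],
    "c": [("C", "C"), ("C", "N")],
}
index = build_index(database)
query = [("C", "C"), ("O", "C")]
print(sorted(candidates(index, database, query)))  # prints ['a', 'b']
```

The larger the query object, the more features it contains and the more posting lists can be intersected, which is one way to read the point that pattern-based indexing helps most for complex objects and big queries.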
Data Cleaning by Data Mining
Load messy data into a structured database?
Inconsistent data: age = “1946”?
Field mis-alignments
Data glitches: completely garbled inputs
Missing/unmatched delimiters: XML, HTML data
Big fields: BLOB, CLOB, multimedia and text
Data mining
Data cleaning by distribution/outlier analysis
Dependency/correlation analysis
Schema-directed or schema “discovery”
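Cleaning by distribution/outlier analysis can be illustrated with the age = “1946” case. The sketch below is only one possible approach: it assumes a numeric column and flags values by their modified z-score (median and median absolute deviation). The function name flag_outliers and the threshold of 3.5 are assumptions, not part of the original slides.

```python
import statistics

def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value does not distort the
    baseline the way it would distort a mean and standard deviation.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Assumption: with no spread at all, anything off the median is suspect.
        return [v for v in values if v != med]
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# An "age" column where a birth year was loaded by mistake.
ages = [34, 41, 29, 1946, 52, 38, 45]
print(flag_outliers(ages))  # prints [1946]
```

A flagged value is only a candidate for repair; deciding that 1946 is a birth year rather than an age would need dependency or schema information of the kind listed above.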
Data Integration by Data Mining
Linking and mining across multiple data relations
CrossMine (classification across multiple data relations: ICDE’04)
Search across heterogeneous databases
Object identification/merge, reference reconciliation (Alon’s group)
Mining across heterogeneous DBs
Personalizing data from heterogeneous sources
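As a rough illustration of object identification across sources, the sketch below pairs records whose names share most of their tokens. It is a deliberately simple stand-in and not the method of any work cited above: the token-set Jaccard measure, the 0.6 threshold, and the names tokens, jaccard and match_records are all assumptions.

```python
import re

def tokens(name):
    """Lowercase a name and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def jaccard(a, b):
    """Token-set Jaccard similarity between two names, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def match_records(source_a, source_b, threshold=0.6):
    """Return (a, b) pairs whose similarity reaches the threshold."""
    return [
        (a, b)
        for a in source_a
        for b in source_b
        if jaccard(a, b) >= threshold
    ]

# Two hypothetical sources that spell the same people differently.
source_a = ["Smith, John A.", "Mary Jones"]
source_b = ["John Smith", "M. Jones", "John Doe"]
print(match_records(source_a, source_b))
# prints [('Smith, John A.', 'John Smith')]
```

The pair “Mary Jones” and “M. Jones” is missed because an initial shares no token with the full first name, which hints at why reference reconciliation usually needs evidence beyond string similarity.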
Query Processing by Data Mining
Query plan refinement based on query execution history
Better query planning by investigating additional data statistics
Current optimizer: key/foreign key, cardinality, # distinct values
Additional information:
Strong dependency/correlation
Histogram, dense vs. sparse regions, etc.
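Why strong dependency/correlation matters to an optimizer can be shown with a small calculation. The sketch assumes a toy table of (make, model) rows and an optimizer that multiplies per-column selectivities as if the columns were independent; the data and the helper name selectivity are made up for illustration.

```python
def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)

# Toy table: model functionally determines make, so the columns are correlated.
rows = (
    [("Honda", "Accord")] * 30
    + [("Honda", "Civic")] * 20
    + [("Toyota", "Camry")] * 50
)

def is_honda(row):
    return row[0] == "Honda"

def is_accord(row):
    return row[1] == "Accord"

# Estimate under the independence assumption: sel(make) * sel(model) * |rows|.
independent_estimate = (
    selectivity(rows, is_honda) * selectivity(rows, is_accord) * len(rows)
)

# True number of rows matching both predicates.
actual = sum(1 for r in rows if is_honda(r) and is_accord(r))

print(f"independence estimate: {independent_estimate:.1f}, actual: {actual}")
```

Here the independence estimate is about 15 rows while 30 rows actually qualify, a factor-of-two error on a single conjunction. A mined dependency such as model → make would let the planner drop the redundant predicate from the estimate.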
Conclusions
DBers have been “invading” DM and have made great contributions
It is time to consider that DM may invade the DBMS to enhance its functionality
General philosophy
Invisible data mining
Google is doing this successfully for page ranking
Can we do it to enhance DBMS?
You can do better if you know your data better!