Transcript BIBE06Day3

Assigning Schema Labels Using
Ontology and Heuristics
Xuan Zhang, Ruoming Jin, Gagan Agrawal
Problem Statement
• Setting: An online integration system finds a new
data file…
• Question: Can it be integrated into the system on
the fly? How?
• Sub-tasks:
• Understand the data
• Talk to data host
• Consult field expert
• Process the data
• Database administrator
• Programmer
Can we automate
the process?
On-the-fly Integration Overview
[Figure: dataset flat file → layout learning → layout descriptor → parser generation → parser → parsing → raw attribute values → value cleaning and summarization → attribute summaries → score calculation → scores → labeling → labels. An expert or clustering algorithm supplies the cutoff values used for labeling.]
Steps:
1. Delimiter Identification (Ref [25], [26])
2. Wrapper Generation (Ref [32])
3. Schema Mining
Schema Mining
• Assign meaning (labels or names) to
attributes in a data set
• Challenges
• What
• Delimiters
• Values
• How
• Top-down
• Bottom-up
Our Approach
• Summarize attribute values from bottom up
• Similarity between ontology and schema
• An attribute a with label att, a value v
• Schema: “v is-a att”
• Ontology: “Node(v) is a child of Node(att)”
• E.g., protein is-a molecule type
• Common ancestor of values in ontology ~
attribute label in schema
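The idea that a common ancestor of the values approximates the attribute label can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy ontology, the `PARENT` table, and the function names are all assumptions made for the example.

```python
# Toy ontology as child -> parent links (hypothetical terms, for illustration only).
PARENT = {
    "protein": "molecule type",
    "DNA": "molecule type",
    "mRNA": "RNA",
    "RNA": "molecule type",
    "molecule type": "root",
    "nucleus": "cellular component",
    "cellular component": "root",
}

def ancestors(term):
    """Return the chain [term, parent, grandparent, ...] up to the top of the ontology."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def common_ancestor(values):
    """Return the deepest node that lies on the ancestor chain of every value, or None."""
    chains = [ancestors(v) for v in values]
    shared = set(chains[0]).intersection(*chains[1:])
    # Walking one chain from the leaf upward, the first shared node is the deepest one.
    for node in chains[0]:
        if node in shared:
            return node
    return None

print(common_ancestor(["protein", "DNA", "mRNA"]))
```

A value that the ontology does not contain shares no ancestor with the others, so the function returns `None`; a real system would need the fault tolerance discussed on the following slide.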
Real-world Complications
• Complete comprehensive ontology database
• Selective sampling
• Error-free dataset
• Adjustable sensitivity and fault tolerance
• Time
• Data mining + statistical analysis
Remark: attribute label ≠ attribute name
E.g., date: {creation date, last modification date}
Outline
• Motivation
• System
• Mining algorithm
• Experiment
Schema Mining System
[Figure: value cleaning and summarization → attribute summaries → score calculation → scores → labeling → attribute labels. An expert or clustering algorithm supplies the cutoff values used for labeling.]
• System
• Data cleaning and summarization
• Score function
• Ontology database
Data Summarization
• Token profile: an ordered list of
N (numerical), A (alphabetic) and special
characters
• E.g., Profile(“polyA_site”) = A_A
• Token category: word, number, or other
• Frequent tokens
• Approximate frequent token mining algorithm
• Assumption: tokens are evenly distributed
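A small sketch of the summarization step may help. The profile function follows the slide's example (runs of letters become A, runs of digits become N, special characters are kept). The category rule and the `frequent_tokens` helper are assumptions: the slide does not specify the approximate mining algorithm, so counting tokens in a prefix sample is used here only as a stand-in that relies on the even-distribution assumption.

```python
import re
from collections import Counter

def token_profile(token):
    """Collapse each run of digits to N and each run of letters to A; keep other characters."""
    def classify(match):
        return "N" if match.group(0)[0].isdigit() else "A"
    return re.sub(r"[0-9]+|[A-Za-z]+", classify, token)

def token_category(token):
    """Assumed mapping from profile to category: word, number, or other."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in {"N", "N.N"}:
        return "number"
    return "other"

def frequent_tokens(values, sample_size=1000, top_k=5):
    """Stand-in for approximate mining: count tokens in a prefix sample only.
    If tokens are evenly distributed, the sample ranking approximates the full ranking."""
    sample = values[:sample_size]
    counts = Counter(tok for value in sample for tok in value.split())
    return counts.most_common(top_k)

print(token_profile("polyA_site"))
print(token_category("3.14"))
print(frequent_tokens(["Homo sapiens", "Homo erectus", "Mus musculus"], top_k=1))
```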
Template Scoring Function
• Simple
• Adjustable trade-off between sensitivity and error tolerance
• Desired property
[Figure: template scoring function; vertical axis from 0.0 to 1.0; labels: Temperature, F_pt, B_pt, t]
Ontology Database
• Goal: to approximate a complete
comprehensive ontology database
• Approach
• “Complete”: sample popular terms
• “Comprehensive”: public ontology databases +
common facts
• Result
• 6 major categories, 386 terms
Ontology Contents
• Sample existing databases
• Organism name: NCBI Taxonomy + taxonomy hierarchy
• Cellular component: Gene Ontology
• Publication method: NCBI Entrez Journals
+ direct submission
• New categories
• Biology database: popular database names
• Molecular type: biology fact
• Free text: common words in natural language
• Enhancement
Outline
• Motivation
• System
• Mining algorithm
• Using ontology
• Using heuristics
• Experiment
Mining With Ontology
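1. Occurrence of a term:
$$
\mathrm{Occurrence}(term) =
\begin{cases}
\mathrm{Frequent\_Counts}[i], & \text{if } term = \mathrm{Frequent\_Tokens}[i] \\
\min_{i \in [0,\,t]} \mathrm{Frequent\_Counts}[i], & \text{if } term = \mathrm{Frequent\_Tokens}[0] \mid \dots \mid \mathrm{Frequent\_Tokens}[t] \\
0, & \text{otherwise}
\end{cases}
$$
2. Strength of a term:
$$
\mathrm{Strength}(term) = \mathrm{Occurrence}(term) + \sum_{child\_term} \mathrm{Strength}(child\_term)
$$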
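The two definitions above can be sketched in a few lines. This is an illustration under stated assumptions: a multi-token term is taken to be a whitespace-separated sequence of frequent tokens, and the `children` and `frequent_counts` tables are hypothetical sample data, not values from the talk.

```python
def occurrence(term, frequent_counts):
    """Count for a single frequent token, minimum count for a multi-token term, else 0."""
    tokens = term.split()
    if tokens and all(tok in frequent_counts for tok in tokens):
        return min(frequent_counts[tok] for tok in tokens)
    return 0

def strength(term, children, frequent_counts):
    """Occurrence of the term plus the strength of every child term in the ontology."""
    own = occurrence(term, frequent_counts)
    below = sum(strength(child, children, frequent_counts)
                for child in children.get(term, []))
    return own + below

# Hypothetical ontology fragment and frequent-token counts.
children = {"organism": ["Homo sapiens", "Mus musculus"]}
frequent_counts = {"Homo": 40, "sapiens": 38, "Mus": 10, "musculus": 10}

print(occurrence("Homo sapiens", frequent_counts))
print(strength("organism", children, frequent_counts))
```

Because strength accumulates upward, an inner node such as "organism" can score highly even when its own name never appears among the values.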
Mining With Ontology
• Likelihood of an attribute being labeled with l
• Factors:
• Relative strength of term l compared with that of
other terms
• Completeness of ontology
• Score = product of two factors, modulated by
the template scoring function
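The slide gives the shape of the score but not its formulas, so the following is only a sketch in which every ingredient is an assumption: relative strength is taken as a share of the total strength, completeness is passed in as a number in [0, 1], "modulated" is read as applying the function to the product, and `template_score` is a hypothetical logistic curve with a temperature parameter, not the authors' template scoring function.

```python
import math

def template_score(x, temperature=0.1):
    """Hypothetical squashing function on [0, 1], NOT the formula from the talk.
    A low temperature gives a sharp threshold; a high one gives a flatter, more tolerant curve."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return 1.0 / (1.0 + math.exp(-(x - 0.5) / temperature))

def relative_strength(label, strengths):
    """Assumed definition: strength of the label as a share of all candidate strengths."""
    total = sum(strengths.values())
    return strengths[label] / total if total else 0.0

def label_score(label, strengths, completeness, temperature=0.1):
    """Product of the two factors, passed through the assumed template function."""
    product = relative_strength(label, strengths) * completeness
    return template_score(product, temperature)

# Hypothetical strengths for two candidate labels of one attribute.
strengths = {"organism": 48, "molecule type": 12}
print(label_score("organism", strengths, completeness=0.9))
print(label_score("molecule type", strengths, completeness=0.9))
```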
Mining With Heuristics
• Use token profile
• “number”: {N, N.N}
• “date”: {N-A-N, N/N/N}
• Use frequent token counts
• “identification”: Frequent_Counts[]=1
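The profile and count heuristics above can be sketched as a simple rule chain. The profile sets come from the slide; the order of the checks, the reading of Frequent_Counts[]=1 as "every value occurs exactly once", and the function names are assumptions made for the example.

```python
import re

NUMBER_PROFILES = {"N", "N.N"}
DATE_PROFILES = {"N-A-N", "N/N/N"}

def token_profile(token):
    """Collapse each run of digits to N and each run of letters to A; keep other characters."""
    def classify(match):
        return "N" if match.group(0)[0].isdigit() else "A"
    return re.sub(r"[0-9]+|[A-Za-z]+", classify, token)

def heuristic_label(values):
    """Return "number", "date", "identification", or None for a list of attribute values."""
    if not values:
        return None
    profiles = {token_profile(v) for v in values}
    if profiles <= NUMBER_PROFILES:
        return "number"
    if profiles <= DATE_PROFILES:
        return "date"
    # Assumed reading of Frequent_Counts[] = 1: no value is repeated.
    if len(set(values)) == len(values):
        return "identification"
    return None

print(heuristic_label(["3.14", "42"]))
print(heuristic_label(["12-JAN-2006", "01/02/2006"]))
print(heuristic_label(["AB123", "CD456"]))
```

Checking numbers and dates before uniqueness keeps a column of distinct dates from being labeled as identifiers; a different ordering would be equally defensible.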
Mining With Heuristics
• Use other token information
• “biological sequence”: length >45, or in 10’s
• Use token sequence information
• “people name”: length (2~3), separator (“,” or
“and”), profile (not number, date)
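These two heuristics can also be sketched in code, with some guesswork. "In 10's" is read here as whitespace-separated blocks of exactly ten letters, the letters-only requirement for sequences is an added assumption, and a people-name value is taken to be a list of names split on "," or "and", each with two or three tokens.

```python
import re

NON_NAME_PROFILES = {"N", "N.N", "N-A-N", "N/N/N"}

def token_profile(token):
    """Collapse each run of digits to N and each run of letters to A; keep other characters."""
    def classify(match):
        return "N" if match.group(0)[0].isdigit() else "A"
    return re.sub(r"[0-9]+|[A-Za-z]+", classify, token)

def looks_like_sequence(value):
    """Letters only, with a token longer than 45 characters or all tokens exactly 10 long."""
    tokens = value.split()
    if not tokens or not all(t.isalpha() for t in tokens):
        return False
    return any(len(t) > 45 for t in tokens) or all(len(t) == 10 for t in tokens)

def looks_like_people_names(value):
    """Names separated by "," or "and", 2 to 3 tokens each, none shaped like a number or date."""
    names = [n.strip() for n in re.split(r",|\band\b", value) if n.strip()]
    if not names:
        return False
    for name in names:
        parts = name.split()
        if not 2 <= len(parts) <= 3:
            return False
        if any(token_profile(p) in NON_NAME_PROFILES for p in parts):
            return False
    return True

print(looks_like_sequence("gatcctccat atacaacggt atctccacct"))
print(looks_like_people_names("Ada Lovelace, Alan Turing and Grace Hopper"))
```

Rules this loose will misfire on ordinary text (any ten-letter word passes the sequence test), which is one reason the scores are later grouped rather than trusted individually.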
Experimental Results
• Datasets
• GenBank, UniProt SWISSPROT and Pfam
• Cutoff values
• Cluster scores into the groups most, middle, and little
by minimizing standard deviation
• Evaluation
• Weighted Cohen’s Kappa: compare the groups
most, middle, and little with the true labels Y (yes),
P (partial), and N (no)
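The cutoff search and the agreement measure can be sketched as below. Both functions rest on assumptions: the objective is taken to be the sum of the three within-group standard deviations, found by brute force over sorted scores, and the Kappa uses linear disagreement weights, since the slide does not say which weighting was applied.

```python
from statistics import pstdev

def three_groups(scores):
    """Split sorted scores into (little, middle, most), minimizing summed standard deviation."""
    s = sorted(scores)
    if len(s) < 3:
        raise ValueError("need at least three scores")
    best = None
    for i in range(1, len(s) - 1):
        for j in range(i + 1, len(s)):
            groups = (s[:i], s[i:j], s[j:])
            cost = sum(pstdev(g) for g in groups)
            if best is None or cost < best[0]:
                best = (cost, groups)
    return best[1]

def weighted_kappa(predicted, truth, k=3):
    """Linearly weighted Cohen's Kappa for ordinal categories coded 0 .. k-1."""
    n = len(predicted)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(predicted, truth):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weight = lambda i, j: abs(i - j) / (k - 1)
    dis_obs = sum(weight(i, j) * observed[i][j] for i in range(k) for j in range(k))
    dis_exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    # dis_exp is zero only if both sides use a single category; Kappa is undefined then.
    return 1.0 - dis_obs / dis_exp

# Hypothetical scores; groups little/middle/most are coded 0/1/2 to match N/P/Y.
print(three_groups([0.9, 0.1, 0.5, 0.95, 0.15, 0.55]))
print(weighted_kappa([0, 1, 2, 2], [0, 1, 2, 2]))
```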
Results: Summary
Very good
Good
Moderate
Category 1: cellular component, 2: database, 3: date, 4: free text, 5: ID, 6: molecule type,
7: name, 8: number, 9: organism, 10: publication method, 11: sequence
Results: Cellular Component (Ontology)
Results: Biology Database (Ontology)
Results: Free Text (Ontology)
Results: Molecule Type (Ontology)
Results: Organism Name (Ontology)
Results: Publication Method (Ontology)
Results: Date (Heuristics)
Results: ID (Heuristics)
Results: People Name (Heuristics)
Results: Number (Heuristics)
Results: Bio. Sequence (Heuristics)
Discussion: Hits and Misses
• According to the Kappa tests, results are good or very
good
• Possible improvements
• Better clustering method
• Bigger ontology database
• More involved language analysis
• Hybrid of bottom-up and top-down approaches