A Robust System Architecture
For Mining Semi-structured
Data
By Aby M Mathew
CSE 6331
11/30/1999
Introduction
A versatile system architecture for text
mining that differentiates and maintains
structured plus unstructured data
components.
Motivation
• A digital library could contain a very large
number of document concepts. Using SQL, it is
possible to generate quantitative rules
based on certain criteria.
• What about rules related to a subset,
such as:
– which journal publishes articles associated
with an area of interest?
Presentation Organization
• Overview of the IRIS system.
• Differences between structured &
unstructured data.
• How the data is stored.
• Algorithm used for rule generation.
• Conclusion.
Overview of the IRIS system
[Figure: IRIS architecture with the components GUI, Rule Generator, Concept Library, Database, IDM and Document Collection]
Brief Description Of Individual
Components
• Rule Generator - parses the user
request via GUI and determines an
execution strategy.
• Database - contains structured data, with
mappings between tuples and documents.
• Concept library - maintains unstructured
data as concepts, with mappings between
concepts and documents.
Contd ..
• IDM ( Information discovery module )
– extracts concepts and structured values
from a document collection
– updates the database and concept library.
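As a rough illustration of the component descriptions above, the sketch below models the database and the concept library as two in-memory mappings to document ids, with an IDM-like function that updates both. All names (`Stores`, `idm_update`, the sample journal and concepts) are hypothetical and are not taken from IRIS.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Stores:
    # Structured side: (attribute, value) tuple -> ids of documents carrying it.
    database: dict = field(default_factory=lambda: defaultdict(set))
    # Unstructured side: concept -> ids of documents mentioning it.
    concept_library: dict = field(default_factory=lambda: defaultdict(set))


def idm_update(stores, doc_id, structured, concepts):
    """Record one document's structured values and concepts (IDM-like step)."""
    for attribute, value in structured.items():
        stores.database[(attribute, value)].add(doc_id)
    for concept in concepts:
        stores.concept_library[concept].add(doc_id)


# Two made-up documents from the same (hypothetical) journal.
stores = Stores()
idm_update(stores, 1, {"journal": "J1"}, ["data mining"])
idm_update(stores, 2, {"journal": "J1"}, ["databases", "data mining"])
```

Keeping both stores keyed to document ids is what lets a rule combine a structured value with a concept later on.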
Components of the Rule
Generator
parser
optimizer
processor
• Parser - accepts data and reconditions it
for the optimizer.
• Optimizer - uses the constraints and rule
type to generate an efficient
execution plan.
• Processor - executes plans laid out by
the optimizer.
Components of the IDM
Discoverer
Extractor
Refresher
• Discoverer - Intelligent agent that
determines domains.
• Extractor - Based on the domain
knowledge, it populates the database
and concept library.
• Refresher - Helps maintain consistency
of the database and concept library.
Differences between the two data
types
• Structured data type
– Certain features that form key entities,
e.g., Author, Publisher, Date.
• Unstructured data type
– Blocks of text that cannot be identified as
structured, e.g., abstract headings,
paragraphs.
How is the data stored?
• Structured data is stored using a
relational schema that is mapped to a
database.
• Unstructured data is stored in a
compressed form using an ECH (extended
concept hierarchy).
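For the structured side, a minimal sketch using Python's built-in `sqlite3` module is given here. The `document` table, its columns and the sample rows are assumptions made for illustration; the slides do not give the actual IRIS schema. The ECH side is not shown.

```python
import sqlite3

# In-memory database standing in for the relational component.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document ("
    "doc_id INTEGER PRIMARY KEY, author TEXT, publisher TEXT, pub_date TEXT)"
)
conn.executemany(
    "INSERT INTO document VALUES (?, ?, ?, ?)",
    [
        (1, "A. Author", "Publisher X", "1998-01-01"),
        (2, "B. Author", "Publisher Y", "1998-06-01"),
        (3, "C. Author", "Publisher X", "1999-03-01"),
    ],
)


def doc_ids_for_publisher(connection, publisher):
    """Return the ids of documents whose structured publisher value matches."""
    rows = connection.execute(
        "SELECT doc_id FROM document WHERE publisher = ?", (publisher,)
    )
    return {row[0] for row in rows}


set_a = doc_ids_for_publisher(conn, "Publisher X")
```

Returning a Python set of document ids makes the result easy to combine with ids obtained from the unstructured side.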
Extended Concept Hierarchy
• This is a hierarchical form of
representing data.
• It is not always constrained to a tree
structure.
• Relationships maintain additional
links between the entities in the hierarchy.
Example University ECH
[Figure: concept hierarchy with the nodes Employees, Admin, Faculty, Full, Associate, Provost and Dean]
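One way to picture such a structure in code is a hierarchy plus a separate set of extra links. In the sketch below, the `ECH` class and its methods are hypothetical. The parent/child placement of the university nodes is an assumed reading of the example figure, and the Dean to Full link is invented purely to show a non-tree relationship.

```python
from collections import defaultdict


class ECH:
    """Toy extended concept hierarchy: a hierarchy plus extra relationship links."""

    def __init__(self):
        self.children = defaultdict(set)  # parent concept -> child concepts
        self.parents = defaultdict(set)   # child concept -> parent concepts
        self.related = defaultdict(set)   # additional, non-hierarchical links

    def add_child(self, parent, child):
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def add_relation(self, a, b):
        # Extra links are stored symmetrically.
        self.related[a].add(b)
        self.related[b].add(a)

    def neighbours(self, concept):
        """Concepts reachable from `concept` over one edge of any kind."""
        return self.children[concept] | self.parents[concept] | self.related[concept]


ech = ECH()
# Assumed hierarchy for the university example.
for parent, child in [
    ("Employees", "Admin"), ("Employees", "Faculty"),
    ("Faculty", "Full"), ("Faculty", "Associate"),
    ("Admin", "Provost"), ("Admin", "Dean"),
]:
    ech.add_child(parent, child)
# Invented cross link, only to show that the structure need not be a tree.
ech.add_relation("Dean", "Full")
```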
Calculation of minimum support
(min sup) in ECH
If C1 & C2 are the two concepts found
in the document,
then
$$\text{min sup} = \frac{|\text{documents}(C_1) \cap \text{documents}(C_2)|}{|\text{documents}(C_1) \cup \text{documents}(C_2)|}$$
where ‘documents ( c )’ is the set
of documents where concept ‘c’ occurs.
Example for calculating min
sup
Say concept C1 appears in 500
documents and C2 appears in 600
documents, 100 of which also contain
C1. The two concepts together then cover
500 + 600 - 100 = 1000 distinct documents, so
min sup = 100 / 1000 = 0.1
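The calculation can be sketched in a few lines of Python, reading the formula as the number of shared documents divided by the number of distinct documents covered by either concept, which matches the 100 / 1000 result in the example. The function name `min_sup` and the document id ranges are illustrative assumptions.

```python
def min_sup(docs_c1, docs_c2):
    """Shared documents divided by all distinct documents of the two concepts."""
    union = docs_c1 | docs_c2
    if not union:
        return 0.0  # neither concept occurs anywhere
    return len(docs_c1 & docs_c2) / len(union)


# Made-up document ids reproducing the counts in the example:
# 500 documents for C1, 600 for C2, 100 in common (ids 400..499).
c1_docs = set(range(0, 500))
c2_docs = set(range(400, 1000))
value = min_sup(c1_docs, c2_docs)
```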
Algorithm used for rule
generation
• Get Document ids of documents containing
structured data value - using SQL statements. ( set
‘A’ ).
• Get Document ids of documents containing
unstructured concept - using ECH. ( set ‘B’ ).
• $C = A \cap B$.
• Get document ids of each concept Cr that is related to
C1 via edge P, C or S, provided the min sup value of Cr & C1 is
above the min sup threshold. ( set ‘D’ ).
• $E = C \cap D$.
• confidence = ( num elements in E ) / ( num elements
in C ).
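The steps above can be sketched with plain Python sets. This is one reading of the slide rather than the IRIS implementation: it assumes C and E are intersections, that a separate set D (and so a separate confidence) is computed for each related concept Cr, and that the min sup value is the shared-over-distinct ratio from the earlier example. All function names and sample ids are hypothetical.

```python
def min_sup(docs_x, docs_y):
    """Shared documents divided by all distinct documents of the two concepts."""
    union = docs_x | docs_y
    return len(docs_x & docs_y) / len(union) if union else 0.0


def rule_confidences(set_a, concept_docs, c1, related, threshold):
    """Return {Cr: confidence} for related concepts that pass the threshold.

    set_a        -- ids of documents matching the structured value (set A)
    concept_docs -- concept -> ids of documents containing it (ECH lookup)
    related      -- concepts linked to c1 in the ECH
    """
    set_b = concept_docs[c1]   # set B
    set_c = set_a & set_b      # C = A intersect B
    confidences = {}
    for cr in related:
        set_d = concept_docs[cr]                   # set D for this Cr
        if min_sup(set_d, set_b) <= threshold:     # keep only values above it
            continue
        set_e = set_c & set_d                      # E = C intersect D
        confidences[cr] = len(set_e) / len(set_c) if set_c else 0.0
    return confidences


# Made-up ids: A from the database, the rest from the concept library.
set_a = {1, 2, 3, 4, 5}
concept_docs = {"c1": {2, 3, 4, 5, 6}, "cr": {3, 4, 6, 7}, "rare": {9}}
result = rule_confidences(set_a, concept_docs, "c1", ["cr", "rare"], 0.2)
```

Here C = {2, 3, 4, 5}; "cr" passes the threshold with a min sup value of 3 / 6 and yields E = {3, 4}, while "rare" shares no documents with "c1" and is skipped.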
Advantages of Using this
system
• Distinguishing between structured and
unstructured data helps generate more
interesting rules.
• Being domain specific improves
accuracy.
• Scalable as any database can be used
as the database component.
• Meaningful data is stored - compact
representation of the document.
Bibliography
• L. Singh, P. Scheuermann & B. Chen,
“IRIS: Our prototype rule generation
system”, 1999.
• L. Singh, P. Scheuermann & B. Chen,
“Generating Association Rules from
Semi-structured Documents using an
Extended Concept Hierarchy”, 1999.