Heterogeneous Association Rules Mining
Badr Al-Daihani
School of Computer Science
Cardiff University
Edinburgh, UK
BNCOD21
Overview
Motivation
Challenges of Bioinformatics Databases Management
Approaches to integration of bioinformatics databases
Association rule mining
Hypothesis
Basic concepts
Material and methods
Motivation
Very large, heterogeneous databases.
Need to link related data across them.
Need for integration.
Complex relationships among the data.
Challenges of Bioinformatics Databases Management
Bioinformatics database formats:
Flat files: GenBank, EMBL, DDBJ, PDB.
Relational databases: HGMD, MGMD.
Object-oriented databases: AceDB.
XML databases: PIR, SwissProt, InterPro.
Characteristics:
Diversity of data.
Representational heterogeneity.
Autonomous and web-based sources.
Varied interfaces and query capabilities.
Approaches to integration of bioinformatics databases
Multiple models of data integration:
• Federation
• Warehousing
• Mediations
Federation
Provides access to distributed data while preserving the autonomy of each database.
Examples:
K2/BioKleisli
Entrez
Warehousing
Imports data from remote sources and copies it to a local server.
Examples:
GUS (Genome Unified Schema)
Sequence Retrieval System (SRS)
Mediations
• Stores no data of its own; rather, it provides a virtual view of the integrated sources.
• Examples:
Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS)
Knowledge-based Integration of Neuroscience Data (KIND)
Hypothesis:
It is possible to mine diverse databases to recover datasets that relate a
disease to its associated gene mutations and mutagens, helping
scientists understand its causes.
Association Rules
Association rules: interesting association relationships among items in
large sets of transactions.
An association rule is an expression of the form X => Y, where X and Y
are sets of items.
Goal of association analysis: to find all association rules that satisfy
user-specified minimum support and minimum confidence thresholds.
Examples:
– Rule form: “Body => Head [support, confidence]”.
– buys(x, “diapers”) => buys(x, “beers”) [0.5%, 60%]
– major(x, “CS”) ^ takes(x, “DB”) => grade(x, “A”) [1%, 75%]
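The two measures in the rule form above can be made concrete with a small sketch. Support of X => Y is taken here as the fraction of transactions containing every item of X and Y, and confidence as that support divided by the support of X alone. The basket contents below are made-up toy data, and the function names are illustrative rather than taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, body, head):
    """Support of body and head together, divided by support of the body."""
    body_support = support(transactions, body)
    if body_support == 0:
        return 0.0
    return support(transactions, set(body) | set(head)) / body_support


# Toy baskets (invented for illustration only).
baskets = [
    frozenset({"diapers", "beers", "milk"}),
    frozenset({"diapers", "beers"}),
    frozenset({"diapers", "bread"}),
    frozenset({"milk", "bread"}),
]

# Rule: buys(x, "diapers") => buys(x, "beers")
rule_support = support(baskets, {"diapers", "beers"})
rule_confidence = confidence(baskets, {"diapers"}, {"beers"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is reported only if both values reach the user-specified minimum thresholds.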
Association Rules
Applications:
Basket data analysis
Genomic Data
Cross-marketing
Catalog design
Sales campaign analysis
Web personalization
Clustering, classification, etc.
Basic Concepts
The discovery of interesting association relationships among large
amounts of gene mutation data can help in determining the causes of
mutations in tumours and diseases.
A gene is a segment of a DNA molecule that contains all the information
required for synthesis of a product.
A gene mutation is any change in the DNA sequence of a gene.
Types of mutations: Insertion, Deletion, Insertion/Deletion, Complex,
and Multiple Substitution.
Material and Methods
HGMD database
The Human Gene Mutation Database (HGMD) is run by the University of
Wales College of Medicine.
Records known (published) gene lesions responsible for human inherited
disease.
Provides information about practical diagnoses.
Material and Methods
MGMD database
The Mammalian Gene Mutation Database (MGMD) is run by the Centre of
Molecular Genetics and Toxicology, University of Wales Swansea.
Holds profiles of known (published) mutagen-induced gene mutations.
Stores mutation spectra information.
Contains 39,134 records.
Material and Methods
Itemsets whose elements occur in both databases will be retrieved in
order to discover interesting association rules among genes,
mutations, mutagens and diseases.
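One way to picture this step is the following minimal sketch, assuming that records from the two databases can be matched on gene and mutation type. The field names and every value below are hypothetical placeholders; the slides do not give the real HGMD or MGMD schemas, and the brute-force rule search is used only because the toy data is tiny (it is not claimed to be the mining algorithm used in this work).

```python
from itertools import combinations

# Hypothetical, made-up records; not real HGMD or MGMD content.
hgmd_like = [
    {"gene": "GENE_A", "mutation": "deletion", "disease": "disease_1"},
    {"gene": "GENE_A", "mutation": "insertion", "disease": "disease_1"},
    {"gene": "GENE_B", "mutation": "deletion", "disease": "disease_2"},
]
mgmd_like = [
    {"gene": "GENE_A", "mutation": "deletion", "mutagen": "mutagen_X"},
    {"gene": "GENE_A", "mutation": "insertion", "mutagen": "mutagen_X"},
    {"gene": "GENE_B", "mutation": "deletion", "mutagen": "mutagen_Y"},
    {"gene": "GENE_C", "mutation": "complex", "mutagen": "mutagen_Y"},
]


def build_transactions(left, right, keys=("gene", "mutation")):
    """Join records that agree on `keys`; each match becomes a set of field=value items."""
    transactions = []
    for a in left:
        for b in right:
            if all(a[k] == b[k] for k in keys):
                merged = {**a, **b}
                transactions.append(frozenset(f"{k}={v}" for k, v in merged.items()))
    return transactions


def mine_rules(transactions, min_support, min_confidence):
    """Enumerate every rule body => head that meets both thresholds (toy-scale only)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = supp(itemset)
            if s == 0 or s < min_support:
                continue
            # Split the frequent itemset into every non-empty body/head pair.
            for r in range(1, size):
                for body_items in combinations(combo, r):
                    body = frozenset(body_items)
                    conf = s / supp(body)
                    if conf >= min_confidence:
                        rules.append((body, itemset - body, s, conf))
    return rules


transactions = build_transactions(hgmd_like, mgmd_like)
rules = mine_rules(transactions, min_support=0.5, min_confidence=1.0)
for body, head, s, conf in rules:
    print(sorted(body), "=>", sorted(head), f"[{s:.0%}, {conf:.0%}]")
```

With realistic record counts, an Apriori-style search that prunes infrequent itemsets early would be needed in place of the exhaustive enumeration.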
Material and Methods
System architecture (diagram, layers as listed on the slide):
Graphical User Interface (GUI)
Mining tools
Query interpreter
Wrappers, one per source database: HGMD, MGMD, DBn
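The wrapper layer can be sketched roughly as follows, assuming each wrapper translates its source's own field names into a shared vocabulary and the query interpreter forwards a query to every wrapper and merges the answers. All class names, source field names (GeneSymbol, Phenotype, locus, agent) and values are invented for illustration; the slides name the components but do not describe their interfaces.

```python
from abc import ABC, abstractmethod


class Wrapper(ABC):
    """Hides one source's record layout behind a shared field vocabulary."""

    @abstractmethod
    def fetch(self, gene):
        """Return a list of dicts for `gene`, keyed by the shared field names."""


class InMemoryWrapper(Wrapper):
    """Hypothetical stand-in for a real source: a list of rows plus a field mapping."""

    def __init__(self, name, rows, field_map):
        self.name = name
        self.rows = rows
        self.field_map = field_map  # source field name -> shared field name

    def fetch(self, gene):
        results = []
        for row in self.rows:
            record = {self.field_map[k]: v for k, v in row.items() if k in self.field_map}
            if record.get("gene") == gene:
                record["source"] = self.name
                results.append(record)
        return results


class QueryInterpreter:
    """Sends a query to every registered wrapper and concatenates the results."""

    def __init__(self, wrappers):
        self.wrappers = list(wrappers)

    def query(self, gene):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.fetch(gene))
        return results


# Made-up rows and field names, for illustration only.
hgmd = InMemoryWrapper(
    "HGMD",
    [{"GeneSymbol": "GENE_A", "Phenotype": "disease_1"}],
    {"GeneSymbol": "gene", "Phenotype": "disease"},
)
mgmd = InMemoryWrapper(
    "MGMD",
    [{"locus": "GENE_A", "agent": "mutagen_X"}, {"locus": "GENE_B", "agent": "mutagen_Y"}],
    {"locus": "gene", "agent": "mutagen"},
)

interpreter = QueryInterpreter([hgmd, mgmd])
records = interpreter.query("GENE_A")
for record in records:
    print(record)
```

Adding a further source (the DBn box in the diagram) would then only require another wrapper with its own field mapping; the mining tools above the interpreter see one record format.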