CSCE590/822 Data Mining Principles and Applications
CSCE555 Bioinformatics
Lecture 21 Integrative Genomics
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008
www.cse.sc.edu.
Outline
What is Integrative Genomics
Why integrative genomics
The Data Sources
Integrating strategies
Issues in Integrative genomics
Application Example: disease gene prioritization
Integrative Genomics - what is it?
Acquisition, Integration, Curation, and
Analysis of biological data
Hypothesis
Integrative Genomics: the study of complex interactions between genes, organism,
and environment, the triple helix of biology. Gene <-> Organism <-> Environment
It is definitely beyond the buzzword stage - Universities now have programs named
'Integrated Genomics.'
Information is not knowledge - Albert Einstein
Why Integrative Genomics?
Support Complex Queries
• Show me all genes involved in brain development
that are expressed in the Central Nervous System.
• Show me all genes involved in brain development
in human and mouse that also show iron ion
binding activity.
• For this set of genes, what aspects of function
and/or cellular localization do they share?
• For this set of genes, what mutations are reported
to cause pathological conditions?
Integrative genomics for Biomedicine
• To correlate diseases with
  • anatomical parts affected,
  • the genes/proteins involved, and
  • the underlying physiological processes
    (interactions, pathways, processes).
• To support personalized or “tailor-made”
medicine.
How can multiple types of genome-scale data be integrated across experiments and
phenotypes in order to find genes associated with diseases?
Two Separate Worlds…..
With Some Data Exchange…
Bioinformatics & the “omes”
• Genome
• Regulome
• Transcriptome
• miRNAome
• Proteome
• Interactome
• Metabolome
• Variome
• Pharmacogenome
• Physiome
• Pathome
382 “omes” so far……… and there is “UNKNOME” too - genes with no function known
http://omics.org/index.php/Alphabetically_ordered_list_of_omics
Medical Informatics
• Disease Database
• Patient Records
• Clinical Trials
Also shown in the figure: PubMed, OMIM, Clinical Synopsis
Disease World
→Name
→Synonyms
→Related/Similar Diseases
→Subtypes
→Etiology
→Predisposing Causes
→Pathogenesis
→Molecular Basis
→Population Genetics
→Clinical findings
→System(s) involved
→Lesions
→Diagnosis
→Prognosis
→Treatment
→Clinical Trials……
Data Sources: The –Omics
Clinical data
Disease data
Bioinformatic Data - 1978 to present
• DNA sequence
• Gene expression
• Protein expression
• Protein Structure
• Genome mapping
• SNPs & Mutations
• Metabolic networks
• Regulatory networks
• Trait mapping
• Gene function analysis
• Scientific literature
• and others………..
Human Genome Project – Data Deluge
Database name        Records
Nucleotide           12,427,463
Protein              419,759
Structure            11,232
Genome Sequences     75
Popset               21,010
SNP                  11,751,216
3D Domains           41,857
Domains              19
GEO Datasets         5,036
GEO Expressions      16,246,778
UniGene              123,777
UniSTS               323,773
PubMed Central       4,278
HomoloGene           19,520
Taxonomy             1
No. of Human Gene Records currently in
NCBI: 29413 (excluding pseudogenes,
mitochondrial genes and obsolete records).
Includes ~460 microRNAs
NCBI Human Genome Statistics, as of February 12, 2008
Information Deluge…..
• 3 scientific journals in 1750
• Now - >120,000 scientific journals!
• >500,000 medical articles/year
• >4,000,000 scientific articles/year
• >16 million abstracts in PubMed
derived from >32,500 journals
A researcher would have to scan 130 different journals
and read 27 papers per day to follow a single disease,
such as breast cancer (Baasiri et al., 1999 Oncogene
18: 7958-7965).
Methods for Integration
1. Link driven federations
• Explicit links between databanks.
2. Warehousing
• Data is downloaded, filtered,
integrated and stored in a warehouse.
Answers to queries are taken from
the warehouse.
• Integrative analysis
3. Others….. Semantic Web, etc………
Link-driven Federations
1. Creates explicit links between
databanks
2. query: get interesting results and use
web links to reach related data in other
databanks
Examples: NCBI-Entrez, SRS
http://www.ncbi.nlm.nih.gov/Database/datamodel/
Querying Entrez-Gene
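The link-driven approach can be tried programmatically through the NCBI E-utilities web service, which exposes Entrez search (`esearch`) and cross-databank links (`elink`). The sketch below only builds the request URLs and does not contact NCBI. The helper names `esearch_url` and `elink_url` and the example query string are assumptions made for illustration; Gene ID 672 is the Entrez Gene record for human BRCA1.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str) -> str:
    """URL that searches one Entrez databank and returns matching record IDs."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode({"db": db, "term": term})


def elink_url(dbfrom: str, db: str, record_id: str) -> str:
    """URL that follows Entrez links from one record to a related databank."""
    return f"{EUTILS}/elink.fcgi?" + urlencode(
        {"dbfrom": dbfrom, "db": db, "id": record_id}
    )


if __name__ == "__main__":
    # Step 1: find a gene record (the query string is illustrative).
    print(esearch_url("gene", "BRCA1[Gene Name] AND Homo sapiens[Organism]"))
    # Step 2: follow links from that gene record to related PubMed records.
    print(elink_url("gene", "pubmed", "672"))
```

Fetching either URL returns XML by default. The second call is the federation step: it moves from a record in one databank to linked records in another without copying any data locally.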
Data Warehousing
Data is downloaded, filtered, integrated and
stored in a warehouse. Answers to queries
are taken from the warehouse.
Advantages
1. Good for very specific, task-based
queries and studies.
2. Since it is custom-built and usually
expert-curated, it is relatively less error-prone.
Disadvantages
1. Can quickly become outdated;
needs constant updates.
2. Limited functionality, e.g.,
restricted to one disease or one
system.
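A minimal sketch of the warehousing idea, assuming two tiny made-up extracts (a GO-style annotation table and an expression table) loaded into an in-memory SQLite database. The table names, gene names, and rows are hypothetical. The point is only that a query such as "genes involved in brain development that are expressed in the Central Nervous System" is answered from the local copy.

```python
import sqlite3

# Hypothetical extracts from two source databanks (toy rows, not real data).
GO_ROWS = [
    ("geneA", "brain development"),
    ("geneB", "brain development"),
    ("geneC", "DNA repair"),
]
EXPRESSION_ROWS = [
    ("geneA", "central nervous system"),
    ("geneB", "liver"),
    ("geneC", "central nervous system"),
]


def build_warehouse() -> sqlite3.Connection:
    """Download/filter step is faked here; rows are loaded into one local store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE go_annotation (gene TEXT, process TEXT)")
    conn.execute("CREATE TABLE expression (gene TEXT, tissue TEXT)")
    conn.executemany("INSERT INTO go_annotation VALUES (?, ?)", GO_ROWS)
    conn.executemany("INSERT INTO expression VALUES (?, ?)", EXPRESSION_ROWS)
    conn.commit()
    return conn


def genes_for(conn: sqlite3.Connection, process: str, tissue: str) -> list:
    """Answer a cross-source query entirely from the warehouse."""
    rows = conn.execute(
        "SELECT DISTINCT g.gene FROM go_annotation AS g "
        "JOIN expression AS e ON e.gene = g.gene "
        "WHERE g.process = ? AND e.tissue = ? ORDER BY g.gene",
        (process, tissue),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    warehouse = build_warehouse()
    print(genes_for(warehouse, "brain development", "central nervous system"))
```

The join is fast and self-contained, but it also shows the stated disadvantage: the answer is only as current as the last load.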
Integrative data analysis
Data is downloaded, filtered
Inference algorithms that integrate
heterogeneous data
Evidence from any single data source is usually
weak; integration enhances the signal
Agreement across sources acts like cross-validation
and reduces false positives
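One standard way to show how several weak signals add up is Fisher's method for combining p-values. It is used here purely as an illustration; the slides do not name a specific inference algorithm. The method assumes the per-source tests are independent, and the function name `fisher_combined_p` is an assumption of this sketch.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the upper tail has a closed form, so no SciPy is needed.
    """
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    tail = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return min(1.0, math.exp(-half_stat) * tail)


if __name__ == "__main__":
    # Three sources, each only borderline on its own.
    print(fisher_combined_p([0.05, 0.05, 0.05]))
```

Three sources that each give p = 0.05 combine to roughly 0.006, which is the "integration enhances the signal" effect in numerical form.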
Common Issues in Integrative Genomics
• Heterogeneous Data Sets - Data Integration
– From Genotype to Phenotype
– Experimental and Consensus Views
• Incorporation of Large Datasets
– Whole genome annotation pipelines
– Large scale mutagenesis/variation projects (dbSNP)
• Computational vs. Literature-based Data
Collection and Evaluation (MedLine)
• Data Mining
– extraction of new knowledge
– testable hypotheses (Hypothesis Generation)
No Integrative Genomics is Complete
without Ontologies
Gene World
• Gene Ontology (GO)
Biomedical World
• Unified Medical Language System (UMLS)
The 3 Gene Ontologies (Recap)
• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are
carbohydrate binding and ATPase activity
– What a product ‘does’, precise activity
• Biological Process = biological goal or objective
– broad biological goals, such as DNA repair or purine metabolism,
that are accomplished by ordered assemblies of molecular
functions
– Biological objective, accomplished via one or more ordered assemblies of functions
• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes;
examples include nucleus, telomere, and RNA polymerase II
holoenzyme
– ‘is located in’ (‘is a subcomponent of’ )
http://www.geneontology.org
What can researchers do with GO?
• Access gene product functional
information
• Find how much of a proteome is
involved in a process/function/
component in the cell
• Map GO terms and incorporate
manual annotations into own
databases
• Provide a link between biological
knowledge and
  • gene expression profiles
  • proteomics data
• Getting the GO and
GO_Association Files
• Data Mining
– My Favorite Gene
– By GO
– By Sequence
• Analysis of Data
– Clustering by
function/process
• Other Tools
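As a small illustration of "how much of a proteome is involved in a process", the sketch below counts gene products annotated to a term or to any of its more specific descendants. The three-term `is_a` chain and the gene names are a toy stand-in assumed for this example; a real analysis would load the GO and GO association files.

```python
# Toy is_a hierarchy: child term -> parent terms (a stand-in for the GO graph).
IS_A = {
    "base-excision repair": {"DNA repair"},
    "DNA repair": {"DNA metabolic process"},
    "DNA metabolic process": set(),
}

# Hypothetical direct annotations: gene product -> GO terms.
ANNOTATIONS = {
    "geneA": {"base-excision repair"},
    "geneB": {"DNA repair"},
    "geneC": {"DNA metabolic process"},
    "geneD": set(),
}


def ancestors(term):
    """Return the term itself plus every term reachable through is_a links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A.get(current, ()))
    return seen


def fraction_annotated(term):
    """Fraction of gene products annotated to `term` or one of its descendants."""
    hits = sum(
        1
        for terms in ANNOTATIONS.values()
        if any(term in ancestors(t) for t in terms)
    )
    return hits / len(ANNOTATIONS)


if __name__ == "__main__":
    for name in IS_A:
        print(name, fraction_annotated(name))
```

Propagating annotations up the hierarchy matters: a product annotated only to the specific term still counts toward the broader process.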
Unified Medical Language System (UMLS)
http://umlsks.nlm.nih.gov/kss/
The UMLS Metathesaurus contains information about biomedical
concepts and terms from many controlled vocabularies and
classifications used in patient records, administrative health data,
bibliographic and full-text databases, and expert systems.
The Semantic Network, through its semantic types, provides a
consistent categorization of all concepts represented in the UMLS
Metathesaurus. The links between the semantic types provide the
structure for the Network and represent important relationships in the
biomedical domain.
The SPECIALIST Lexicon is an English language lexicon with many
biomedical terms, containing syntactic, morphological, and orthographic
information for each term or word.
Example Study: Disease Gene
Identification and Prioritization
Hypothesis: The majority of genes that impact or cause
disease share membership in any of several
functional relationships, OR functionally similar or
related genes cause similar phenotypes.
Functional Similarity – Common/shared
•Gene Ontology term
•Pathway
•Phenotype
•Chromosomal location
•Expression
•Cis regulatory elements (Transcription factor binding sites)
•miRNA regulators
•Interactions
•Other features…..
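A minimal sketch of measuring "common/shared" features, assuming each gene is described by one set of labels per feature type and using Jaccard overlap as the similarity measure. Jaccard is an illustrative choice, not a measure named on the slides, and all feature labels below are hypothetical.

```python
def jaccard(a, b):
    """Shared items divided by all items; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical feature sets pooled from known disease genes.
REFERENCE = {
    "go_term": {"DNA repair", "cell cycle"},
    "pathway": {"pathway_1"},
    "phenotype": {"phenotype_1", "phenotype_2"},
}

# Hypothetical feature sets of one candidate gene.
CANDIDATE = {
    "go_term": {"DNA repair"},
    "pathway": {"pathway_1"},
    "phenotype": {"phenotype_3"},
}


def similarity_profile(candidate, reference):
    """Per-feature-type similarity between a candidate and the reference."""
    return {
        ftype: jaccard(candidate.get(ftype, set()), features)
        for ftype, features in reference.items()
    }


def mean_similarity(candidate, reference):
    """Unweighted average over feature types (a deliberately simple summary)."""
    profile = similarity_profile(candidate, reference)
    return sum(profile.values()) / len(profile)


if __name__ == "__main__":
    print(similarity_profile(CANDIDATE, REFERENCE))
    print(mean_similarity(CANDIDATE, REFERENCE))
```

Keeping one score per feature type, rather than pooling everything into one set, makes it visible which kind of evidence supports a candidate.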
Mining human interactome (HPRD, BioGrid)
[Figure: Known Disease Genes, Direct Interactants of Disease Genes, and Indirect Interactants of Disease Genes; counts shown in the figure: 7, 66, 778]
Which of these interactants are potential new candidates?
Prioritize candidate genes in the
interacting partners of the disease-related
genes
•Training sets: disease related genes
•Test sets: interacting partners of the training
genes
Example: Breast cancer
OMIM genes (level 0): 15
Directly interacting genes (level 1): 342
Indirectly interacting genes (level 2): 2469
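The level 0 / level 1 / level 2 sets can be derived with a breadth-first search over an interaction network, starting from the known disease genes. The sketch below assumes an undirected edge list with hypothetical gene names. In practice the edges would come from a resource such as HPRD or BioGrid.

```python
from collections import deque

# Hypothetical undirected protein-protein interactions.
INTERACTIONS = [
    ("seed1", "p1"),
    ("seed1", "p2"),
    ("seed2", "p2"),
    ("p1", "q1"),
    ("p2", "q2"),
    ("q1", "r1"),
]


def build_graph(edges):
    """Adjacency sets for an undirected interaction network."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def interactants_by_level(graph, seeds, max_level=2):
    """Group genes by shortest interaction distance from any seed gene."""
    level_of = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        gene = queue.popleft()
        if level_of[gene] == max_level:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(gene, ()):
            if neighbour not in level_of:
                level_of[neighbour] = level_of[gene] + 1
                queue.append(neighbour)
    levels = {level: set() for level in range(max_level + 1)}
    for gene, level in level_of.items():
        levels[level].add(gene)
    return levels


if __name__ == "__main__":
    network = build_graph(INTERACTIONS)
    for level, genes in interactants_by_level(network, ["seed1", "seed2"]).items():
        print(level, sorted(genes))
```

Level 0 is the training set and levels 1 and 2 form the test set. The rapid growth from level 1 to level 2 in the breast cancer example is why a ranking step is needed afterwards.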
ToppGene – General Schema
http://toppgene.cchmc.org
TOPPGene - Data Sources
1. Gene Ontology: GO and NCBI Entrez Gene
2. Mouse Phenotype: MGI (used for the first time
for human disease gene prioritization)
3. Pathways: KEGG, BioCarta, BioCyc, Reactome,
GenMAPP, MSigDB
4. Domains: UniProt (Pfam, Interpro,etc.)
5. Interactions: NCBI Entrez Gene (Biogrid,
Reactome, BIND, HPRD, etc.)
6. Pubmed IDs: NCBI Entrez Gene
7. Expression: GEO
8. Cytoband: MSigDB
9. Cis-Elements: MSigDB
10. miRNA Targets: MSigDB
(New features added)
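To use such annotation sources for ranking, one generic first step is to ask which terms are over-represented in the training set relative to the whole genome. The sketch below uses a one-sided hypergeometric test for that purpose. It is a common enrichment test chosen for illustration and is not presented as ToppGene's own scoring procedure; the function name and the toy numbers are assumptions.

```python
from math import comb


def enrichment_p(population, annotated, sample, sample_annotated):
    """One-sided hypergeometric p-value.

    population:       number of genes in the background
    annotated:        background genes carrying the term
    sample:           number of training genes
    sample_annotated: training genes carrying the term
    Returns P(at least `sample_annotated` annotated genes in a random sample).
    """
    total = comb(population, sample)
    upper = min(sample, annotated)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(sample_annotated, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 10 background genes, 2 carry the term, both are in a
    # training set of size 2.
    print(enrichment_p(10, 2, 2, 2))
```

Terms with small p-values characterize the training set, and candidate genes carrying those terms can then be scored higher within each data source.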
Benefits of Integrative Genomics
1. Unravels the connection between genotype and phenotype: systematically identifies novel phenotype–genotype relationships.
2. Generates hypotheses.
3. Paves the way for prognosis, diagnosis, and personalized medicine (adverse
drug reactions, etc.).
4. Deeper understanding of disease and an enhanced integration of medicine
with biology.
5. Increasing knowledge of the genes associated with diseases will allow
researchers to address more complicated issues, including the relative
contributions to disease of genes in the core biological set shared by all
species and those encoding proteins specific to humans; how sequence
features (such as conservation and polymorphism) relate to disease
characteristics; and how protein function relates to the outcome of clinical
treatment.
6. And MANY MORE……..
Summary
Networks and integration of databases are
keys to success in Bioinformatics.
Integration of computation and data into a
single cohesive whole will increase the
efficiency of research effort
◦ by reducing the serendipity and hit-and-miss nature of
empirical research and
◦ by providing valuable clues to biomedical researchers
on their choice of experiments, given limitations of funds,
manpower and time.
Users have to know what is available,
how to access it (and what the limitations are), and
how to use the resources they are offered.
Thank You!
Algorithms in bioinformatics
• string algorithms
• dynamic programming
• machine learning (NN, k-NN, SVM, GA, ..)
• Markov chain models
• hidden Markov models
• Markov Chain Monte Carlo (MCMC) algorithms
• stochastic context free grammars
• EM algorithms
• Gibbs sampling
• clustering
• tree algorithms
• text analysis
• hybrid/combinatorial techniques and more…