Presentation - Anil Jegga - Cincinnati Children`s Hospital Medical


[Figure: Genes and Diseases linked through Anatomy, Physiology, Medical Informatics, and Bioinformatics, leading to novel relationships & deeper insights]
Mining Bio-Medical Mountains
& YOU
How Computer Science can help
Biomedical Research and Health Sciences
Anil Jegga
4/4/2016
Division of Biomedical Informatics,
Cincinnati Children’s Hospital Medical Center (CCHMC)
Department of Pediatrics, University of Cincinnati
http://anil.cchmc.org
Acknowledgement
Biomedical Engineering/Bioinformatics
• Jing Chen
• Sivakumar Gowrisankar
• Vivek Kaimal
Computer Science
• Amit Sinha
• Mrunal Deshmukh
• Divya Sardana
Electrical Engineering
• Nishanth Vepachedu
Two Separate Worlds…..
Disease World (Medical Informatics)
Disease Database, Patient Records, Clinical Trials, PubMed, OMIM, Clinical Synopsis
→Name
→Synonyms
→Related/Similar Diseases
→Subtypes
→Etiology
→Predisposing Causes
→Pathogenesis
→Molecular Basis
→Population Genetics
→Clinical findings
→System(s) involved
→Lesions
→Diagnosis
→Prognosis
→Treatment
→Clinical Trials……
Bioinformatics & the “omes”
Genome, Regulome, Transcriptome, miRNAome, Proteome, Interactome,
Metabolome, Variome, Pharmacogenome, Physiome, Pathome
382 “omes” so far………
and there is “UNKNOME” too: genes with no function known
http://omics.org/index.php/Alphabetically_ordered_list_of_omics
With Some Data Exchange…
now…. The number 1 FAQ
How much biology should I know??
No simple or straight-forward
answer… unfortunately!
But the mantra is:
Interact routinely with biologists
OR
Work with the biologists or the
biological data
But I want to learn some basics…
1. http://www.ncbi.nlm.nih.gov/Education
2. http://www.ebi.ac.uk/2can/
3. http://www.genome.gov/Education/
4. http://genomics.energy.gov/
Books
1. Introduction to Bioinformatics by Teresa Attwood, David Parry-Smith
2. A Primer of Genome Science by Gibson G and Muse SV
3. Bioinformatics: A Practical Guide to the Analysis of Genes and
Proteins, Second Edition by Andreas D. Baxevanis, B. F. Francis
Ouellette
4. Algorithms on Strings, Trees, and Sequences: Computer Science and
Computational Biology by Dan Gusfield
5. Bioinformatics: Sequence and Genome Analysis by David W. Mount
6. Discovering Genomics, Proteomics, and Bioinformatics by A. Malcolm
Campbell and Laurie J. Heyer
And the other FAQs….
1. What bioinformatics topics are closest to
computer science?
2. Should computer science departments
involve themselves in preparing their
graduates for careers in bioinformatics?
3. And if so, what topics should they cover?
4. And how much biology should they be
taught?
5. Lastly, how much effort should be
expended in re-directing computer
scientists to do work in bioinformatics?
Cohen, 2005; Communications of the ACM
Issues to be considered……..
1. Computer science vs molecular biology –
Subject & Scientists - Cultural
differences
2. Current goals of molecular biology,
genomics (or biomedical research in a
broader sense)
3. Data types used in bioinformatics or
genomics
4. Areas within computer science of interest
to biologists
5. Bioinformatics research - Employment
opportunities
Biological Challenges - Computer Engineers
• Post-genomic Era and the goal of biomedicine
– to develop a quantitative understanding of how
living things are built from the genome that
encodes them.
• Deciphering the genome code
– Identifying unknown genes and assigning function
by computational analysis of genomic sequence
– Identifying the regulatory mechanisms
– Identifying their role in normal
development/states vs disease states
Biological Challenges - Computer Engineers
• Data Deluge: exponential growth of data
silos and different data types
– Human-computer interaction specialists need to
work closely with academic and clinical
biomedical researchers to not only manage the
data deluge but to convert information into
knowledge.
• Biological data is very complex and
interlinked!
– Creating information systems that allow
biologists to seamlessly follow these links
without getting lost in a sea of information - a
huge opportunity for computer scientists.
Biological Challenges - Computer Engineers
• Networks, networks, and networks!
– Each gene in the genome is not an independent
entity. Multiple genes interact to perform a
specific function.
– Environmental influences – Genotype-environment
interaction
– Integrating genomic and biochemical data
together into quantitative and predictive
models of biochemistry and physiology
– Computer scientists, mathematicians, and
statisticians will ALL be an integral and critical
part of this effort.
A major goal in molecular biology is Functional Genomics –
Study of the relationships among genes in DNA & their
function – in normal and disease states
Informatics – Biologists’ Expectations
• Representation, Organization, Manipulation,
Distribution, Maintenance, and Use of information,
particularly in digital form.
• Functional aspect of bioinformatics:
Representation, Storage, and Distribution of data.
– Intelligent design of data formats and databases
– Creation of tools to query those databases
– Development of user interfaces or visualizations
that bring together different tools to allow the
user to ask complex questions or put forth
testable hypotheses.
Informatics – Biologists’ Expectations
• Developing analytical tools to discover knowledge in
data
– Levels at which biological information is used:
• comparing sequences – predict function of a
newly discovered gene
• breaking down known 3D protein structures
into bits to find patterns that can help predict
how the protein folds
• modeling how proteins and metabolites in a cell
work together to make the cell function…….
Finally….
What does informatics mean to biologists?
The ultimate goal of analytical
bioinformaticians is to develop predictive
methods that allow biomedical
researchers and scientists to model the
function and phenotype of an organism
based only on its genomic sequence. This
is a grand goal, and one that will be
approached only in small steps, by many
scientists from different but allied
disciplines working cohesively.
Biology – Data Structures
Four broad categories:
1. Strings: To represent DNA, RNA, amino
acid sequences of proteins
2. Trees: To represent the evolution of
various organisms (Taxonomy) or
structured knowledge (Ontologies)
3. Sets of 3D points and their linkages: To
represent protein structures
4. Graphs: To represent metabolic,
regulatory, and signaling networks or
pathways
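The four categories above can be sketched with plain Python containers. This is only an illustration under assumptions: the sequence, the taxonomy fragment, the atom coordinates, and the network below are toy values, and the helper names `leaves` and `find_all` are hypothetical.

```python
from collections import namedtuple

# 1. String: a DNA sequence (toy example, not a real gene).
dna = "ATGGCCATTGTAATGGGCCGC"

# 2. Tree: a tiny taxonomy as nested dicts (name -> subtree).
taxonomy = {"Mammalia": {"Primates": {"Homo sapiens": {}},
                         "Rodentia": {"Mus musculus": {}}}}

# 3. 3D points and linkages: atoms with coordinates plus bonds between them.
Atom = namedtuple("Atom", "name x y z")
atoms = [Atom("N", 0.0, 0.0, 0.0), Atom("CA", 1.5, 0.0, 0.0), Atom("C", 2.0, 1.4, 0.0)]
bonds = [(0, 1), (1, 2)]  # index pairs into `atoms`

# 4. Graph: a network as an adjacency dict (node -> downstream nodes).
network = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}


def leaves(tree):
    """Return the leaf names of a nested-dict tree, left to right."""
    out = []
    for name, sub in tree.items():
        out.extend(leaves(sub) if sub else [name])
    return out


def find_all(seq, motif):
    """Return every start index where `motif` occurs in `seq` (overlaps allowed)."""
    hits, i = [], seq.find(motif)
    while i != -1:
        hits.append(i)
        i = seq.find(motif, i + 1)
    return hits
```

Here `find_all(dna, "ATG")` is a substring query and `leaves(taxonomy)` walks a tree; real protein structures and pathways would normally come from dedicated file formats rather than hand-written literals.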
Biology – Data Structures
Biologists are also interested in
1. Substrings
2. Subtrees
3. Subsets of points and linkages, and
4. Subgraphs.
Beware: Biological data is often
characterized by huge size, the
presence of laboratory errors (noise),
duplication, and sometimes unreliability.
Support Complex Queries – A typical demand
• Get me all genes involved in or associated with
brain development that are differentially
expressed in the Central Nervous System.
• Get me all genes involved in brain development in
human and mouse that also show iron ion binding
activity.
• For this set of genes, what aspects of function
and/or cellular localization do they share?
• For this set of genes, what mutations are
reported to cause pathological conditions?
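Queries like these reduce to set operations once annotations are in one place. A minimal sketch, assuming a hypothetical in-memory table that maps each gene to its annotation terms (the gene names and the table itself are made up for illustration):

```python
# Hypothetical annotation table: gene -> set of annotation terms.
annotations = {
    "GENE_A": {"brain development", "iron ion binding"},
    "GENE_B": {"brain development"},
    "GENE_C": {"iron ion binding", "cell cycle"},
}


def genes_with_all(terms, table):
    """Return the genes annotated with every term in `terms`."""
    wanted = set(terms)
    return {gene for gene, have in table.items() if wanted <= have}


def shared_terms(genes, table):
    """Return the annotation terms common to every gene in `genes`."""
    sets = [table[g] for g in genes]
    return set.intersection(*sets) if sets else set()
```

For example, `genes_with_all(["brain development", "iron ion binding"], annotations)` answers the second query on the toy table, and `shared_terms` answers "what do these genes have in common?". The hard part in practice is building the table from heterogeneous sources, not the query itself.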
Model Organism Databases: Common Issues
• Heterogeneous Data Sets - Data Integration
– From Genotype to Phenotype
– Experimental and Consensus Views
• Incorporation of Large Datasets
– Whole genome annotation pipelines
– Large scale mutagenesis/variation projects (dbSNP)
• Computational vs. Literature-based Data
Collection and Evaluation (MedLine)
• Data Mining
– extraction of new knowledge
– testable hypotheses (Hypothesis Generation)
Bioinformatic Data-1978 to present
• DNA sequence
• Gene expression
• Protein expression
• Protein Structure
• Genome mapping
• SNPs & Mutations
• Metabolic networks
• Regulatory networks
• Trait mapping
• Gene function analysis
• Scientific literature
• and others………..
Human Genome Project – Data Deluge
NCBI Human Genome Statistics – as of February 12, 2008
Database name       Records
Nucleotide          12,427,463
Protein             419,759
Structure           11,232
Genome Sequences    75
Popset              21,010
SNP                 11,751,216
3D Domains          41,857
Domains             19
GEO Datasets        5,036
GEO Expressions     16,246,778
UniGene             123,777
UniSTS              323,773
PubMed Central      4,278
HomoloGene          19,520
Taxonomy            1
No. of Human Gene Records currently in NCBI: 29413
(excluding pseudogenes, mitochondrial genes and obsolete
records). Includes ~460 microRNAs
The Gene Expression Data Deluge
Till 2000: 413 papers on microarray!
Year    PubMed Articles
2001    834
2002    1557
2003    2421
2004    3508
2005    4400
2006    4824
2007    5108
2008    647…
Problems Deluge!
Allison DB, Cui X, Page GP,
Sabripour M. 2006. Microarray
data analysis: from disarray to
consolidation and consensus.
Nat Rev Genet. 7(1): 55-65.
Information Deluge…..
• 3 scientific journals in 1750
• Now - >120,000 scientific journals!
• >500,000 medical articles/year
• >4,000,000 scientific articles/year
• >16 million abstracts in PubMed
derived from >32,500 journals
A researcher would have to scan 130 different
journals and read 27 papers per day to follow a
single disease, such as breast cancer (Baasiri et al.,
1999 Oncogene 18: 7958-7965).
Data-driven Problems…..
What’s in a name!
Rose is a rose is a rose is a rose!
Gene Nomenclature
•Accelerin
•Antiquitin
•Bang Senseless
•Bride of Sevenless
•Christmas Factor
•Cockeye
•Crack
•Draculin
•Dickie’s small eye
•Fidgetin
•Gleeful
•Knobhead
•Lunatic Fringe
•Mortalin
•Orphanin
•Profilactin
•Sonic Hedgehog
1. Generally, the names refer to some feature of the mutant
phenotype
2. Dickie’s small eye (Thieler et al., 1978, Anat Embryol (Berl), 155:
81-86) is now Pax6
3. Gleeful: "This gene encodes a C2H2 zinc finger transcription
factor with high sequence similarity to vertebrate Gli
proteins, so we have named the gene gleeful (Gfl)." (Furlong et
al., 2001, Science 293: 1632)
Disease names
• Mobius Syndrome with Poland’s Anomaly
• Werner’s syndrome
• Down’s syndrome
• Angelman’s syndrome
• Creutzfeldt-Jakob disease
• How to name or describe proteins, genes, drugs, diseases and conditions consistently and
coherently?
• How to ascribe and name a function, process or location consistently?
• How to describe interactions, partners, reactions and complexes?
Some Solutions
• Develop/Use controlled or restricted vocabularies (IUPAC-like naming conventions,
HGNC, MGI, UMLS, etc.)
• Create/Use thesauruses, central repositories or synonym lists (MeSH, UMLS, etc.)
• Work towards synoptic reporting and structured abstracting
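A synonym list can be as simple as a lookup from normalized free text to one canonical symbol. The sketch below is an assumption-laden toy: only the pairs mentioned on these slides are included (Dickie’s small eye is now Pax6, gleeful is Gfl, p53 is TP53), and the `SYNONYMS` table and `normalize` function are hypothetical names, not part of HGNC, MGI, or UMLS.

```python
# Toy synonym list built from names that appear on the slides.
# Keys are lower-cased, whitespace-collapsed, with straight apostrophes.
SYNONYMS = {
    "pax6": "Pax6",
    "dickie's small eye": "Pax6",
    "gleeful": "Gfl",
    "gfl": "Gfl",
    "p53": "TP53",
    "tp53": "TP53",
}


def normalize(name, synonyms=SYNONYMS):
    """Map a free-text gene name to its canonical symbol, or None if unknown."""
    key = " ".join(name.replace("\u2019", "'").lower().split())
    return synonyms.get(key)
```

With this table, `normalize("Dickie’s  Small Eye")` and `normalize("PAX6")` resolve to the same symbol, while an unlisted name returns `None` so that it can be flagged for curation instead of being guessed.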
Rose is a rose is a rose is a rose….. Not Really!
What is a cell?
• any small compartment
• (biology) the basic structural and functional unit of all
organisms; they may exist as independent units of life (as in
monads) or may form colonies or tissues as in higher plants
and animals
• a device that delivers an electric current as a result of
chemical reaction
• a small unit serving as part of or as the nucleus of a larger
political movement
• cellular telephone: a hand-held mobile radiotelephone for use
in an area divided into small sections, each with its own short-range transmitter/receiver
• small room in which a monk or nun lives
• a room where a prisoner is kept
Image Sources: Somewhere from the internet…
Semantic Groups, Types and Concepts:
• Semantic Group Biology – Semantic Type Cell
• Semantic Groups Object OR Devices – Semantic
Types Manufactured Device or Electrical Device
or Communication Device
• Semantic Group Organization – Semantic Type
Political Group
Foundation Model Explorer
The REAL Problems
Many disease states are complex, because of many genes
(alleles & ethnicity, gene families, etc.), environmental
effects (life style, exposure, etc.) and the interactions.
Hepatocellular Carcinoma: CTNNB1, MET, TP53
CTNNB1
1. COLORECTAL CANCER [3-BP DEL, SER45DEL]
2. COLORECTAL CANCER [SER33TYR]
3. PILOMATRICOMA, SOMATIC [SER33TYR]
4. HEPATOBLASTOMA, SOMATIC [THR41ALA]
5. DESMOID TUMOR, SOMATIC [THR41ALA]
6. PILOMATRICOMA, SOMATIC [ASP32GLY]
7. OVARIAN CARCINOMA, ENDOMETRIOID TYPE, SOMATIC [SER37CYS]
8. HEPATOCELLULAR CARCINOMA SOMATIC [SER45PHE]
9. HEPATOCELLULAR CARCINOMA SOMATIC [SER45PRO]
10. MEDULLOBLASTOMA, SOMATIC [SER33PHE]
TP53*
1. HEPATOCELLULAR CARCINOMA SOMATIC [ARG249SER]
Environmental Effects
aflatoxin B1, a mycotoxin, induces a very specific G-to-T
mutation at codon 249 in the tumor suppressor gene p53.
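The link between the G-to-T change and the ARG249SER label can be worked through with a codon lookup. This is a sketch under a stated assumption: codon 249 is taken to read AGG, which is how the hotspot is commonly reported but is not given on the slide. Only the four AGN codons are included, and `point_mutation_effect` is a hypothetical helper.

```python
# Partial codon table: only the AGN codons needed for this example.
CODONS = {"AGA": "Arg", "AGG": "Arg", "AGC": "Ser", "AGT": "Ser"}


def point_mutation_effect(codon, position, new_base, table=CODONS):
    """Return (old_aa, new_aa) for one base substitution at a 0-based position."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return table[codon], table[mutated]


# Assumed wild-type codon AGG; G-to-T at the third base gives AGT.
old_aa, new_aa = point_mutation_effect("AGG", 2, "T")
```

Under that assumption the substitution turns an arginine codon into a serine codon, which matches the ARG249SER variant listed for TP53.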
The REAL Problems
HEPATOCELLULAR CARCINOMA
LIVER:
•Hepatocellular carcinoma;
•Micronodular cirrhosis;
•Subacute progressive viral hepatitis
NEOPLASIA:
•Primary liver cancer
CTNNB1
1. ALK in cardiac myocytes
2. Cell to Cell Adhesion Signaling
3. Inactivation of Gsk3 by AKT causes accumulation
of b-catenin in Alveolar Macrophages
4. Multi-step Regulation of Transcription by Pitx2
5. Presenilin action in Notch and Wnt signaling
6. Trefoil Factors Initiate Mucosal Healing
7. WNT Signaling Pathway
MET
1. CBL mediated ligand-induced downregulation
of EGF receptors
2. Signaling of Hepatocyte Growth Factor
Receptor
TP53
1. Estrogen-responsive protein Efp
controls cell cycle and breast tumors
growth
2. ATM Signaling Pathway
3. BTG family proteins and cell cycle
regulation
4. Cell Cycle
5. RB Tumor Suppressor/Checkpoint
Signaling in response to DNA
damage
6. Regulation of transcriptional activity
by PML
7. Regulation of cell cycle progression
by Plk3
8. Hypoxia and p53 in the
Cardiovascular system
9. p53 Signaling Pathway
10. Apoptotic Signaling in Response to
DNA Damage
11. Role of BRCA1, BRCA2 and ATR in
Cancer Susceptibility….Many More…..
Methods for Integration
1. Link driven federations
• Explicit links between databanks.
2. Warehousing
• Data is downloaded, filtered, integrated
and stored in a warehouse. Answers to
queries are taken from the warehouse.
3. Others….. Semantic Web, etc………
Link-driven Federations
1. Creates explicit links between databanks
2. query: get interesting results and use web
links to reach related data in other
databanks
Examples: NCBI-Entrez, SRS
http://www.ncbi.nlm.nih.gov/Database/datamodel/
Link-driven Federations
1. Advantages
• complex queries
• Fast
2. Disadvantages
• require good knowledge
• syntax based
• terminology problem not solved
Data Warehousing
Data is downloaded, filtered, integrated and
stored in a warehouse. Answers to queries are
taken from the warehouse.
Advantages
1. Good for very-specific,
task-based queries and
studies.
2. Since it is custom-built
and usually expert-curated, relatively less
error-prone.
Disadvantages
1. Can become quickly
outdated – needs
constant updates.
2. Limited functionality –
For e.g., one disease-based or one system-based.
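The warehousing idea can be sketched with Python's built-in `sqlite3`: copy rows from each source into local tables once, then answer queries from the local copy. The table layout and the function names `build_warehouse` and `pathways_for_disease` are assumptions for illustration; the rows are a small subset of the gene, disease, and pathway names shown on the earlier slides.

```python
import sqlite3


def build_warehouse(gene_disease, gene_pathway):
    """Load two downloaded source tables into one in-memory SQLite warehouse."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE gene_disease (gene TEXT, disease TEXT)")
    db.execute("CREATE TABLE gene_pathway (gene TEXT, pathway TEXT)")
    db.executemany("INSERT INTO gene_disease VALUES (?, ?)", gene_disease)
    db.executemany("INSERT INTO gene_pathway VALUES (?, ?)", gene_pathway)
    return db


def pathways_for_disease(db, disease):
    """Return the sorted pathways of all genes linked to `disease`."""
    rows = db.execute(
        "SELECT DISTINCT p.pathway FROM gene_disease d "
        "JOIN gene_pathway p ON p.gene = d.gene "
        "WHERE d.disease = ? ORDER BY p.pathway",
        (disease,),
    )
    return [row[0] for row in rows]


# Rows as they might arrive from a disease source and a pathway source.
gene_disease = [
    ("CTNNB1", "Hepatocellular carcinoma"),
    ("MET", "Hepatocellular carcinoma"),
    ("TP53", "Hepatocellular carcinoma"),
]
gene_pathway = [
    ("CTNNB1", "WNT Signaling Pathway"),
    ("MET", "Signaling of Hepatocyte Growth Factor Receptor"),
    ("TP53", "p53 Signaling Pathway"),
    ("TP53", "Cell Cycle"),
]
db = build_warehouse(gene_disease, gene_pathway)
```

Calling `pathways_for_disease(db, "Hepatocellular carcinoma")` answers a cross-source question with one join. The snapshot is only as fresh as the last load, which is the "quickly outdated" disadvantage listed above.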
Algorithms in Bioinformatics
1. Finding similarities among strings
2. Detecting certain patterns within strings
3. Finding similarities among parts of spatial
structures (e.g. motifs)
4. Constructing trees
• Phylogenetic or taxonomic trees:
evolution of an organism
• Ontologies – structured/hierarchical
representation of knowledge
5. Classifying new data according to
previously clustered sets of annotated
data
Algorithms in Bioinformatics
6. Reasoning about microarray data and the
corresponding behavior of pathways
7. Predictions of deleterious effects of
changes in DNA sequences
8. Computational linguistics: NLP/Text-mining. Published literature or patient
records
9. Graph Theory – Gene regulatory
networks, functional networks, etc.
10. Visualization and GUIs (networks,
application front ends, etc.)
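Item 1, finding similarities among strings, is often introduced through edit distance. The sketch below is the textbook dynamic-programming (Levenshtein) formulation with unit costs; it is shown as one simple instance of string similarity, not as the scoring scheme used by any particular sequence-alignment tool.

```python
def edit_distance(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]
```

For two sequences of lengths m and n this takes O(mn) time and O(n) memory. Biological alignment usually replaces the unit costs with substitution matrices and gap penalties.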
Disease Gene Identification and
Prioritization
Hypothesis: Majority of genes that impact or
cause disease share membership in any of several
functional relationships OR Functionally similar or
related genes cause similar phenotype.
Functional Similarity – Common/shared
•Gene Ontology term
•Pathway
•Phenotype
•Chromosomal location
•Expression
•Cis regulatory elements (Transcription factor binding sites)
•miRNA regulators
•Interactions
•Other features…..
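One simple way to turn "shared features" into a ranking is to score each candidate by the overlap between its features and the pooled features of the known disease genes. The sketch below uses Jaccard overlap purely as an illustration; it is an assumption and not the scoring method of any specific tool, and the feature labels and function names are hypothetical.

```python
def jaccard(a, b):
    """Return |a ∩ b| / |a ∪ b|, or 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(training, candidates):
    """Rank candidate genes by feature overlap with the pooled training genes.

    Both arguments map gene -> set of feature labels (GO terms, pathways, ...).
    Ties are broken alphabetically so the output is deterministic.
    """
    pooled = set().union(*training.values())
    scored = [(jaccard(feats, pooled), gene) for gene, feats in candidates.items()]
    return [gene for score, gene in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

A candidate that shares a pathway and a Gene Ontology term with the training genes ranks above one that shares only a term, and both rank above a candidate with no overlap.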
Background, Problems & Issues
1. Most of the common diseases are multifactorial and modified by genetically and
mechanistically complex polygenic
interactions and environmental factors.
2. High-throughput genome-wide studies, like
linkage analysis and gene expression
profiling, tend to be most useful for
classification and characterization but do
not provide sufficient information to
identify or prioritize specific disease causal
genes.
Background, Problems & Issues
3. Since multiple genes are associated with
same or similar disease phenotypes, it is
reasonable to expect the underlying genes
to be functionally related.
4. Such functional relatedness (common
pathway, interaction, biological process,
etc.) can be exploited to aid in the finding
of novel disease genes. For e.g., genetically
heterogeneous hereditary diseases such as
Hermansky-Pudlak syndrome and Fanconi
anaemia have been shown to be caused by
mutations in different interacting proteins.
PPI - Predicting Disease Genes
1. Direct protein–protein interactions (PPI) are
one of the strongest manifestations of a
functional relation between genes.
2. Hypothesis: Interacting proteins lead to same
or similar disease phenotypes when mutated.
3. Several genetically heterogeneous hereditary
diseases are shown to be caused by mutations
in different interacting proteins. For e.g.
Hermansky-Pudlak syndrome and Fanconi
anaemia. Hence, protein–protein interactions
might in principle be used to identify
potentially interesting disease gene candidates.
Mining human interactome (HPRD, BioGrid)
[Figure: 7 Known Disease Genes; 66 Direct Interactants of Disease Genes; 778 Indirect Interactants of Disease Genes]
Which of these interactants are potential new candidates?
Prioritize candidate genes in the
interacting partners of the disease-related genes
• Training sets: disease related genes
• Test sets: interacting partners of the
training genes
Example: Breast cancer
OMIM genes (level 0): 15
Directly interacting genes (level 1): 342
Indirectly interacting genes (level 2): 2469!
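The level 0, level 1, and level 2 gene sets correspond to a breadth-first search outward from the known disease genes. A minimal sketch, assuming the interactome is available as an undirected adjacency dict; the toy network and the function name `interactant_levels` are hypothetical, and real data would come from resources such as HPRD or BioGrid.

```python
from collections import deque


def interactant_levels(ppi, seeds, max_level=2):
    """Breadth-first search from the seed genes.

    Returns {level: set of genes}, where level 0 holds the seeds, level 1
    their direct interactants, and level 2 the interactants of those.
    """
    level_of = {gene: 0 for gene in seeds}
    queue = deque(seeds)
    while queue:
        gene = queue.popleft()
        if level_of[gene] == max_level:
            continue  # do not expand beyond the requested depth
        for partner in ppi.get(gene, ()):
            if partner not in level_of:
                level_of[partner] = level_of[gene] + 1
                queue.append(partner)
    levels = {lvl: set() for lvl in range(max_level + 1)}
    for gene, lvl in level_of.items():
        levels[lvl].add(gene)
    return levels


# Toy undirected network: D1 is the only known disease gene.
ppi = {
    "D1": {"A", "B"},
    "A": {"D1", "C"},
    "B": {"D1"},
    "C": {"A", "E"},
    "E": {"C"},
}
levels = interactant_levels(ppi, ["D1"])
```

The candidate set grows quickly with each level, as the breast cancer counts show, which is why the interactants then need to be prioritized rather than simply listed.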
ToppGene – General Schema
the Ultimate Goal…….
Disease World (Medical Informatics)
Disease Database, Patient Records, Clinical Trials, PubMed, OMIM
→Name
→Synonyms
→Related/Similar Diseases
→Subtypes
→Etiology
→Predisposing Causes
→Pathogenesis
→Molecular Basis
→Population Genetics
→Clinical findings
→System(s) involved
→Lesions
→Diagnosis
→Prognosis
→Treatment
→Clinical Trials……
Bioinformatics
Genome, Regulome, Transcriptome, Proteome, Interactome,
Metabolome, Physiome, Pathome, Variome, Pharmacogenome
Computer Engineers
Personalized Medicine
►Decision Support System
►Outcome Predictor
►Course Predictor
►Diagnostic Test Selector
►Clinical Trials Design
►Hypothesis Generator…..
& YOU
“To him who devotes his life to science, nothing can give more happiness
than increasing the number of discoveries, but his cup of joy is full when the
results of his studies immediately find practical applications”
- Louis Pasteur
Thank You!
http://sbw.kgi.edu/