Overture
(Motivating bioinformatics)
Saurabh Sinha, UIUC
Special issue of the journal Science, July 1, 2005.
>What Is the Universe Made Of?
>What Is the Biological Basis of Consciousness?
>Why Do Humans Have So Few Genes?
>To What Extent Are Genetic Variation and Personal Health Linked?
>Can the Laws of Physics Be Unified?
>How Much Can Human Life Span Be Extended?
>What Controls Organ Regeneration?
>How Can a Skin Cell Become a Nerve Cell?
>How Does a Single Somatic Cell Become a Whole Plant?
>How Does Earth's Interior Work?
>Are We Alone in the Universe?
>How and Where Did Life on Earth Arise?
>What Determines Species Diversity?
>What Genetic Changes Made Us Uniquely Human?
>How Are Memories Stored and Retrieved?
>How Did Cooperative Behavior Evolve?
>How Will Big Pictures Emerge from a Sea of Biological Data?
>How Far Can We Push Chemical Self-Assembly?
>What Are the Limits of Conventional Computing?
>Can We Selectively Shut Off Immune Responses?
>Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
>Is an Effective HIV Vaccine Feasible?
>How Hot Will the Greenhouse World Be?
>What Can Replace Cheap Oil -- and When?
>Will Malthus Continue to Be Wrong?
Where does bioinformatics
come in ?
“Why do humans have so few
genes ?”
A simple organism
[Figure: Environmental signal, GENE, Response (protein)]
A simple organism
[Figure: GENE1 to GENE3]
A simple organism
[Figure: GENE1 to GENE10]
A complex organism
[Figure: GENE1 to GENE10, with a complex circuit of interactions]
Regulatory networks
• Genes are switches, transcription factors are
(one type of) input signals, proteins are
outputs
• Proteins (outputs) may be transcription
factors and hence become signals for other
genes (switches)
• This may be the reason why humans have so
few genes (the circuit, not the number of
switches, carries the complexity)
• Bioinformatics can unravel such networks,
given the genome (DNA sequence) and gene
activity information
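The switch analogy above can be sketched as a toy Boolean network. This is only an illustration: the gene names, the wiring, and the update rules are invented here and do not describe any real organism.

```python
# Toy Boolean regulatory network. Each gene is a switch whose next state
# depends on current signals: an environmental signal or the protein
# products of other genes. Gene names and rules are hypothetical.

def step(state, signal):
    """Compute the next on/off state of every gene from the current one."""
    return {
        "GENE1": signal,                                  # switched on by the environment
        "GENE2": state["GENE1"],                          # activated by GENE1's protein
        "GENE3": state["GENE1"] and not state["GENE2"],   # activated by GENE1, repressed by GENE2
    }

def simulate(signal, steps):
    """Run the network from an all-off state and record every state."""
    state = {"GENE1": False, "GENE2": False, "GENE3": False}
    history = [state]
    for _ in range(steps):
        state = step(state, signal)
        history.append(state)
    return history

if __name__ == "__main__":
    for t, s in enumerate(simulate(signal=True, steps=4)):
        print(t, s)
```

Even with three switches, this wiring makes GENE3 turn on briefly and then off again, a behavior that comes from the circuit rather than from the number of genes.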
Decoding the regulatory
network: Method 1
• Find patterns (“motifs”) in DNA
sequence that occur more often than
expected by chance
– These are likely to be binding sites for
transcription factors
– Knowing these can tell us if a gene is
regulated by a transcription factor (i.e., the
“switch”)
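A minimal sketch of "more often than expected by chance", assuming a very simple chance model in which each base is drawn independently with the frequency seen in the input. The function names and the example sequences are made up for illustration; real motif finders use richer background models and statistics.

```python
from collections import Counter

def kmer_enrichment(sequences, k):
    """Return {kmer: (observed, expected)} under an independent-base model."""
    # Background base frequencies estimated from the sequences themselves.
    base_counts = Counter("".join(sequences))
    total = sum(base_counts.values())
    freq = {b: c / total for b, c in base_counts.items()}

    observed = Counter()
    windows = 0
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            observed[seq[i:i + k]] += 1
            windows += 1

    result = {}
    for kmer, obs in observed.items():
        p = 1.0
        for b in kmer:
            p *= freq[b]                    # chance of this k-mer at one window
        result[kmer] = (obs, p * windows)   # expected count over all windows
    return result

def top_motifs(sequences, k, n=3):
    """Rank k-mers by how far the observed count exceeds the expected count."""
    table = kmer_enrichment(sequences, k)
    return sorted(table, key=lambda m: table[m][0] - table[m][1], reverse=True)[:n]

if __name__ == "__main__":
    seqs = ["TTGACGTT", "AAGACGAA", "CCGACGCC"]  # hypothetical promoters
    print(top_motifs(seqs, k=4))
```

Ranking by excess count is a deliberately crude choice; it is enough to surface a word shared by all three toy sequences.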
How to find motifs ?
• One method called “Gibbs sampling”
• A special kind of “Markov chain Monte
Carlo” sampling technique popular in
physics and computer science
• We shall see such an approach
(Lawrence et al. 1993) later in the
course
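To give a feel for the idea, here is a minimal sketch of a Gibbs-style site sampler. It is not the published algorithm: it assumes exactly one motif occurrence per sequence, a fixed motif width, a uniform background (0.25 per base), a pseudocount of 1, and input containing only A, C, G, T. All names are illustrative.

```python
import random

BASES = "ACGT"

def profile_without(seqs, positions, width, skip):
    """Per-column base probabilities from all current sites except `skip`."""
    counts = [{b: 1.0 for b in BASES} for _ in range(width)]  # pseudocount of 1
    for idx, (seq, pos) in enumerate(zip(seqs, positions)):
        if idx == skip:
            continue
        for j in range(width):
            counts[j][seq[pos + j]] += 1
    n = len(seqs) - 1 + len(BASES)  # sites used plus pseudocounts
    return [{b: col[b] / n for b in BASES} for col in counts]

def gibbs_motif_search(seqs, width, iterations=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build the motif model from every sequence except the current one.
            profile = profile_without(seqs, positions, width, skip=i)
            weights = []
            for start in range(len(seq) - width + 1):
                w = 1.0
                for j in range(width):
                    # Likelihood under the motif relative to the background.
                    w *= profile[j][seq[start + j]] / 0.25
                weights.append(w)
            # Sample a new start in proportion to its weight.
            positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions

if __name__ == "__main__":
    seqs = ["TTTGACGTAA", "CCGACGTCCA", "AAGGGACGTT"]  # hypothetical input
    print(gibbs_motif_search(seqs, width=5))
```

Because each new position is sampled rather than chosen greedily, the search can escape poor starting guesses; the result is still random and may differ between seeds.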
Decoding the regulatory
network: Method 2
More on method 2
• Paper from the journal Nature, Sep 2004.
– We shall see this later in the course
• Integrates multiple, heterogeneous sources
of information
– Multiple species conservation
• What is functional is probably conserved
• How to find “conserved” sequence ? (Next topic)
– “ChIP-on-chip” data
• Which transcription factors bind which sequences ? A high throughput, low resolution measurement
Common problem in bioinformatics: Integration of different types of information, in a principled manner
Sequence alignment
• The staple of a bioinformatician’s diet
• Are two sequences similar ? Are portions of
two sequences similar ? Which portions are
these ?
• Several algorithms that use dynamic
programming, a popular technique in
computer science
• Realistic versions of the problem are NP-hard; approximation algorithms abound.
• Can we handle sequences of length ~10^9 ?
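As a small illustration of the dynamic programming idea, the sketch below computes a global alignment score for two sequences in the style of the classic Needleman-Wunsch recurrence. The scoring values (match +1, mismatch -1, gap -1) are arbitrary choices made for this example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap     # gap in b
            left = score[i][j - 1] + gap   # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]

if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The table has one cell per pair of positions, so time and memory grow with the product of the two lengths. That is fine for genes, but it is the reason genome-scale sequences need something smarter.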
Sequence alignment
• Blanchette et al. (2004) in the journal
Genome Research
– We shall see this paper later in the course
• Efficient algorithm for genome-wide multiple
sequence alignment
• Uses the “divide and conquer” approach, as
well as “dynamic programming”
• Can help in reconstruction of “ancestral
genomes”
– Jurassic Park MMVI ? ! ? !
On counting genes
• The original question was “Why do
humans have so few genes?”
• How do we know how many genes
there are in the human genome ? (And
where they are in the genome)
• Experiments can be designed, but
bioinformatics plays a major role
Gene prediction
• The task of predicting the locations of genes
in a new genome (“annotation”)
• Several gene prediction programs exist
• We shall see one such approach (Lukashin &
Borodovsky 1998, in the journal Nucleic Acids
Research) later in the course.
• The more sophisticated ones use Hidden
Markov models (HMM) and multiple species
comparison
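To show what an HMM-based predictor does at its core, here is a minimal sketch of Viterbi decoding with two states, "coding" and "noncoding". The states and every probability below are invented for illustration (coding is simply assumed to be richer in G and C); real gene finders use far more detailed models. The input is assumed to be a non-empty string over A, C, G, T.

```python
import math

STATES = ("noncoding", "coding")
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def viterbi(seq):
    """Most probable state path for a non-empty sequence, in log space."""
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][base]))
        back.append(pointers)
        best = new_best
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

if __name__ == "__main__":
    print(viterbi("ATATATGCGCGCGCGCGCGCATATAT"))
```

The label assigned to each base is the "annotation"; a run of "coding" labels marks a predicted gene under this toy model.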
“What controls organ
regeneration ?”
“How does a single somatic
cell become a whole plant ?”
Regeneration is mysterious, but do we
understand the “generation” part of
“regeneration” ?
• Developmental biology
• The timeline from a single cell (with
genetic material from mother and father)
to a multicellular embryo, and to an
adult
• A paradox: all cells in the adult body
have the same DNA, so how come
different cells are different ?
How does a single cell lead to this ? …
… and to this ?
Drosophila
(fruit fly)
Answer: Regulatory networks
(Again !)
• Bioinformatics used to scan entire genome for regions
that participate in “segmenting” the embryo
• Hidden Markov models, a popular technique in signal
processing, used to detect such regions
– Rajewsky et al. (2002) in the journal “BMC Bioinformatics”
• Multiple species comparison aids discovery
– Evolutionary models to integrate multiple species information
in the HMM framework
– Sinha et al. (2004) in the journal “BMC Bioinformatics”
“How did cooperative behavior
evolve?”
A related question
• What is the genetic (molecular) basis of social
behavior ?
• Social behavior in honey bees
• Young worker bees are nurses in the hive;
older ones go out to forage
• This behavioral maturation is determined by
the needs of the colony
– Shortage of foragers => some nurses will become
foragers prematurely
– Bees respond to social cues.
– What is the genetic basis of this ?
Bioinformatics of social
behavior
• UIUC team (Sinha, Ling, Whitfield, Zhai,
Robinson) scanned the genome to
understand this
– Attend the BIO seminar to learn about this
• Regulatory network of social behavior
• Statistical tools, such as the hypergeometric test,
partial correlation analysis, threshold-based
classification, support vector machine
classifiers, etc. were used for this project
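As one example of these tools, here is a minimal sketch of a one-sided hypergeometric test. The scenario in the comments (genes carrying a motif within a set of behavior-related genes) and all the numbers are hypothetical; the slides do not say how the test was applied in the project.

```python
from math import comb

def hypergeom_pvalue(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement.

    population: total number of genes
    successes:  genes in the population that carry the motif
    draws:      size of the gene set of interest
    observed:   genes in that set that carry the motif
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total

if __name__ == "__main__":
    # Hypothetical numbers: 10000 genes, 500 with the motif,
    # 100 behavior-related genes, 15 of which have the motif.
    print(hypergeom_pvalue(10000, 500, 100, 15))
```

A small value suggests the motif appears in the gene set more often than random sampling would explain.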
“How will big pictures emerge
from a sea of biological data?”
The sea
• Genomes: 3 x 10^9 bp of human genome
• Similar numbers for other genomes: mouse, rat,
dog, chicken, chimp etc.
• Microarray: snapshots of 1000s of genes’
activities at one time and condition. Thousands
of microarrays.
• ChIP-on-chip data: measurements of a
transcription factor’s binding affinity for 1000s of
genes (promoters).
Big pictures
• An example: Segal et al. (2004) in the journal
Nature Genetics
– We shall see this work later in the course
• “A module map showing conditional activity of
expression modules in cancer” (Segal et al.)
• Integrates sequence, motif, and microarray
data through statistical tests to predict
involvement of particular genes in particular
types of cancer
The sea of biological data
• Biological literature, capturing decades of
painstaking experimental work on genetics
and molecular biology
• Can we glean useful information from this
vast body of knowledge ?
• Biological literature mining.
– Natural language processing
– Text Information Retrieval (statistical approaches)
• Ling et al. (2005) in Pacific Symposium on
Biocomputing. On “Gene summarization”
– We shall see this paper later in the course
Some other challenges
• Protein structure prediction
• Can we predict the 3-D structure of a protein
from its amino acid sequence ?
– Why ?
– One good reason: structure gives clues about
function. If we can tell the structure, we can
perhaps tell the function
– We can design amino acid sequences that will fold
into proteins that do what we want them to do.
Drug design !!
• One approach: neural networks, a popular
technique in computer science, applied to this
problem (Jones, 1999 in the J. of Mol Bio.)
Some other challenges
• “Metagenomics”
• Most studies to date are on genomes of one
species
• A sample from the soil contains hundreds of
bacteria, thousands of viruses. Can we study
all of these ?
• Bioinformatics is indispensable !!
• New type of data, new types of algorithms
Many more challenges
• New types of data arise from
technological breakthroughs in biology
• High throughput data carries an
unprecedented amount of information
• Too much noise
• Bioinformatics helps separate the signal
from the noise
Bioinformatics
• Is not about one problem (e.g., designing
better computer chips, better compilers,
better graphics, better networks, better
operating systems, etc.)
• Is about a family of very different problems,
all related to biology, all related to each other
• How can computers help solve any of this
family of problems ?
Bioinformatics and You
• You can learn the tools of bioinformatics
• These tools owe their origin to computer
science, information theory, probability
theory, statistics, etc.
• You can learn the language of biology,
enough to understand what the problems are
• You can apply the tools to these problems
and contribute to science