The World of Microbes on the Internet


Bioinformatics
Genomic Biology
as a Quantitative Science
Stuart M. Brown, Ph.D.
Director, Research Computing, NYU School of Medicine
A Genome Revolution is underway
in Biology and Medicine



We are in the midst of a "Golden Era" of
biology
The Human Genome Project has produced a
huge storehouse of data that will be used to
change every aspect of biological research
and medicine
The revolution is about treating biology as an
information science, not about specific
technologies.
The Human Genome Project
The job of the biologist is changing
As more biological information
becomes available and
laboratory equipment becomes
more automated ...
– The biologist will spend more time using computers
& on experimental design and data analysis
(and less time doing tedious lab biochemistry)
– Biology will become a more quantitative science
(think how the periodic table affected chemistry)
Biological Information
Protein 2-D gel
mRNA Expression
Protein 3-D Structure
Mass Spec.
Genome sequence
The Cell
A review of some basic genetics
DNA

4 bases (G, C, T, A)

base pairs
G--C
T--A

genes

non-coding regions
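The base-pairing rules above (G with C, T with A) are enough to compute the opposite strand of any DNA sequence. A minimal sketch, assuming plain uppercase or lowercase strings of the four bases; the function name is illustrative, not taken from any particular toolkit:

```python
# Watson-Crick pairing from the slide: G pairs with C, T pairs with A.
COMPLEMENT = {"G": "C", "C": "G", "T": "A", "A": "T"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("GATTACA"))  # TGTAATC
```

Reversing before complementing reflects the fact that the two strands run in opposite directions.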
Decoding Genes
Classic Molecular Biology


A gene is a DNA sequence at a particular
locus on a chromosome that encodes a protein.
The Central Dogma of Molecular Biology:
DNA --> RNA --> Protein


A mutation changes the DNA sequence, which can
lead to a change in protein sequence or to no
protein at all.
Alleles are slightly different DNA sequences
of the same gene.
The human genome is the complete DNA
content of the 23 pairs of human chromosomes
- 44 autosomes plus two sex chromosomes
- approximately 3.2 billion base pairs.
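The Central Dogma and the effect of a mutation can be sketched in a few lines of code. The example below assumes the input is the coding (sense) strand, uses the standard genetic code, and works on a made-up toy sequence; the function names are illustrative only:

```python
from itertools import product

# Standard genetic code, written in the conventional U, C, A, G codon order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """DNA -> RNA: the mRNA matches the coding strand with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """RNA -> protein: read codons from position 0 until a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequences (not a real gene): one normal allele and two mutant alleles.
normal = "ATGCCTGAGGAGAAGTAA"
missense = "ATGCCTGTGGAGAAGTAA"   # GAG -> GTG changes one amino acid
nonsense = "ATGCCTTAGGAGAAGTAA"   # GAG -> TAG creates an early stop codon

for name, dna in [("normal", normal), ("missense", missense), ("nonsense", nonsense)]:
    print(name, translate(transcribe(dna)))
```

The three alleles differ by a single base each, yet one yields an altered protein and the other a truncated one.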
Bold Words from Francis Collins:
“The history of biology was forever altered a
decade ago by the bold decision to launch a
research program that would characterize in
ultimate detail the complete set of genetic
instructions of the human being.”
Francis S. Collins
Director of the National Human Genome Research
Institute
N Engl J Med 1999; 341:28-37
Genome Projects



Complete genomic sequences:
– Dozens of microorganisms
– Yeast, C. elegans, Drosophila
– Mouse
– Human
Comparative genomics
All this data is enabling new kinds of research for those with the computational skills to take
advantage of it.
How does genome
sequencing technology
work?




Molecular biology of the Sanger method
Sub-cloning of fragments - BAC, PAC,
cosmid, plasmid, phage
Automated sequencers
The need for computers to assemble the
"reads" and manage the workflow
Automated sequencing machines,
particularly those made by PE Applied
Biosystems, use 4 colors, so they can
read all 4 bases at once.
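To give a feel for why computers are needed to assemble the reads, here is a toy greedy assembler that repeatedly merges the two reads with the longest suffix-to-prefix overlap. It is only a sketch under strong assumptions (error-free reads, a single strand, no repeats); production assemblers are far more elaborate, and the function names are hypothetical:

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # remaining reads do not overlap; leave them as contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three made-up overlapping reads from one short sequence.
contigs = greedy_assemble(["ATGGCGTAC", "CGTACGTTA", "CGTTAGC"])
print(contigs)  # ['ATGGCGTACGTTAGC']
```

Every round compares all pairs of reads, which is one reason real projects with millions of reads need indexing tricks and serious hardware.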
Raw Genome Data:
Lots of Sequence Data

How to extract useful knowledge from all
of this data?

Need sophisticated computer tools
– Find the genes
– Figure out what they do (function)
– Diagnostic tests
– Medical treatments
Finding genes in genome
sequence is not easy

About 1% of human DNA encodes
functional genes.

Genes are interspersed among long stretches
of non-coding DNA.

Repeats, pseudo-genes, and introns
confound matters
Gene prediction tools - look for Start and
Stop codons, intron splice sites,
similarity to known genes and cDNAs,
etc.
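The simplest of the signals listed above, Start and Stop codons, can be scanned for directly. The sketch below finds open reading frames (ATG through the next in-frame stop) on the forward strand only; it deliberately ignores introns, splice sites, the reverse strand, and similarity evidence, all of which real gene predictors rely on. The function name and the `min_codons` parameter are illustrative assumptions:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Return (start, end, frame) for each ATG...stop stretch, forward strand only.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


toy = "CCATGAAACCCGGGTAGTT"  # made-up sequence with one short ORF
for start, end, frame in find_orfs(toy):
    print(frame, toy[start:end])  # 2 ATGAAACCCGGGTAG
```

On real human DNA such a scan produces many false hits and misses genes split by introns, which is why prediction is described above as not easy.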
Data Mining Tools

Scientists need to work with a lot of layers of
information about the genome
– coding sequence of known genes and cDNAs
– genetic maps (known mutations and markers)
– gene expression
– protein sequence (from mass spectrometry)
– cross-species homology
Most of the best tools are free on the Web
UCSC
Ensembl at EBI/EMBL
What comes after Genome
Sequencing?



We are now in the "Post-Genomic" era.
It is possible to use the genome sequence
plus a variety of automated laboratory
equipment to do entirely new kinds of
biology.
Not just scaled-up, but comprehensive
Relate genes to Organisms

Diseases
– OMIM: Human Genetic Disease

Metabolic and regulatory pathways
– KEGG
– Cancer Genome Project
Human Alleles

The OMIM (Online Mendelian Inheritance
in Man) database at the NCBI tracks all human
mutations with known phenotypes.

It contains a total of about 2,000 genetic
diseases [and another ~11,000 genetic loci with
known phenotypes - but not necessarily known gene
sequences]

It is designed for use by physicians:
– can search by disease name
– contains summaries from clinical studies
KEGG: Kyoto Encyclopedia of
Genes and Genomes



Enzymatic and regulatory pathways
Mapped out by EC number and cross-referenced to genes in all known organisms
(wherever sequence information exists)
Parallel maps of regulatory pathways
Genomics

What is Genomics?
– An operational definition:
• The application of high throughput automated
technologies to molecular biology.
– A philosophical definition:
• A holistic or systems approach to the study of
information flow within a cell.
Genomics Technologies



Automated DNA sequencing
Automated annotation of sequences
DNA microarrays
– gene expression (measure RNA levels)

SNP Genotyping
– Genome diagnostics (genetic testing)

Proteomics
– Protein identification
– Protein-protein interactions
DNA chip microarrays




Put a large number (~100K) of cDNA sequences or
synthetic DNA oligomers onto a glass slide (or other
substrate) in known locations on a grid.
Label an RNA sample and hybridize
Measure amounts of RNA bound to each square in
the grid
Make comparisons
– Cancerous vs. normal tissue
– Treated vs. untreated
– Time course

Many applications in both basic and clinical research
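The "make comparisons" step usually comes down to a ratio of signal between two samples for each spot on the grid. A minimal sketch, assuming made-up intensities, hypothetical gene names, and an arbitrary two-fold cutoff; the pseudocount is one common way to avoid dividing by zero, not a prescribed method:

```python
import math


def log2_ratios(sample, reference, pseudocount=1.0):
    """log2(sample / reference) per gene; the pseudocount guards against zeros."""
    return {
        gene: math.log2((sample[gene] + pseudocount) / (reference[gene] + pseudocount))
        for gene in sample
        if gene in reference
    }


def changed_genes(ratios, threshold=1.0):
    """Genes whose expression differs by at least 2**threshold fold either way."""
    return sorted(gene for gene, r in ratios.items() if abs(r) >= threshold)


# Invented spot intensities for three hypothetical genes.
tumor = {"geneA": 799, "geneB": 99, "geneC": 49}
normal = {"geneA": 99, "geneB": 99, "geneC": 399}

ratios = log2_ratios(tumor, normal)
print(ratios)
print(changed_genes(ratios))
```

Working in log2 makes an eight-fold increase and an eight-fold decrease symmetric (+3 and -3). Real analyses add normalization and replicate-based statistics on top of this.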
Spot your own Chip
(plans available for free from Pat Brown’s website)
Robot spotter
Ordinary glass
microscope slide
cDNA spotted microarrays
Goal of Microarray experiments

Microarrays are a very good way of
identifying a bunch of genes involved in a
disease process
– Differences between cancer and normal tissue
– Tuberculosis infected vs resistant lung cells

Mapping out a pathway
– Co-regulated genes

Finding function for unknown genes
– Involved in these processes
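Co-regulated genes are often found by comparing expression profiles across a set of conditions or time points: genes whose profiles rise and fall together are candidates for the same pathway. A minimal sketch using Pearson correlation, with invented profiles, hypothetical gene names, and an arbitrary cutoff of 0.9:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def coregulated(profiles, query, min_r=0.9):
    """Genes whose profile correlates with the query gene at or above min_r."""
    return sorted(
        gene
        for gene, profile in profiles.items()
        if gene != query and pearson(profiles[query], profile) >= min_r
    )


# Invented expression levels at four time points.
profiles = {
    "known_gene": [1, 2, 4, 8],
    "unknown1": [2, 4, 8, 16],
    "unknown2": [8, 4, 2, 1],
    "unknown3": [5, 5, 6, 5],
}
print(coregulated(profiles, "known_gene"))  # ['unknown1']
```

Here unknown1 tracks the known gene, which suggests (but does not prove) a shared function.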
Direct Medical Applications

Diagnosis
– Type of cancer
– Aggressive or benign?

Monitor treatment outcome
– Is a treatment having the desired effect on the
target tissue?
When you go looking…
…you will certainly find something!
Human Genetic Variation


Every human has essentially the same set of genes
But there are different forms of each gene -- known as
alleles
– blue vs. brown eyes
– genetic diseases such as cystic fibrosis or Huntington’s
disease are caused by dysfunctional alleles
Alleles are created by mutations in the
DNA sequence of one person - which
are passed on to their descendants
Clinical Manifestations
of Genetic Variation
(All disease has a genetic component)



Susceptibility vs. resistance
Variations in disease severity or symptoms
Reaction to drugs (pharmacogenetics)
All of these traits can be traced back to
particular genes (or sets of genes)
Pharmacogenomics

People react differently to drugs
– Side effects
– Variable effectiveness

There are genes that control these
reactions

SNP markers can be used to identify
these genes (profiles)
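One simple way a SNP marker can be linked to drug response is to compare allele counts between patients who respond and patients who do not. The sketch below computes the Pearson chi-square statistic for a 2x2 table (no continuity correction); the counts are invented, and real studies must also correct for testing many SNPs at once:

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Invented allele counts at one SNP.
#                 allele A   allele G
# responders          80         20
# non-responders      50         50
statistic = chi_square_2x2(80, 20, 50, 50)
print(round(statistic, 2))

# 3.841 is the 5% critical value for chi-square with 1 degree of freedom.
print("associated with response" if statistic > 3.841 else "no evidence of association")
```

A marker that passes such a test points to a nearby region of the genome worth examining, not necessarily to the causal gene itself.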
Use the Profiles

Genetic profiles of new patients can then be
used to prescribe drugs more effectively &
avoid adverse reactions.
– Sell a drug with a gene test

Can also speed clinical trials by testing on
those who are likely to respond well.
Toxicogenomics

There are a number of common pathways
for drug toxicity (or environmental toxicity)

It is possible to compile genomic signatures
(gene expression data) for these pathways.

Candidate drug molecules can be screened
in cell culture or in animals for induction of
these toxicity pathways.
Planning for a Genomics Revolution

Bioinformatics support must be integral in the
planning process for the development of new
genomics research facilities.

Genome Project sequencing centers have more
staff and more $$$ spent on data analysis than on
the sequencing itself.

Microarray facilities will be even more skewed
toward data analysis
It is an information-intensive business!

Implications for Biomedicine

Physicians will use genetic information to
diagnose and treat disease.
» Virtually all medical conditions have a
genetic component.

Faster drug development research
» Individualized drugs
» Gene therapy

All Biologists will use gene sequence
information in their daily work
Training "computer savvy"
scientists

Know the right tool for the job

Get the job done with tools available

Network connection is the lifeline of the
scientist

Jobs change, computers change, projects
change, scientists need to be adaptable
Long Term Implications

A "periodic table for biology" will lead to
an explosion of research and discoveries - we will
finally have the tools to start making systematic
analyses of biological processes (quantitative biology).

Understanding the genome will lead to
the ability to change it - to modify the
characteristics of organisms and people in
a wide variety of ways

Genomics Education



Genomics scientists need basic training in
both Molecular Biology and Computing
Specific training in the use of automated
laboratory equipment, the analysis of large
datasets, and bioinformatics algorithms
Particularly important for the training of
medical doctors - at least a familiarity with
the technology
Genomics in Medical Education
“The explosion of information about the new
genetics will create a huge problem in
health education. Most physicians in
practice have had not a single hour of
education in genetics and are going to be
severely challenged to pick up this new
technology and run with it.”
Francis Collins
Bioinformatics: A Biologist's Guide to Biocomputing and the Internet
Stuart M. Brown, Ph.D.
[email protected]
www.med.nyu/rcr