NYU_Lec1 - NDSU Computer Science

Bioinformatics
Stuart M. Brown, Ph.D.
NYU School of Medicine
What is Bioinformatics
• The use of computers to collect,
analyze, and interpret biological
information at the molecular level.
"The mathematical, statistical and computing methods that
aim to solve biological problems using DNA and amino
acid sequences and related information."
• A set of software tools for molecular
sequence analysis
Introduction
• The Human Genome Project
• Challenges of Molecular Biology
computing
• The changing role of the Biologist in
the Age of Information
• Bioinformatics software
• Genomics
• Impact on medicine
I. The Human Genome Project
The genome sequence is complete - almost!
– approximately 3.2 billion base pairs.
All the Genes
• Any human gene can now be found in the
genome by similarity searching with over
99% certainty.
• However, the sequence still has many
gaps
– hard to find an uninterrupted genomic
segment for any gene
– still can’t identify pseudogenes with certainty
• This will improve as more sequence data
accumulates
Raw Genome Data:
The next step is obviously to locate all of
the genes and describe their functions.
This will probably take another 15-20
years!
Celera says that there are
only ~34,000 genes
– so why are there ~60,000 human genes on
Affymetrix GeneChips?
– Why does GenBank have 49,000 human
gene coding sequences and UniGene have
96,000 clusters of unique human ESTs?
• Clearly we are in desperate need of a
theoretical framework to go with all of this
data
Implications for Biomedicine
• Physicians will use genetic information
to diagnose and treat disease.
• Virtually all medical conditions have a
genetic component.
• Faster drug development research
• Individualized drugs
• Gene therapy
• All Biologists will use gene sequence
information in their daily work
II. Bioinformatics Challenges
The huge dataset
• Lots of new sequences being added
- automated sequencers
- Human Genome Project
- EST sequencing
• GenBank has over 16 billion bases and is doubling
every year!!
(problem of exponential growth...)
• How can computers keep up?
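The "problem of exponential growth" can be made concrete with a little arithmetic. The sketch below simply extrapolates the figures quoted above (16 billion bases, doubling once a year). It assumes the doubling rate stays constant, which is an illustration rather than a forecast, and the function name `projected_bases` is hypothetical.

```python
def projected_bases(start_bases: int, years: int, doublings_per_year: float = 1.0) -> float:
    """Database size after `years` of steady doubling (illustrative extrapolation only)."""
    return start_bases * 2 ** (doublings_per_year * years)


if __name__ == "__main__":
    start = 16_000_000_000  # "over 16 billion bases", as quoted on the slide
    for year in range(0, 11, 2):
        print(f"year +{year:2d}: {projected_bases(start, year):.2e} bases")
```

Ten doublings multiply the database by 1024, so under this assumption search tools must cope with roughly a thousandfold more data within a decade.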
New Types of Biological Data
• Microarrays - gene expression
• Multi-level maps: genetic, physical,
sequence, annotation
• Networks of Protein-protein interactions
• Cross-species relationships
• Homologous genes
• Chromosome organization
Similarity Searching the Databanks
• What is similar to my sequence?
• Searching gets harder as the
databases get bigger - and quality
degrades
• Tools: BLAST and FASTA = time-saving
heuristics (approximate)
• Statistics + informed judgement of
the biologist
>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus

Query: 17  aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
           |||||||||||||||| | ||| | ||| || ||| | ||||  ||||| |||||||||
Sbjct: 1   aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

Query: 77  agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
           |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60  agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
            |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
           ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
           || || ||||| ||  ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
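The "Identities" and "Gaps" lines in the report above are simple column counts over the aligned sequences. As a minimal sketch (not BLAST's own code; the function name `alignment_stats` is hypothetical), the counts for the first 60-column block of this alignment can be reproduced like this:

```python
from typing import Tuple


def alignment_stats(query: str, subject: str) -> Tuple[int, int, int]:
    """Return (identities, gap_columns, alignment_length) for two aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    # A column is an identity when both residues match and neither is a gap.
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    # A column is a gap column when either sequence has '-' there.
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identities, gaps, len(query)


if __name__ == "__main__":
    # First block of the alignment above (Query 17-76, Sbjct 1-59).
    query = "aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca"
    subject = "aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc"
    ident, gaps, length = alignment_stats(query, subject)
    print(f"Identities = {ident}/{length} ({100 * ident / length:.0f}%), "
          f"Gaps = {gaps}/{length}")
```

For this block alone the function counts 48 identities and 1 gap in 60 columns. Summing the same counts over all five blocks gives the 258/297 reported in the header.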
Alignment
• Alignment is the basis for finding similarity
• Pairwise alignment = dynamic
programming
• Multiple alignment: protein families and
functional domains
• Multiple alignment is "impossible" for lots
of sequences
• Another heuristic - progressive pairwise
alignment
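To show what "pairwise alignment = dynamic programming" means in practice, here is a minimal sketch of global alignment in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration. Real tools use substitution matrices and more refined gap penalties.

```python
from typing import List, Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> Tuple[int, str, str]:
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score: List[List[int]] = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell holds the best score for aligning a[:i] with b[:j].
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a: List[str] = []
    out_b: List[str] = []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```

The table has (len(a)+1) x (len(b)+1) cells, so time and memory grow with the product of the two lengths. That is manageable for a pair of sequences, but the same exact approach grows exponentially with the number of sequences, which is why multiple alignment falls back on heuristics such as progressive pairwise alignment.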
Sample Multiple Alignment
Structure-Function Relationships
• Can we predict the function of protein
molecules from their sequence?
sequence > structure > function
• Conserved functional domains = motifs
• Prediction of some simple 3-D
structures (α-helix, β-sheet, membrane-
spanning, etc.)
Protein domains
(from ProDom database)
DNA Sequencing
• Automated sequencers > 40 kb per day
• 500 bp reads must be assembled into
complete genes
- errors, especially insertions and deletions
- error rate is highest at the ends where we want to
overlap the reads
- vector sequences must be removed from ends
• Faster sequencing relies on better
software
- overlapping deletions vs. shotgun approaches: TIGR
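The assembly step above comes down to finding where the end of one read overlaps the start of another. The sketch below is a deliberately simplified illustration that assumes exact, error-free overlaps. Real assemblers must tolerate the insertions, deletions and low-quality read ends listed above, and must trim vector sequence first. The function names are hypothetical.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that exactly equals a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0  # no overlap of at least min_overlap bases


def merge_reads(left: str, right: str, min_overlap: int = 3) -> str:
    """Join two reads into one contig using their exact suffix/prefix overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        raise ValueError("reads do not overlap by at least min_overlap bases")
    return left + right[size:]


if __name__ == "__main__":
    read_1 = "ACGTTGCA"  # toy reads; real reads are ~500 bp
    read_2 = "TGCATTAG"
    print(overlap_length(read_1, read_2))
    print(merge_reads(read_1, read_2))
```

Because sequencing errors are most frequent at read ends, exact matching like this would miss many true overlaps in practice. Production software scores approximate overlaps instead.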
Finding Genes in genome
Sequence is Not Easy
• About 2% of human DNA encodes
functional genes.
• Genes are interspersed among long
stretches of non-coding DNA.
• Repeats, pseudo-genes, and introns
confound matters
Pattern Finding Tools
• It is possible to use DNA sequence patterns
to predict genes:
– promoters
– translational start and stop codons (ORFs)
– intron splice sites
– codon bias
• Can also use similarity to known genes/ESTs
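Of the patterns listed above, open reading frames are the simplest to search for. The sketch below scans the three forward reading frames for an ATG followed in frame by a stop codon. It is a toy illustration under stated assumptions: it ignores the reverse strand, introns, promoters and codon bias, and the `min_codons` threshold (which counts the stop codon) is an arbitrary choice.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> List[Tuple[int, int, str]]:
    """Return (start, end, sequence) for forward-strand ORFs.

    Coordinates are 0-based with an exclusive end; each ORF runs from an ATG
    to the first in-frame stop codon, inclusive.
    """
    dna = dna.upper()
    orfs: List[Tuple[int, int, str]] = []
    for frame in range(3):
        start = None  # position of the current open ATG, if any
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

In human DNA, where introns interrupt most coding regions, an ORF scan like this is only one signal among several. Gene finders combine it with splice-site, promoter and codon-usage evidence, and with similarity to known genes and ESTs.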
Phylogenetics
• Evolution = mutation of DNA (and
protein) sequences
• Can we define evolutionary relationships
between organisms by comparing DNA
sequences?
- is there one molecular clock?
- phenetic vs. cladistic approaches
- lots of methods and software, what is the
"correct" analysis?
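One common starting point for comparing DNA sequences is a pairwise distance. The sketch below computes the proportion of differing sites (the p-distance) between two aligned sequences and applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). The slides do not name a particular model, so Jukes-Cantor is used here only as an assumed example. It treats all substitutions as equally likely, which is one of many possible choices.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed p-distance."""
    if not 0 <= p < 0.75:
        raise ValueError("Jukes-Cantor distance is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    seq_a = "ACGTACGT"  # toy aligned sequences
    seq_b = "ACGTACGA"
    p = p_distance(seq_a, seq_b)
    print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the observed one, because it estimates substitutions hidden by repeated changes at the same site. Different correction models give different distances and can give different trees, which is part of why the "correct" analysis is hard to pin down.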
II. The Biologist in the
Age of Information
The Internet provides a wealth of
biological information
• can be overwhelming
- e-mail
- USENET
- Web
• Info skill = finding the information that
you need efficiently
Computing in the lab - everyday
tasks (not computational biology)
• ordering supplies
• online reference books
• lab notes
• literature searching
Training "computer savvy" scientists
• Know the right tool for the job
• Get the job done with tools available
• Network connection is the lifeline of
the scientist
• Jobs change, computers change,
projects change, scientists need to be
adaptable
The job of the biologist is changing
• As more biological information becomes
available …
– The biologist will spend more time using
computers
– The biologist will spend more time on data
analysis (and less doing lab biochemistry)
– Biology will become a more quantitative science
(think how the periodic table and atomic theory
affected chemistry)
III. Molecular Biology
Software Tools
GCG (Wisconsin Package)
• The most popular and most
comprehensive set of tools for the
molecular biologist.
- Runs on mainframe computers (UNIX)
- Web, X-Windows (SeqLab) interfaces
- Inexpensive for large numbers of users
- Requires local databases (on the mainframe
computer)
- Allows for custom databases and programming
The Web
• Many of the best tools are free over the Web
– BLAST
– ENTREZ/PUBMED
– Protein motifs databases
• Bioinformatics “service providers”
– DoubleTwist™, Celera, BioNavigator™
• Hodgepodge collection of other tools
– PCR primer design
– Pairwise and Multiple Alignment
Personal Computer Programs
• Macintosh and Windows applications
- Commercial: Vector NTI™, MacVector™, OMIGA™,
Sequencher™
- Freeware: Phylip, Fasta, Clustal, etc.
• Better graphics, easier to use
• Can't access very large databases or perform
demanding calculations
• Integration with web databases and computing
services
Putting it all together
• The current state of the art requires the
biologist to jump around from Web to
mainframe to personal computer
• The trend is for integration:
– Web + personal computer will replace text
interface to mainframe?
– Will the Web become the ultimate interface for
all computing?
The Role of the RCR
• Provide software (site licenses), computing
hardware, and databases
• Train scientists to use the software
– Courses
– Newsletter & e-mail updates
– Seminars
– One-on-one training
• Technical support (on our software!)
– Phone, e-mail, lab/office visits
• Consulting
– Recommendations, joint work, do it for you,
custom software development
IV. Genomics
• The application of high-throughput
automated technologies to molecular biology.
• The experimental study of complete
genomes.
Genomics Technologies
• Automated DNA sequencing
• Automated annotation of sequences
• DNA microarrays
– gene expression (measure RNA levels)
– single nucleotide polymorphisms (SNPs)
• Protein chips (SELDI, etc.)
• Protein-protein interactions
cDNA spotted microarrays
Affymetrix Gene Chips
Microarray Data Analysis
• Clustering and pattern detection
• Data mining and visualization
• Controls and normalization of results
• Statistical validation
• Linkage between gene expression data and gene
sequence/function/metabolic pathways databases
• Discovery of common sequences in co-regulated
genes
• Meta-studies using data from multiple experiments
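As one concrete instance of "normalization of results", two-colour spotted-array data are often expressed as log2 ratios and then centred so that the typical gene shows no change. The sketch below assumes per-gene intensities for the two channels (here called `red` and `green`) and uses global median centring. This is only one of several normalization schemes, and the gene names and values are made up for illustration.

```python
import math
from statistics import median
from typing import Dict


def normalized_log_ratios(red: Dict[str, float],
                          green: Dict[str, float]) -> Dict[str, float]:
    """log2(red/green) per gene, shifted so that the array-wide median is 0."""
    raw = {gene: math.log2(red[gene] / green[gene]) for gene in red}
    center = median(raw.values())
    return {gene: value - center for gene, value in raw.items()}


if __name__ == "__main__":
    # Hypothetical intensities for three genes on one array.
    red = {"geneA": 200.0, "geneB": 400.0, "geneC": 800.0}
    green = {"geneA": 100.0, "geneB": 100.0, "geneC": 100.0}
    for gene, ratio in normalized_log_ratios(red, green).items():
        print(gene, ratio)
```

Centring removes an overall bias between the two channels, so that after normalization a positive value means a gene is higher than typical in the red sample and a negative value means lower. Clustering and statistical validation are then applied to these normalized values.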
Pharmacogenomics
• The use of DNA sequence information to
measure and predict the reaction of
individuals to drugs.
• Personalized drugs
• Faster clinical trials
– Selected trial populations
• Fewer drug side effects
– Toxicogenomics
Impact on Bioinformatics
• Genomics produces high-throughput, high-quality data, and bioinformatics provides
the analysis and interpretation of these
massive data sets.
• It is impossible to separate genomics
laboratory technologies from the
computational tools required for data
analysis.
Genomics Software @ the RCR
• Affymetrix Gene Chip Analysis Suite
• GeneSpring
• Research Genetics Pathways (nylon filters)
• TIGR Spotfinder, ScanAlyze, Cluster
• Coming soon: a shared microarray database