Introduction to Bioinformatics
CPSC 265
What is bioinformatics?
Interface of biology and computer science
Analysis of proteins, genes and genomes
using computer algorithms and
computer databases
Genome informatics: making
sense of the billions of base pairs of DNA
that are sequenced by genomics projects.
Mostly, it’s about protein and DNA sequences
What do bioinformatics researchers do?
Process large data outputs from new technologies
Turn sequence data into whole-genome sequences
Interpret genome sequences in terms of genes
and their expression
Find genes that control crop and animal traits, disease, etc.
Model evolution in genomes and proteins
Model and predict 3D structures of proteins
Growth of GenBank, 1982 to 2002 (Fig. 2.1, Page 17)
[Figure: sequences (millions) and base pairs of DNA (billions), plotted by year]
Updated 8-12-04: >40 billion base pairs
Cost of sequencing is falling
exponentially
DNA sequence analysis
DNA sequences could be like those from our experiment last week,
or a lot bigger, like the whole human genome.
Some have chromatogram or “quality” data, some don’t.
DNA makes RNA makes protein
Hard to sequence RNA
Very hard to sequence protein
We can deduce RNA sequence from DNA
(In bacteria, this is as easy as turning Ts to Us.
In eukarya, we also need to figure out where the introns are.)
We can deduce protein sequence from RNA, using
the Universal Genetic Code
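A minimal sketch of the bacterial case, assuming the input is the coding (sense) strand written 5’ to 3’: the RNA is the same string with every T turned into a U. The function name `transcribe` is made up for illustration, and intron removal in eukarya is not handled.

```python
def transcribe(coding_dna: str) -> str:
    """Return the RNA for a coding-strand DNA string by replacing T with U.

    Assumes the input is the coding strand and contains no introns.
    """
    return coding_dna.upper().replace("T", "U")


if __name__ == "__main__":
    print(transcribe("CATCAATGACAACTCC"))
```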
Conceptual Translation
In a computer, take each set of three RNA letters, and then figure out what amino acid they code for.
Professional biologists use the SINGLE LETTER CODE
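Conceptual translation can be sketched in a few lines. This is an illustration, not a production tool: it assumes the standard genetic code, reads codons from the first letter onward, writes stop codons as `*`, and writes any codon it does not recognise as `X`. The names `CODON_TABLE` and `translate` are chosen for this example.

```python
from itertools import product

# Standard genetic code in single-letter amino acids, with codons ordered
# UUU, UUC, UUA, UUG, UCU, ... (bases cycle in the order U, C, A, G).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate RNA (or DNA, with T read as U) three letters at a time."""
    rna = rna.upper().replace("T", "U")
    protein = []
    # Stop before any incomplete codon at the end of the sequence.
    for i in range(0, len(rna) - 2, 3):
        protein.append(CODON_TABLE.get(rna[i:i + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGACAACUCCAAAG"))
```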
DNA potentially encodes six proteins
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
We call these READING FRAMES
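The six reading frames can be listed by computer. A small sketch, assuming plain A/C/G/T input; the helper names `reverse_complement` and `reading_frames` are illustrative. Frames 0 to 2 come from the strand as given and frames 3 to 5 from the opposite strand, each read 5’ to 3’.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def reading_frames(dna: str) -> list:
    """Return six lists of codons: three offsets on each strand."""
    frames = []
    for strand in (dna.upper(), reverse_complement(dna)):
        for offset in range(3):
            # Only complete codons are kept; trailing bases are dropped.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


if __name__ == "__main__":
    seq = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
    for number, codons in enumerate(reading_frames(seq)):
        print(number, " ".join(codons[:2]))
```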
5’ CAT CAA
5’ ATC AAT
5’ TCA ATG
5’ CATCAATGACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTACTGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
Proteins almost always start with M (ATG)
TAG, TAA and TGA are all STOP
This can help narrow it down
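One way to use the start and stop rules is to look for stretches that begin with ATG and run, three letters at a time, to the first stop codon. The sketch below is a simplification: it scans only the strand it is given (run it again on the reverse complement for the other three frames), ignores stretches that never reach a stop, and uses the made-up name `find_orfs`.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def find_orfs(dna: str) -> list:
    """Return (start, end) index pairs for each ATG ... STOP stretch.

    `start` is the index of the A in ATG; `end` is one past the stop codon,
    so dna[start:end] is the whole stretch including the stop.
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                orfs.append((start, i + 3))
                break
    return orfs


if __name__ == "__main__":
    example = "CCATGAAATAGGG"
    for start, end in find_orfs(example):
        print(start, end, example[start:end])
```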
Once you know the sequence of the
protein, you can figure out if it has
been studied already.
You may even be able to track down
a likely structure
There are three major public DNA databases
EMBL: housed at EBI (European Bioinformatics Institute)
GenBank: housed at NCBI (National Center for Biotechnology Information)
DDBJ: housed in Japan
Page 16
www.ncbi.nlm.nih.gov
PubMed is…
• National Library of Medicine's search service
• 12 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on side bar)
BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 80,000 searches per day
TaxBrowser is…
• browser for the major divisions of living organisms
(archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
From the NCBI home page, type “lectin” and hit “Search”
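A similar search can also be set up from a script. This is a hedged sketch that assumes NCBI's E-utilities `esearch` endpoint and the `pubmed` database, neither of which appears in the slides. It only builds the query URL and does not contact the server; `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities "esearch" (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term: str, database: str = "pubmed", max_results: int = 20) -> str:
    """Build an esearch URL for a search term without sending the request."""
    params = {"db": database, "term": term, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("lectin"))
```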
PubMed
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations
and author abstracts from over 4,600 journals
published in the United States and in 70 foreign
countries.
It has 12 million records dating back to 1966.
Page 35
BLAST
BLAST looks for similarity between your favorite
query sequence and other known protein or DNA
sequences.
Applications include
• identifying homologs (orthologs and paralogs)
• discovering new genes or proteins
• discovering variants of genes or proteins
• investigating expressed sequence tags (ESTs)
• exploring protein structure and function
page 88
Four components to a BLAST search
(1) Obtain the sequence (query)
(2) Select the BLAST program
(3) Enter sequence
(4) Choose optional parameters
Then click “BLAST”
page 88
Step 2: Choose the BLAST program
blastn (nucleotide BLAST)
blastp (protein BLAST)
tblastn (translated BLAST)
blastx (translated BLAST)
tblastx (translated BLAST)
DNA potentially encodes six proteins
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
Choose the BLAST program
Program   Input     Database   Frame comparisons
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
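The program choices listed above can be read as a lookup from the type of query and the type of database. A small sketch, with made-up function names and a hypothetical `translate_dna` flag to tell blastn from tblastx; the comparison counts follow from six reading frames for every DNA side that is translated.

```python
def choose_blast_program(query_type: str, database_type: str,
                         translate_dna: bool = False) -> str:
    """Pick a BLAST program from the query and database types.

    Types are "DNA" or "protein". For DNA against DNA, translate_dna=True
    selects tblastx (both sides translated) instead of blastn.
    """
    key = (query_type.lower(), database_type.lower())
    if key == ("dna", "dna"):
        return "tblastx" if translate_dna else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("dna", "protein"): "blastx",
        ("protein", "dna"): "tblastn",
    }
    if key not in programs:
        raise ValueError(f"unknown combination: {query_type}, {database_type}")
    return programs[key]


def frame_comparisons(program: str) -> int:
    """Return how many reading-frame combinations the program compares."""
    query_frames = 6 if program in ("blastx", "tblastx") else 1
    database_frames = 6 if program in ("tblastn", "tblastx") else 1
    return query_frames * database_frames


if __name__ == "__main__":
    program = choose_blast_program("DNA", "protein")
    print(program, frame_comparisons(program))
```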
Step 3: choose the database
nr = non-redundant protein (the most general database)
You can also search specific organisms, or DNA rather than protein
(although searching ALL DNA is going to take a long time…)
filtering
So now you can
• Find any sequence in the database
• Find relevant publications
• Match DNA to protein sequence
• Find database matches to DNA or protein
• Find conserved domains in protein
• Find the 3D structure of a protein
…Without doing any experiments!