The Atlas of Protein Sequences


Please use Linux today if possible!
Introduction to Molecular
Biology Databases
Alinda Nagy & Hedi Hegyi, PhD
@ Institute of Enzymology,
Budapest
The BioSapiens Permanent
School of Bioinformatics
Budapest, Sept 4-8, 2006
Databases
What is a database?
• A database is a structured collection of
information. (An organized array of
information.)
• A database consists of basic objects called
records or entries.
• Each record consists of fields, which hold
defined data that is related to that
record.
• For example, a protein database would
typically have proteins as records and
protein properties as fields (e.g. name,
length, sequence, taxonomic origin).
Noam Kaplan
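The record/field idea can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord`, the field names and the two entries are made up for the example and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One database record; each attribute plays the role of a field."""
    name: str
    sequence: str
    organism: str

    @property
    def length(self) -> int:
        # Length is derived from the sequence field rather than stored twice.
        return len(self.sequence)


# A tiny in-memory "database": a list of records (hypothetical entries).
database = [
    ProteinRecord("example_kinase", "MKTAYIAKQR", "Homo sapiens"),
    ProteinRecord("example_protease", "MSTAVLENPG", "Mus musculus"),
]

for record in database:
    print(record.name, record.length, record.organism)
```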
What is a database?
• A database is searchable (index)
-> table of contents
• A database is updated periodically
(release)
-> new edition
• A database is cross-referenced
(hyperlinks)
-> links with other databases
Why Databases?
• The purpose of databases is not merely to collect
and organize data, but mainly to allow advanced
data retrieval.
• A query is a method to retrieve information from
the database.
• The organization of each record into
predetermined fields allows us to use queries on
fields.
• Example: Find all human proteins that are
enzymes and have a length of 1000-1200 aa.
Noam Kaplan
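A query on fields like the example above can be sketched as a simple filter. The records, the field names (`organism`, `is_enzyme`, `length`) and the function `query` are hypothetical; real databases use their own schemas and query languages.

```python
# Hypothetical records; each dictionary key corresponds to a field.
records = [
    {"name": "enzyme_A", "organism": "human", "is_enzyme": True, "length": 1100},
    {"name": "enzyme_B", "organism": "human", "is_enzyme": True, "length": 450},
    {"name": "receptor_C", "organism": "human", "is_enzyme": False, "length": 1150},
    {"name": "enzyme_D", "organism": "mouse", "is_enzyme": True, "length": 1050},
]


def query(records, organism, is_enzyme, min_length, max_length):
    """Return the records whose fields satisfy all three conditions."""
    return [
        r for r in records
        if r["organism"] == organism
        and r["is_enzyme"] == is_enzyme
        and min_length <= r["length"] <= max_length
    ]


# "Find all human proteins that are enzymes and have a length of 1000-1200 aa."
hits = query(records, "human", True, 1000, 1200)
print([r["name"] for r in hits])
```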
Databases on the Internet
• Biological databases often have a web interface,
which allows the user to send queries to the
database.
• Some databases can be accessed by different web
servers, each offering a different interface.
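[Diagram: the user sends a request to the web server, which sends a query to the database server; the result is returned to the web server, which delivers a web page to the user.]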
Noam Kaplan
Databases on the Internet
Information system
Query system
Storage System
Data
Francis Ouellette
Databases on the Internet
Information system
Query system
Storage System
Data
- GenBank flat file
- PDB file
- Interaction Record
- Title of a book
- Book
Francis Ouellette
Databases on the Internet
Information system
Query system
Storage System
Data
- Boxes
- Oracle
- MySQL
- PC binary files
- Unix text files
- Bookshelves
Francis Ouellette
Databases on the Internet
Information system
Query system
- A list you look at
- A catalogue
- Indexed files
- SQL
- grep
Storage System
Data
Francis Ouellette
Databases on the Internet
Information system
- The UBC library
- Google
- Entrez (NCBI)
- SRS (Sequence Retrieval System)
Query system
Storage System
Data
Francis Ouellette
Database download
• Nearly all biological databases are available
for download as simple text files.
• A local version of the database removes
limitations on how you process the data.
• Processing data in files requires some minimal
computer-programming skills.
– Perl is an easy-to-learn programming language that can be
used for extraction and analysis of data from files.
Noam Kaplan
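As a rough illustration of how little code is needed to process a downloaded text file, here is a minimal FASTA reader. It is written in Python rather than Perl, and the function name `parse_fasta` and the two entries are invented for the example; it assumes a plain FASTA layout with ">" header lines.

```python
def parse_fasta(text):
    """Return a dict mapping the first word of each header to its sequence."""
    sequences = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            sequences[current_id] = []
        elif current_id is not None:
            # Sequence lines are collected and joined at the end.
            sequences[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in sequences.items()}


# Hypothetical two-entry file content.
example = """>prot1 hypothetical example entry
MKTAYIAKQR
QISFVKSHFS
>prot2 another made-up entry
MSTAVLENPG
"""

parsed = parse_fasta(example)
for seq_id, seq in parsed.items():
    print(seq_id, len(seq))
```

For a real file, the text would be read with `open(path).read()` before calling the function.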
Tour of the major
molecular biology databases
• There is a tremendous amount of
information about biomolecules in publicly
available databases.
• Today, we will just look at some of the main
databases and what kind of information
they contain.
• Exercises will give you a little practice at
browsing databases.
List of molecular biology
databases
• Nucleic Acids Research publishes an annual
database issue. The 2006 update of the
online Molecular Biology Database
Collection includes 858 databases
• http://www3.oup.co.uk/nar/database/c/
Large Growth in the Number of
Biological Databases
[Figure: number of databases listed in the NAR Database Issue per year, 1996-2006; vertical axis from 0 to 1000.]
Molecular biology data types
Organisms
Genome maps
Mouse chromosome X,
from the Mouse Genome Informatics project
http://www.informatics.jax.org/
Lei Liu
Molecular biology data types
Organisms
Genome maps
DNA sequences
RNA sequences
...AATGGTACCGATGACCTGGAGCTTGGTTCGA...
Lei Liu
Molecular biology data types
Organisms
Genome maps
DNA sequences
RNA sequences
Protein sequences
...TRLRPLLALLALWPPPPARAFVNQHLCGSHLVEA...
Lei Liu
Molecular biology data types
Organisms
Genome maps
DNA sequences
RNA sequences
RNA structures
Protein sequences
Protein structures
PDB entry 1CIS
P.Osmark, P.Sorensen, F.M.Poulsen
Lei Liu
Molecular biology data types
Organisms
Genome maps
DNA motifs
DNA sequences
RNA expression
RNA sequences
RNA structures
Protein sequences
Protein structures
Protein motifs
Lei Liu
Types of molecular biology
databases
14 main NAR categories:
Nucleotide Sequence
RNA sequence
Protein sequence
Structure
Genomics (non-vertebrate)
Metabolic and Signaling Pathways
Human and other Vertebrate Genomes
Human Genes and Diseases
Microarray Data and other Gene Expression
Proteomics Resources
Other
Organelle
Plant
Immunological
Resources are Becoming More
Diverse
[Figure: pie charts of database types by NAR Database Categories, 2004 and 2006.]
NAR 2006 – A Closer Look
• Genome scale databases
have proliferated
• Traditional sequence
databases are now a
small part
• Databases around new
specific data types are
emerging
• Pathway and disease
oriented databases
are emerging
[Figure: pie chart of NAR 2006 database types.]
Database searches
Using a database
• How to get information out of a database:
– Summaries: how many entries, average or
extreme values
– Browsing: no targeted information to retrieve
– Search: looking for particular information
• Searching a database:
– Must have a key that identifies the element(s)
of the database that are of interest.
• Name of gene
• Sequence of gene
• Other information
Larry Hunter
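The "summaries" use case (number of entries, average and extreme values) can be sketched as follows. The entry names and lengths are made up; only the kind of summary is taken from the slide.

```python
from statistics import mean

# Hypothetical protein lengths keyed by entry name.
lengths = {"prot1": 120, "prot2": 1045, "prot3": 388}

summary = {
    "entries": len(lengths),
    "mean_length": mean(list(lengths.values())),
    "shortest": min(lengths, key=lengths.get),  # entry with the smallest length
    "longest": max(lengths, key=lengths.get),   # entry with the largest length
}

for key, value in summary.items():
    print(key, value)
```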
Searching sequence databases
• Start from sequence, find information about it
• Many kinds of input sequences
– Could be amino acid or nucleotide sequence
– Genomic or mRNA/cDNA or protein sequence
– Complete or fragmentary sequences
• Exact matches are rare (and often uninteresting),
so the goal is usually to retrieve a set
of similar sequences.
– Both small (mutations) and large (required for
function) differences within “similar” can be
interesting.
Larry Hunter
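One simple way to put a number on "similar" is percent identity between two already aligned sequences of equal length. The sketch below is a deliberately simplified stand-in: real search tools align the sequences first and score them with substitution matrices. The function name and the two short sequences are invented for the example.

```python
def percent_identity(seq_a, seq_b):
    """Percent of aligned positions with identical residues (gaps never match)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return 100.0 * matches / len(seq_a)


# Two hypothetical sequences differing at two of ten positions.
print(percent_identity("MKTAYIAKQR", "MKSAYLAKQR"))
```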
What might we want
to know about a sequence?
• Is this sequence similar to any known genes?
How close is the best match? Significance?
• What do we know about that gene?
– Genomic (chromosomal location, allelic information,
regulatory regions, etc.)
– Structural (known structure? structural domains?
etc.)
– Functional (molecular, cellular & disease)
• Evolutionary information:
– Is this gene found in other organisms?
– What is its taxonomic tree?
Larry Hunter
What can be discovered about
a gene by a database search?
• A little or a lot, depending on the gene
– Evolutionary information: homologous genes, taxonomic
distributions, allele frequencies, synteny, etc.
– Genomic information: chromosomal location, introns,
UTRs, regulatory regions, shared domains, etc.
– Structural information: associated protein structures,
fold types, structural domains
– Expression information: expression specific to
particular tissues, developmental stages, phenotypes,
diseases, etc.
– Functional information: enzymatic/molecular function,
pathway/cellular role, localization, role in diseases
Larry Hunter
NCBI and Entrez
• One of the most useful and comprehensive
sources of databases is the NCBI (National
Center for Biotechnology Information), part
of the NIH (National Institutes of Health).
• NCBI provides interesting summaries,
browsers for genome data, and search tools
• Entrez is their database search interface
http://www.ncbi.nlm.nih.gov/Entrez
• Can search on gene names, sequences,
chromosomal location, diseases, keywords, ...
Larry Hunter
BLAST: Searching with a
sequence
• The goal is to find other sequences that are
more similar to the query than would be
expected by chance (and are therefore
likely to be homologous).
• Can start with nucleotide or amino acid
sequence, and search for either (or both)
• Many options
– E.g. ignore low information (repetitive) sequence,
set significance critical value
– Defaults are not always appropriate: READ THE
NCBI EDUCATION PAGES!
Larry Hunter
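"More similar than expected by chance" is usually quantified with an expected number of chance hits (the E-value). A common textbook form is the Karlin-Altschul expression E = K * m * n * exp(-lambda * S), where S is the raw alignment score, m and n are the query and database lengths, and K and lambda depend on the scoring system. The sketch below assumes that form; the function name and every numeric value are illustrative assumptions, and BLAST itself applies further corrections (for example effective lengths).

```python
import math


def expected_hits(raw_score, query_length, database_length, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


# Illustrative parameters only; not taken from any particular BLAST run.
LAMBDA = 0.267
K = 0.041

for score in (50, 100, 200):
    print(score, expected_hits(score, 300, 1_000_000, LAMBDA, K))
```

A hit would then be kept only if its E-value is below the chosen critical value.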
• Major choices:
– Translation
– Database
– Filters
– Restrictions
– Matrix
Larry Hunter
[Screenshots of BLAST results:]
Close hit: Rat ADH alpha
Distant hit: Human sorbitol dehydrogenase
Parameters (at bottom!)
Click on:
Larry Hunter
BLAST searches online
• http://www.ncbi.nlm.nih.gov/BLAST/
• Sequences:
>ENSP00000002501 pep:known chr:NCBI36:16:88598804:88613382
MEPPEGAGTGEIVKEAEVPQAALGVPAQGTGDNGHTPVEEEVGGIPVPAPGLLQVTERRQ
PLSSVSSLEVHFDLLDLTELTDMSDQELAEVFADSDDENLNTESPAGLHPLPRAGYLRSP
SWTRTRAEQSHEKQPLGDPERQATVLDTFLTVERPQED
>ENSP00000314902 chr:18 gene:ENSG00000176890 tr:ENST00000323250
MPVAGSELPRRPLPPAAQERDAEPRPPHGELQYLGQIQHILRCGVRKDDRTGTGTLSVFG
MQARYSLRDYSGQGVDQLQRVIDTIKTNPDDRRIIMCAWNPRDLPLMALPPCHALCQFYV
VNSELSCQLYQRSGDMGLGVPFNIASYALLTYMIAHITGLKPGDFIHTLGDAHIYLNHIE
PLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYNPHPTIKMEMAV
BLAST output for ENSP00000002501
BLAST output for ENSP00000314902
Take home messages
• There are a lot of molecular biology databases,
containing a lot of valuable information
• Not even the best databases have everything
(or the best of everything)
• These databases are moderately well cross-linked, and there are “linker” databases
• Sequence is a good identifier, maybe even
better than gene name!
Larry Hunter
Protein sequence databases
• General sequence databases (e.g. UniProt)
• Protein properties (e.g. PFD – Protein Folding Database)
• Protein localization and targeting
(e.g. NPD - Nuclear Protein Database)
• Protein sequence motifs and active sites
(e.g. BLOCKS, InterPro, PROSITE, PRINTS)
• Protein domain databases; protein classification
(e.g. InterPro, ProDom, SMART, Pfam)
• Databases of individual protein families
(e.g. Histone Database)
http://www3.oup.co.uk/nar/database/cat/1
UniProt
( The Universal Protein Resource)
http://www.uniprot.org/
ftp://ftp.uniprot.org/pub/databases/
Wu CH et al. The Universal Protein Resource (UniProt):
an expanding universe of protein information.
Nucleic Acids Res. 2006 Jan 1;34(Database issue):
D187-91.
Margaret Dayhoff
• The first protein
database, called the Atlas of
Protein Sequences, was created
by Margaret Dayhoff.
• It was a book.
The Atlas of Protein
Sequences
• Dayhoff had the idea that a compilation of
all protein sequences in the literature into
one resource would be a useful research
tool.
• She and her co-workers collected all known
sequences and published them together.
• Then, when a new sequence was obtained,
there was a single resource available for
determining its relationship to other known
sequences.
What is UniProt
• The world's most comprehensive catalog of information
on proteins.
• Central repository of protein sequence and function.
• Created by joining the information contained in Swiss-Prot,
TrEMBL, and PIR.
• Collaboration between EBI (European Bioinformatics
Institute), SIB (Swiss Institute of Bioinformatics) and
PIR (Protein Information Resource); DDBJ to join.
• Funded mainly by NIH.
• Three database components:
– UniProt Knowledgebase (UniProtKB)
– UniProt Reference Clusters (UniRef)
– UniProt Archive (UniParc)
What is UniProt
1. UniProt Knowledgebase (UniProtKB):
central access point for extensive curated protein information,
including function, classification, and cross-references;
comprises the manually annotated UniProtKB/Swiss-Prot section
and the automatically annotated UniProtKB/TrEMBL section
2. UniProt Reference Clusters (UniRef):
combines closely related sequences into a single record to speed
similarity searches; sequence space is compressed by
merging sequences that are 100% (UniRef100), 90% (UniRef90) or
50% (UniRef50) identical
3. UniProt Archive (UniParc):
comprehensive repository that stores all publicly available protein
sequences and reflects the history of
sequence data, with links to the source databases
What is UniProt
The UniProt databases collect both protein sequences
obtained through experimental determination and
protein sequences derived from the translation of
nucleotide sequences (which were predicted or
determined to code for a protein).
[Diagram: amino acid sequences determined through experimental analysis, together with translations from the nucleotide sequence databases (GenBank, EMBL, DDBJ), feed the protein sequence databases (PIR, Swiss-Prot, TrEMBL), where they are validated and enriched with specific information.]
UniProt Goals
• High level of annotation
• Minimal redundancy
• High level of integration with other databases
• Complete and up-to-date
Annotation concepts
UniParc:
No annotation
UniProtKB:
Annotated
UniRef:
No annotation, just description line of
UniProtKB or UniParc master entry in the
cluster for use in FASTA files
Minimal redundancy
UniParc:
All sequences that are 100% identical over their
entire length are merged into a single entry,
regardless of species. UniParc represents each
protein sequence once and only once, assigning it a
unique identifier. UniParc cross-references the
accession numbers of the source databases.
UniProtKB:
Aims to describe in a single record all protein
products derived from a certain gene (or genes if
the translation from different genes in a genome
leads to indistinguishable proteins) from a certain
species.
UniRef:
Merges sequences automatically.
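The UniParc rule (identical sequences are stored once, with cross-references back to the sources) can be sketched with a dictionary keyed by sequence. The identifiers of the form `ARCH000001`, the source database names and the accessions are all hypothetical and do not reflect real UniParc identifiers.

```python
def merge_identical(entries):
    """Merge (source_db, accession, sequence) tuples that share a sequence.

    Returns a dict mapping a made-up archive identifier to the unique
    sequence and the list of source cross-references.
    """
    by_sequence = {}
    for source_db, accession, sequence in entries:
        by_sequence.setdefault(sequence, []).append((source_db, accession))

    archive = {}
    for number, (sequence, xrefs) in enumerate(by_sequence.items(), start=1):
        archive[f"ARCH{number:06d}"] = {"sequence": sequence, "xrefs": xrefs}
    return archive


# Hypothetical input: the first and third entries carry the same sequence.
entries = [
    ("dbA", "A1", "MKTAYIAKQR"),
    ("dbA", "A2", "MSTAVLENPG"),
    ("dbB", "B7", "MKTAYIAKQR"),
]

archive = merge_identical(entries)
for archive_id, record in archive.items():
    print(archive_id, record["sequence"], record["xrefs"])
```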
Integration with other
databases
UniParc:
Linked back to source records
UniProtKB:
Linked to >60 other databases
UniRef:
UniRef clusters link back to UniProtKB and
UniParc records in the cluster
Complete and up-to-date
UniParc:
All publicly available protein sequences,
updated every 2 weeks (05/06, Rel 8.0:
7,116,519 entries)
UniProtKB:
All suitable stable protein sequences, updated
every 2 weeks (05/06, Rel 8.0: 3,170,612
entries)
UniRef:
All protein sequences in UniProtKB and
UniParc useful for sequence similarity
searches, updated every 2 weeks (05/06, Rel
8.0: 3,511,676 UniRef100, 2,254,474
UniRef90, 1,148,123 UniRef50 entries)
An example
Exercise 1 – Text search
1. Go to ExPASy. Click "UniProt Knowledgebase (Swiss-Prot
and TrEMBL)" and then search for human cochlin.
Notice that there is a wealth of information about this
protein. Furthermore, there are many links to sequence
analysis tools (some of which you will learn about later) and some
other nice features. Note that this is merely a graphical
display of the original UniProtKB/Swiss-Prot database entry
(which is plain text).
2. Try to answer all of the questions below.
1. Which year was the NMR structure of the LCCL
domain determined?
2. Where is the protein expressed?
3. Which diseases are associated with the protein?
Exercise 2 – BLAST search
•1. Go to ExPASy. Click "UniProt Knowledgebase (Swiss-Prot and
TrEMBL)" and then "BLAST".
•2. Copy the following human amino acid sequence.
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDK
RSLPALTNIIKILRHDIGATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRH
GQPIPRVEYMEEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLSSRDF
LGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFAQFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGD
SIKAYGAGLLSSFGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQRIEVL
DNTQQLKILADSINSEIGILCSALQKIK
•3. Paste the sequence into the query sequence window and adjust the
options as necessary. You won't need to specify advanced options, but
you should choose a program and database. For simplicity, use e.g. the
UniProtKB database.
•4. Run the search and identify the protein. Use the link provided to
see the UniProtKB/SWISS-PROT report.
Exercise 2 – BLAST search
•5. Now, try to answer all of the questions below.
1. What is the SWISS-PROT primary accession number?
2. What is the common name of the protein?
3. What is the gene called?
4. Which year was the crystal structure of the catalytic domain
determined? Name the first author.
5. Does the enzyme require a co-factor to function? If so, what?
6. Name the most common disease that arises as a result of
deficiency of this enzyme.
7. How many amino acid residues are there in the protein?
8. What is the molecular weight of the protein?
Patterns and Profiles,
Protein Motifs and Domains
• InterPro - an integrated database of protein families, domains,
motifs and functional sites.
• Blocks - multiply aligned ungapped segments for the most highly
conserved regions of proteins.
• Motif - a server that scans databases to find motifs or patterns
and that can generate sequence profiles.
• Pfam - multiple sequence alignments and HMMs of protein domains
and families.
• PRINTS - database of groups of conserved motifs, or protein
fingerprints.
• ProDom - protein domain families automatically generated from
SWISS-PROT and TrEMBL.
• PROSITE - database of protein families and domains defined by
functional sites, patterns and profiles.
• SMART - Simple Modular Architecture Research Tool for the
identification of domains.
• COGS database - clusters of sequences determined by comparing
sequences from whole genomes.
InterPro
(Integrated resource of Protein
Families, Domains and Sites)
• http://www.ebi.ac.uk/interpro/
• ftp://ftp.ebi.ac.uk/pub/databases/interpro
• Mulder NJ et al. (2005) InterPro, progress and
status in 2005. Nucleic Acids Res. 33 (Database
Issue): D201-5.
What is InterPro
• Secondary protein databases on functional
sites and domains are vital resources for
identifying distant relationships in novel
sequences, and hence for predicting protein
function and structure.
• Unfortunately, these signature databases
do not share the same formats and
nomenclature, and each database has its
own strengths and weaknesses.
• Thus, for best results, search strategies
should ideally combine all of them.
What is InterPro
– InterPro is a collaborative project aimed at providing
an integrated layer on top of the most commonly
used signature databases by creating a unique,
non-redundant characterization of a given protein family,
domain or functional site.
– Integrates PROSITE, PRINTS, Pfam, ProDom,
SMART, TIGRFAMs, PIR superfamily,
SUPERFAMILY, Gene3D and PANTHER databases
and the addition of others is scheduled.
– Has cross-references to the BLOCKS database as
well as many specialized protein family and protein
structure databases.
InterPro
• The latest release of InterPro (12.1) contains
12,953 entries, with 78% coverage of all proteins
in UniProtKB.
• Each entry has annotation provided in the name,
GO mapping and abstract fields, and all matches
against the Swiss-Prot and TrEMBL components of
UniProt are precomputed and available for viewing
in different formats.
• Protein 3D structural information is integrated
from MSD, CATH and SCOP, and this data is
available in the match views to provide an
at-a-glance comparison of sequence and structural
domains.
InterPro
Dataflow scheme
InterProScan result
PROSITE
http://www.expasy.org/prosite/
Database of protein families and domains
PROSITE
• consists of a large collection of biologically
meaningful signatures that are described as
patterns or profiles that help to reliably identify
to which known protein family (if any) a new
sequence belongs
• the latest version (release 19.11) contains 1329
patterns and 552 profile entries
• each signature is linked to documentation
providing information on the protein family or
domain detected by the signature: origin of its
name, taxonomic occurrence, domain architecture,
function, 3D structure, main characteristics of
the sequence, domain size and some references
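A PROSITE pattern can be scanned against a sequence by translating it into a regular expression. The sketch below assumes the usual pattern conventions described in the PROSITE user manual: elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats, and "<" / ">" for the sequence ends. It covers only these basic cases. The function name `prosite_to_regex` and the test sequence are invented; the pattern N-{P}-[ST]-{P} is the well-known N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # patterns are conventionally ended by a period
    regex_parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):     # repeat count such as x(2) or x(2,4)
            element, _, counts = element[:-1].partition("(")
            repeat = "{" + counts + "}"
        if element == "x":
            core = "."                        # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"  # excluded residues
        else:
            core = element                    # single residue or [ABC] set
        regex_parts.append(prefix + core + repeat + suffix)
    return "".join(regex_parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)
for hit in re.finditer(regex, "AANGSAAPNPSP"):
    print(hit.start(), hit.group())
```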
PRINTS
http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
PRINTS
• The PRINTS database is a compendium of
protein fingerprints.
• A fingerprint is a group of conserved
sequence motifs that together provide
diagnostic signatures for protein families.
• Fingerprints are diagnostically more
powerful than single motifs because they make use
of the biological context inherent in a
multiple-motif method.
• The fingerprinting method is a reliable
technique for detecting members of large,
highly divergent protein super-families.
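The multiple-motif idea can be illustrated with a toy matcher that requires several motifs to occur in a fixed order. This is a simplification and not the PRINTS method itself, which scores aligned motifs rather than matching regular expressions; the three motifs and the sequences are made up for the example.

```python
import re


def match_fingerprint(sequence, motifs):
    """Return start positions of motifs found in order, stopping at the first miss."""
    positions = []
    search_from = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, search_from)
        if hit is None:
            break
        positions.append(hit.start())
        search_from = hit.end()  # the next motif must lie after this one
    return positions


# Hypothetical three-motif fingerprint.
fingerprint = ["GKT", "DE.D", "W..H"]

full = match_fingerprint("AAGKTLLDEADPPWQQHAA", fingerprint)
partial = match_fingerprint("AAGKTLL", fingerprint)
print(full, len(full) == len(fingerprint))
print(partial, len(partial) == len(fingerprint))
```

A sequence matching only one motif is weak evidence; matching all motifs in the expected order is much stronger.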
PFAM
http://www.sanger.ac.uk/Software/Pfam/
PFAM
• Database of multiple sequence alignments
and HMMs of protein domains and families.
• Profile hidden Markov models are
statistical models of the primary structure
consensus of a sequence family.
• The construction and use of Pfam is tightly
tied to the HMMER software package.
PFAM
• Composed of two sets of families:
– Pfam-A:
• curated part containing over 8296 protein families
– Pfam-B:
• automatically generated supplement containing a large
number of small families taken from the PRODOM
database that do not overlap with Pfam-A (lower
quality)
PFAM
Each family has the following data:
• A seed alignment, which is a hand-edited multiple alignment
representing the family.
• Hidden Markov Models (HMMs) derived from the seed alignment,
which can be used to find new members of the domain and also
to realign a set of sequences to the model. One HMM
is in ls mode (global), the other is in fs mode (local).
• A full alignment, which is an automatic alignment of all the
examples of the domain, using the two HMMs to find and then
align the sequences.
• Annotation, which contains a brief description of the domain, links
to other databases and some Pfam-specific data recording how
the family was constructed.
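To give a feel for how a seed alignment turns into a statistical description of a family, the sketch below computes per-column residue frequencies and a consensus. This is a much simpler stand-in for a profile HMM (no insert or delete states, no pseudocounts) and is not what HMMER does; the function names and the three-sequence alignment are invented.

```python
from collections import Counter


def column_profile(alignment):
    """Per-column residue frequencies of an alignment, ignoring gap characters."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({r: c / total for r, c in counts.items()} if total else {})
    return profile


def consensus(profile):
    """Most frequent residue per column ('-' for all-gap columns)."""
    return "".join(max(col, key=col.get) if col else "-" for col in profile)


# Hypothetical seed alignment of three sequences.
seed = ["MKTA", "MRTA", "MKSA"]
profile = column_profile(seed)
print(consensus(profile))
print(profile[1])
```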
A PFAM entry
A PFAM entry, cont’d
PFAM searches
PFAM results
PRODOM
http://www.toulouse.inra.fr/prodom.html
PRODOM
• Database of protein domain families
automatically generated from SWISS-PROT
and TrEMBL databases by sequence
comparison.
• Useful for analysing the domain
arrangements of complex protein families
and the homology relationships in modular
proteins.
• Contains (release 2003.1) 144,444 domain
families containing two or more individual
domains.
SMART
http://smart.embl-heidelberg.de/
Simple Modular Architecture Research Tool
SMART
• Allows the identification and annotation of protein
domains and the analysis of domain architectures.
• The current release has more than 600 domain
families represented among nuclear, signalling and
extracellular proteins.
• Extensive annotation for each domain family is
available, providing information on function,
subcellular localization, phyletic distribution and
tertiary structure, with links to OMIM in cases where
a human disease is associated with one or more
mutations in a particular domain.
Exercise 3 – Domain search
•1. Go to the PROSITE site.
•2. Under "Tools for PROSITE" choose ScanProsite.
•3. Paste the sequence below into the box and tick the option "Exclude
patterns with a high probability of occurrence" (finding very common
patterns will not tell you much about your protein).
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKKGPVGMPKE
ATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDGLTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASPETPELDMAAPALLNNPV
HQSVTMGETVSFCDVVGRPRPEITWEKQLEDRENVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFPLSVVRGHQAAATSESSPNGTAFPAAEL
KPPDSEDCGEEQTRWHFDAQANNCLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRWAYNSQTGQCQSFVYGGCEGNGNNFESREACEESP
FPRGNQRCRACKPRQKLVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEPLEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVG
ASSARRVRKLREVMHKKTCDVLKEFLGLH
•4. Start the scan.
Which are the motifs that are found?
Exercise 4 – Domain search
•1. Go to the Pfam site.
•2. Click "Search by protein name or sequence".
•3. Paste the sequence below into the box and choose "Both Global and
Fragment Pfam search".
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKKGPVGMPKE
ATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDGLTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASPETPELDMAAPALLNNPV
HQSVTMGETVSFCDVVGRPRPEITWEKQLEDRENVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFPLSVVRGHQAAATSESSPNGTAFPAAEL
KPPDSEDCGEEQTRWHFDAQANNCLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRWAYNSQTGQCQSFVYGGCEGNGNNFESREACEESP
FPRGNQRCRACKPRQKLVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEPLEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVG
ASSARRVRKLREVMHKKTCDVLKEFLGLH
•4. Search Pfam.
1. Which domains are found?
2. What may be the function of this protein?
Exercise 5: Blast searches on your computer
1. Download the file blast-2.2.14-ia32-linux.tar.gz
from ftp://ftp.ncbi.nih.gov/blast/executables/LATEST
2. Make a subdirectory in your home directory:
mkdir ~/blast
3. Move the blast file there:
mv blast-2.2.14-ia32-linux.tar.gz ~/blast/
4. Go to the blast directory:
cd ~/blast/
5. Unzip the file:
gunzip blast-2.2.14-ia32-linux.tar.gz
6. Unpack it:
tar -xvf blast-2.2.14-ia32-linux.tar
Exercise 5: Blast searches, cont’d
7. Get the first 100 human proteins in Swiss-Prot:
- go to http://www.expasy.org/srs5/
- click on Start
- deselect TrEMBL, to search only in Swiss-Prot
- press Continue
Exercise 5: Blast searches, cont’d
Select in the first Info line “Organism” and type in “human”
Press “Do Query”, this will retrieve all human proteins in Swissprot in
batches of 100
Exercise 5: Blast searches, cont’d
Press save
Exercise 5: Blast searches, cont’d
1. Change view to FastaSeqs
2. Change Sequence Format
to fasta
3. Press SAVE
Exercise 5: Blast searches, cont’d
8. Save the file, e.g. as 100seq.fa
9. Format your database of 100 sequences to make it
searchable by blast:
~/blast/blast-2.2.14/bin/formatdb -i 100seq.fa
10. Now that you have a searchable database, you can search it with an
input sequence of your choice. For example, make a file from the
first sequence in 100seq.fa: select the first sequence with
the mouse, type
cat > seq1.fa
paste the sequence into the terminal, then press <Ctrl-d>
11. Now that you have an input sequence and a database, type:
~/blast/blast-2.2.14/bin/blastall -p blastp -i seq1.fa -d 100seq.fa -o seq1-vs-100seq.blastp
12. After it finishes running (it will be ready almost immediately) you will find your
output in the file seq1-vs-100seq.blastp. If you invoke the blastall program
without the “switches” it will list all the options you can use.
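If the search is to be repeated for many query files, the same command line can be assembled from a script. The sketch below only uses the switches shown in the exercise (-p, -i, -d, -o) and the install path from the steps above; the helper names `build_blastall_command` and `run_blastall` are made up, and the search is not started unless `run_blastall` is called.

```python
import os
import subprocess


def build_blastall_command(blastall_path, program, query_file, database, output_file):
    """Assemble the blastall argument list used in the exercise."""
    return [
        blastall_path,
        "-p", program,      # search program, e.g. blastp
        "-i", query_file,   # input (query) sequence file
        "-d", database,     # database prepared with formatdb
        "-o", output_file,  # file that receives the report
    ]


def run_blastall(command):
    """Run the assembled command; raises CalledProcessError if blastall fails."""
    subprocess.run(command, check=True)


command = build_blastall_command(
    os.path.expanduser("~/blast/blast-2.2.14/bin/blastall"),
    "blastp",
    "seq1.fa",
    "100seq.fa",
    "seq1-vs-100seq.blastp",
)
print(" ".join(command))
# run_blastall(command)  # uncomment on a machine where blastall is installed
```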