Introduction to Bioinformatics
Junhui Wang
May 2004
Outline
• What is bioinformatics?
• Introduction to biological databases
• Sequence alignment
Why use bioinformatics?
• Explosive growth in the amount of biological information necessitates
the use of computers for cataloguing and retrieval.
• It is impossible to analyze this volume of data by manual inspection.
• Data mining: functional and structural information is important for studying
the molecular basis of diseases (and evolutionary patterns).
What is bioinformatics?
• A mixture of computer science, mathematics and biology.
• Development of new algorithms and statistics to assess
relationships among members of large data sets.
• Analysis and interpretation of various types of data.
• Development and implementation of tools to efficiently access
and manage different types of information.
Databases for bioinformatics
• Nucleotide databases & protein databases
• Primary databases & secondary databases
[Diagram: databases are classified as primary or secondary, and as DNA/RNA databases (GenBank/EMBL/DDBJ) or protein databases (PDB, SWISS-PROT/PIR)]
[Diagram: DNA -> RNA -> protein, with the corresponding data sources: genomic DNA databases (DNA); cDNA and ESTs (RNA); protein sequence databases (protein)]
There are three major public DNA databases
• EMBL: housed at EBI (European Bioinformatics Institute)
• GenBank: housed at NCBI (National Center for Biotechnology Information)
• DDBJ: housed in Japan
www.ncbi.nlm.nih.gov
PubMed is…
• National Library of Medicine's search service
• 11 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on side bar)
Entrez is…
• a search and retrieval system that integrates NCBI databases:
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data.
BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 80,000 searches per day
OMIM is…
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick
Books is…
• searchable resource of on-line books
TaxBrowser is…
• browser for the major divisions of living organisms (bacteria, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
Structure site includes…
• Molecular Modeling Database (MMDB)
• biopolymer structures obtained from
the Protein Data Bank (PDB)
• a 3D-structure viewer
Four questions we can answer at NCBI (and elsewhere):
[1] How can I do a literature search using PubMed?
[2] How can WelchWeb help?
[3] How can I use Entrez to find information about a particular gene or protein?
[4] How can I find information about a particular disease?
Question #1: How can I use PubMed at NCBI to find literature information?
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations
and author abstracts from over 4,000 journals
published in the United States and in 70 foreign
countries.
It has 12 million records dating back to 1966.
MeSH is the acronym for "Medical Subject Headings."
MeSH is the list of the vocabulary terms used
for subject analysis of biomedical literature at NLM.
MeSH vocabulary is used for indexing journal articles
for MEDLINE.
The MeSH controlled vocabulary imposes uniformity
and consistency to the indexing of biomedical literature.
PubMed search strategies
Try the tutorial (“Education” on the left sidebar)
Use Boolean queries: AND, OR, NOT
Try using “Limits”
Try “LinkOut” to find external resources
Obtain articles online via the Welch Medical Library
(and download PDF files):
http://www.welch.jhu.edu/
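A Boolean query simply joins search terms with AND, OR and NOT. The helper below is a minimal sketch for assembling such a query string; the function name `pubmed_query` and the example terms are illustrative assumptions, not part of PubMed itself. The resulting string can be pasted into the PubMed search box.

```python
def pubmed_query(all_of=(), any_of=(), none_of=()):
    """Combine search terms with the Boolean operators AND, OR and NOT.

    all_of  -- terms that must all appear (joined with AND)
    any_of  -- alternatives, at least one must appear (joined with OR)
    none_of -- terms to exclude (each prefixed with NOT)
    """
    parts = list(all_of)
    if any_of:
        # Parentheses keep the OR group together when combined with AND.
        parts.append("(" + " OR ".join(any_of) + ")")
    query = " AND ".join(parts)
    for term in none_of:
        query += " NOT " + term
    return query.strip()


query = pubmed_query(
    all_of=["bioinformatics"],
    any_of=["alignment", "BLAST"],
    none_of=["review"],
)
print(query)  # bioinformatics AND (alignment OR BLAST) NOT review
```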
Question #2: How can I use WelchWeb (from the Welch Medical Library) to do
literature searches?
WelchWeb is available at http://www.welch.jhu.edu
WelchWeb services: e-mail gateway, PubMed gateway, library catalog, remote access to Welch services, request literature, browse journals, browse databases
Question #3: How can I use NCBI (or other sites) to find information about a protein or gene?
Four ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
[3] UniGene
[4] ExPASy Sequence Retrieval System (this is separate from NCBI)
[1] LocusLink with RefSeq
LocusLink is a good starting point: it collects
key information on each gene/protein from
major databases. It now covers 8 organisms.
RefSeq provides a curated, optimal accession
number for each DNA (e.g., NM_006744)
or protein (e.g., NP_007635) sequence.
[2] Entrez
Entrez is divided into sites for nucleotide, protein,
structure, genomes, OMIM, and more. You can use limits
(such as RefSeq) to focus your Entrez search.
The GenBank flatfile:
• the elementary unit of information
• one of the most commonly used formats
• LOCUS: locus name, sequence length, molecule type,
GenBank division code, and date
• DEFINITION: summarizes the biology of the record
(genus and species, product name, …)
• ACCESSION: an accession number is a label used to identify a
sequence. It is a string of letters and/or numbers that corresponds to
a molecular sequence.
• VERSION: the accession number with its version
• GI: the gi number (GenInfo identifier)
The GenBank flatfile (cont.):
• KEYWORDS: terms that identify the particular entry; not very useful
• SOURCE: either the common name of the organism or its
scientific name
• REFERENCE: at least one reference or citation, published or
unpublished; the MEDLINE and PubMed identifiers provide links to the
MEDLINE and PubMed databases.
• COMMENT: remarks that refer to the whole record.
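The keyword layout described above can be read with a few lines of code. This is a minimal sketch, assuming each top-level keyword starts in the first column and continuation lines are indented; the record in `SAMPLE_HEADER` is entirely made up for illustration (it is not a real GenBank entry), and `parse_flatfile_header` is a hypothetical helper name.

```python
# Made-up header for illustration only; not a real GenBank record.
SAMPLE_HEADER = """\
LOCUS       EXAMPLE01                120 bp    mRNA    linear   PRI 01-JAN-2004
DEFINITION  Homo sapiens made-up example protein (EXMP), mRNA,
            complete cds.
ACCESSION   X00000
VERSION     X00000.1  GI:0000000
KEYWORDS    .
SOURCE      Homo sapiens (human)
"""


def parse_flatfile_header(text):
    """Return a dict mapping each top-level keyword to its text."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] != " ":
            # A new keyword starts in the first column.
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
        elif key is not None:
            # Indented lines (continuations and sub-keywords) are
            # appended to the most recent top-level keyword.
            fields[key] += " " + line.strip()
    return fields


fields = parse_flatfile_header(SAMPLE_HEADER)
print(fields["ACCESSION"])   # X00000
print(fields["DEFINITION"])
```

For real records, an established parser (for example the one in Biopython) is a safer choice than this sketch.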
Graphics format
[3] UniGene
UniGene collects expressed sequence tags (ESTs)
into clusters, in an attempt to form one gene per cluster.
Use UniGene to study where your gene is expressed
in the body, when it is expressed, and how abundant it is.
[4] ExPASy SRS
There are many bioinformatics servers outside NCBI.
Try ExPASy’s sequence retrieval system at
http://www.expasy.ch/
(ExPASy = Expert Protein Analysis System)
Question #4: How can I find information about a particular disease?
Answer: Try OMIM
Two main types of disease database: general and locus-specific
General:
OMIM
GeneCards (Weizmann): http://bioinformatics.weizmann.ac.il/cards/
Genes & Disease (at NCBI): http://www.ncbi.nlm.nih.gov/disease/
Locus-specific:
Human Gene Mutation Database (HGMD): http://archive.uwcm.ac.uk/uwcm/mg/docs/oth_mut.html
Comparative method: sequence alignment
• Biology has a long tradition of comparative analysis leading to discovery.
• Compare the similarities and differences in biological data
to infer structural, functional, and evolutionary relationships.
• The most common comparative method is alignment.
• The focus here is sequence alignment.
Sequence alignment
• Definition: an alignment provides an explicit mapping between the residues of two or
more sequences.
• Sufficiently high similarity typically implies homology.
• Homology: similarity due to descent from a common ancestor.
Homology information is carried in genes.
• Two types, according to the number of sequences:
Pairwise alignment: two sequences are compared
Multiple alignment: more than two sequences are involved
Pairwise alignment: simple example
• Before inserting a gap
query sequence:  AGGVLAQVG
object sequence: AGGVLQVG
5 identical residues
• After inserting a gap
query sequence:  AGGVLAQVG
object sequence: AGGVL-QVG
8 identical residues
Gap: an insertion or deletion
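The residue counts in this example can be checked by comparing the two sequences column by column. A minimal sketch; the function name `count_identical` is an illustrative choice, and "-" is assumed to be the gap character.

```python
def count_identical(a, b, gap="-"):
    """Count columns where both sequences carry the same residue.

    zip() stops at the shorter sequence, which mirrors laying the
    ungapped sequences side by side from their first residue.
    """
    return sum(1 for x, y in zip(a, b) if x == y and x != gap)


print(count_identical("AGGVLAQVG", "AGGVLQVG"))   # 5 (before inserting a gap)
print(count_identical("AGGVLAQVG", "AGGVL-QVG"))  # 8 (after inserting a gap)
```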
Pairwise sequence alignment
• Each aligned pair receives a value that depends on its content.
• The total score is the sum of these values.
• There may be many alignments with the maximal score.
• Example scoring scheme:
identical characters (match): +1
different characters (mismatch): -1
gap: -1
Pairwise alignment: simple example
Before inserting gaps:
TCATG
CATTG
After inserting gaps (two alignments, each with score = 4 - 2 = 2):
TCAT-G      TCA-TG
-CATTG      -CATTG
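The scores in this example follow directly from the +1 / -1 / -1 scheme. The sketch below sums the column values of an already gapped alignment; `alignment_score` is an illustrative name, and "-" is assumed to mark a gap.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Sum the per-column values of two aligned (equal-length) strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap          # a residue facing a gap
        elif x == y:
            score += match        # identical characters
        else:
            score += mismatch     # different characters
    return score


print(alignment_score("TCATG", "CATTG"))    # -1 (2 matches, 3 mismatches)
print(alignment_score("TCAT-G", "-CATTG"))  # 2  (4 matches, 2 gaps)
print(alignment_score("TCA-TG", "-CATTG"))  # 2  (a second alignment with the same score)
```

The last two calls illustrate that more than one alignment can reach the maximal score.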
Similarity search: dot plot
[Figure: dot plot of the sequence MTFRDLLSVSFEGPRPDSSAGGSSAGG (horizontal axis) against a sequence beginning MTFRDLLSVS (vertical axis); a dot marks each position where the two residues are identical]
Pairwise alignment: dot plot
[Figure panels: identical sequences; highly similar sequences; somewhat similar sequences]
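A dot plot is easy to generate: put one sequence on each axis and mark every cell where the two residues are identical. A minimal text-mode sketch follows; the function names are illustrative, and practical dot-plot programs usually add window-based filtering, which is omitted here.

```python
def dot_matrix(seq_x, seq_y):
    """matrix[row][col] is True when seq_y[row] == seq_x[col]."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Draw the dot matrix as text, with seq_x across the top."""
    rows = ["  " + " ".join(seq_x)]
    for label, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        rows.append(label + " " + " ".join("." if hit else " " for hit in row))
    return "\n".join(rows)


seq = "MTFRDLLSVS"  # first ten residues of the sequence in the figure
print(render(seq, seq))
```

Comparing a sequence with itself gives an unbroken main diagonal; repeated residues (here L and S) appear as extra off-diagonal dots.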
Similarity search and alignment
• Exhaustive methods:
Needleman-Wunsch: global alignment
Smith-Waterman: local alignment
• Common programs:
FastA (1985)
compares a query sequence against a database of sequences
BLAST (Basic Local Alignment Search Tool) (1990)
an improvement on FastA
Alignment models: global &local
• Global alignment: take all of one sequence and align it with all of a second
sequence.
• Disadvantage: short, highly similar subsequences may be missed in the
alignment.
• Input: two sequences of similar length
(if the sequences differ in length, spaces may be introduced)
Output: the best similarity score between the sequences
Example: aligning ACCTGC with TACGTG gives
-ACC-TGC
TAC-GTG-
Needleman-Wunsch Algorithm (1970)
• The first step is to place the two sequences along the margins of a matrix
and put a 1 wherever the two sequences match and a 0 elsewhere.
• For each element in the matrix, perform the following operation:
M_{i,j} = M_{i,j} + \max(M_{k,j+1}, M_{i+1,n})
where k is any integer larger than i and n is any integer larger than j.
• In words: alter the matrix by adding to each element the largest element from the
row just below and to the right of that element, and from the column just to the right
of and below that element.
• The number contained in each cell of the matrix, after this operation is
completed, is the largest number of identical pairs that can be found if that
element is the origin for a pathway which proceeds to the upper left.
Example (matrix partly processed)
    A  D  L  G  A  V  F  A  L  C  D  R  Y  F  Q
A   1  0  0  0  1  0  0  1  0  4  3  2  1  1  0
D   0  1  0  0  0  0  0  0  0  4  4  2  1  1  0
L   0  0  1  0  0  0  0  0  1  4  3  2  1  1  0
G   0  0  0  1  0  0  0  0  5  4  3  2  1  1  0
R   0  0  0  0  0  0  0  0  5  4  3  3  1  1  0
T   0  0  0  0  0  0  0  0  5  4  3  2  1  1  0
Q   0  0  0  0  0  0  0  0  5  4  3  2  1  1  1
N   0  0  0  0  0  0  0  0  5  4  3  2  1  1  0
C   0  0  0  0  0  0  0  0  4  5  3  2  1  1  0
D   0  1  0  0  0  0  0  0  3  3  4  2  1  1  0
R   0  0  0  0  0  0  0  0  2  2  2  3  1  1  0
Y   0  0  0  0  0  0  0  0  2  2  2  2  2  1  0
Y   0  0  0  0  0  0  0  0  1  1  1  1  2  1  0
Q   0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
Example (completed matrix)
    A  D  L  G  A  V  F  A  L  C  D  R  Y  F  Q
A   9  7  6  6  7  6  6  7  5  4  3  2  1  1  0
D   7  8  6  6  6  6  6  6  5  4  4  2  1  1  0
L   6  6  7  5  5  5  5  5  6  4  3  2  1  1  0
G   5  5  5  6  5  5  5  5  5  4  3  2  1  1  0
R   5  5  5  5  5  5  5  5  5  4  3  3  1  1  0
T   5  5  5  5  5  5  5  5  5  4  3  2  1  1  0
Q   5  5  5  5  5  5  5  5  5  4  3  2  1  1  1
N   5  5  5  5  5  5  5  5  5  4  3  2  1  1  0
C   4  4  4  4  4  4  4  4  4  5  3  2  1  1  0
D   3  4  3  3  3  3  3  3  3  3  4  2  1  1  0
R   2  2  2  2  2  2  2  2  2  2  2  3  1  1  0
Y   2  2  2  2  2  2  2  2  2  2  2  2  2  1  0
Y   1  1  1  1  1  1  1  1  1  1  1  1  2  1  0
Q   0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
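The matrix operation described for the Needleman-Wunsch algorithm can be sketched in a few lines. This is a minimal illustration of the match-counting variant on these slides (1 for a match, 0 otherwise, no gap penalty), filled in from the bottom-right corner; the function name `nw_match_matrix` is an illustrative choice.

```python
def nw_match_matrix(row_seq, col_seq):
    """Fill the match matrix as described in the slides.

    Start with 1 where residues match and 0 elsewhere, then, working
    from the bottom-right corner, add to each cell the largest value
    found in the next column below it or in the next row to its right.
    """
    n, m = len(row_seq), len(col_seq)
    M = [[1 if row_seq[i] == col_seq[j] else 0 for j in range(m)]
         for i in range(n)]
    # The last row and last column keep their 0/1 values.
    for i in range(n - 2, -1, -1):
        for j in range(m - 2, -1, -1):
            best_in_next_col = max(M[k][j + 1] for k in range(i + 1, n))
            best_in_next_row = max(M[i + 1][c] for c in range(j + 1, m))
            M[i][j] += max(best_in_next_col, best_in_next_row)
    return M


M = nw_match_matrix("ADLGRTQNCDRYYQ", "ADLGAVFALCDRYFQ")
print(M[0][0])  # 9, the top-left value of the completed example matrix
for row in M:
    print(" ".join(f"{value:2d}" for value in row))
```

Each cell ends up holding the largest number of identical pairs on any path that starts at that cell, so the top-left region gives the best overall count.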
Alignment models: global & local
• Local alignment: only parts of the sequences need to be aligned with each other.
• Input: two sequences S and T
Output: the maximum similarity between a subsequence of S and a subsequence of T
Example (input -> output):
ACCTGC  ->  ACCTGC
TACGTA  ->  TACGTA
Smith-Waterman Algorithm (1981)
• Similar to the global alignment algorithm.
• Difference: the best local alignment score is the greatest value
in the table.
• A negative score/weight must be given to mismatches.
• Zero must be the minimum score recorded in the matrix.
• M_{i,j} = M_{i,j} + \max(M_{k,j-1}, M_{i-1,n})
where k is any integer smaller than i and n is any integer smaller than j;
go left to right, top to bottom in the matrix.
Example
Here a penalty of -0.5 is applied for each mismatch.
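The properties listed for Smith-Waterman (negative mismatch score, zero as the floor, best score anywhere in the table) can be seen in a short sketch. Note the assumptions: this uses the common dynamic-programming formulation with a linear gap penalty rather than the formula on the slide; the mismatch penalty of -0.5 comes from the slide, while the match score of +1 and gap penalty of -1 are illustrative choices.

```python
def smith_waterman(s, t, match=1.0, mismatch=-0.5, gap=-1.0):
    """Return (best score, aligned part of s, aligned part of t)."""
    n, m = len(s), len(t)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):          # top to bottom
        for j in range(1, m + 1):      # left to right
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Zero is the minimum score recorded in the matrix.
            H[i][j] = max(0.0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the greatest value until a zero is reached.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))  # (4.0, 'ACGT', 'ACGT')
```

Only the shared core ACGT is reported; the dissimilar flanking residues are left out, which is the point of a local alignment.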
Common programs: FastA and BLAST
FastA (1985)
compares a query sequence against a database of sequences
BLAST (Basic Local Alignment Search Tool) (1990)
an improvement on FastA; a set of similarity search programs for
protein or DNA sequences (BLASTN, BLASTP, ...), developed at
NCBI.
• By seeking local alignments, BLAST is able to detect relationships among
sequences that share only isolated regions of similarity.
• Structural similarity (usually functional similarity)
FASTA format
Multiple alignment
• Definition: a multiple alignment of sequences S1, S2, ..., Sn is
a series of sequences S1’, S2’, ..., Sn’ such that
• all Si’ sequences are of equal length;
• Si’ is an extension of Si obtained by inserting gaps.
Motivation:
• Find diagnostic patterns to characterize protein families.
• Detect/demonstrate homology between a new sequence and
existing families of proteins.
• Help predict the secondary and tertiary structures of new
sequences.
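The two conditions in the definition of a multiple alignment translate directly into a check. A minimal sketch, assuming "-" is the gap character; the function name is illustrative.

```python
def is_valid_multiple_alignment(originals, aligned, gap="-"):
    """Check the two conditions from the definition.

    1. All aligned sequences Si' have equal length.
    2. Each Si' is Si with gaps inserted, i.e. removing the gaps
       from Si' gives back Si.
    """
    if len(originals) != len(aligned):
        return False
    if len({len(row) for row in aligned}) > 1:
        return False
    return all(row.replace(gap, "") == seq
               for seq, row in zip(originals, aligned))


# The global alignment example from earlier in the slides:
print(is_valid_multiple_alignment(["ACCTGC", "TACGTG"],
                                  ["-ACC-TGC", "TAC-GTG-"]))  # True
```

The same check works for any number of sequences, not only for pairs.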
Multiple alignment method: the ClustalW algorithm
• Based on the idea that similar sequences are usually evolutionarily
related.
• First, all pairs of sequences are aligned separately in
order to calculate a similarity score for each pair.
• Sequences are grouped according to these similarity scores.
• Groups are then aligned with each other to obtain new similarity scores.
• Grouping is repeated until the final result is obtained.
• Sequences with high similarity are aligned first, followed by
sequences with lower similarity.
• A guide tree is calculated: similar sequences are neighbors in the
tree, and distant sequences are far apart in the tree.
• The sequences are progressively aligned according to the guide tree.
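The first stage of this procedure (score every pair, then start with the most similar sequences) can be sketched as follows. This is only an illustration under stated assumptions: `global_score` uses the common dynamic-programming form of global alignment with the +1 / -1 / -1 scheme from the earlier scoring slide, and the sequence names and data are made up. Real ClustalW uses more elaborate scoring and builds a guide tree, neither of which is reproduced here.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                           prev[j] + gap,       # gap in b
                           cur[j - 1] + gap))   # gap in a
        prev = cur
    return prev[-1]


def alignment_order(seqs):
    """Score all pairs and list them from most to least similar."""
    pairs = [(global_score(seqs[x], seqs[y]), x, y)
             for x, y in combinations(sorted(seqs), 2)]
    return sorted(pairs, key=lambda p: -p[0])


# Hypothetical sequences, for illustration only.
seqs = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
for score, x, y in alignment_order(seqs):
    print(x, y, score)
```

The pair at the top of the list would be aligned first; less similar sequences are added afterwards.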