MBG305_LS_02

MBG305
Applied Bioinformatics
Week 2 (02.10.2010)
Jens Allmer
Quiz
• 10 min
Databases
• Bioinformatics needs data
– Where is this data?
– Is there any organization?
– How should I cite data?
Where is the data?
• Many targeted resources exist
– miRBase http://www.mirbase.org/
• Contains microRNAs
– PDB http://www.rcsb.org/pdb/home/home.do
• Contains protein structures
– PeptideAtlas http://www.peptideatlas.org/
• Contains mass spectrometric measurements
– KEGG http://www.genome.jp/kegg/
• Contains regulatory and biochemical pathways
– PubMed http://www.ncbi.nlm.nih.gov/pubmed/
• Contains indexed journals
– ...
Where is the data?
• Sequence Databases
– EBI (www.ebi.ac.uk/)
– Ensembl (www.ensembl.org)
– GenBank (www.ncbi.nlm.nih.gov/Genbank)
– SwissProt (www.tigr.org/tdb)
– ...
• Make these pages bookmarks
– Are your bookmarks where you are?
• Try: http://www.delicious.com
– Or bring your own browser
• http://portableapps.com/apps/internet/google_chrome_portable
How is Data Organized?
• Flat Text Files
– FASTA Format
• Structured Text Files
– ASN.1, XML-based formats
• Databases
– Structure
– Index
– Users
– Details in MBG403
Flat Text Files
• FASTA Format (Pearson and Lipman, 1988)
– Allows multiple sequences per file
– Requires identifiers for each sequence
– Some special characters and formatting rules
• > introduces the definition line (sequence identifier)
• 80 characters per sequence line
• Only supported characters (IUPAC)
– http://www.bioinformatics.org/sms/iupac.html
• Example
>gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 Opium poppy root cDNA library Papaver somniferum cDNA, mRNA sequence
GAACGAAGGGAGAGAACGAAAAAGAAGGAGAGAATGTGTGAGGGTCGGTTTCATACGTTTGGTGTTAACTGAGTTATGCA
ATCTGCAAAAGAGGAGAGATTAGATAGAAGATGAGAAGAATTATGACAACCTAGTCAAGTATGGATCATTGCTCTAATTC
...
>gi|189457344|gb|FG613049.1|FG613049 stem_S093_F08.SEQ Opium poppy stem cDNA library Papaver somniferum cDNA, mRNA sequence
CTTTCTCTAGGTTTCTCCGCAATTTTCAAGTGGACGAATCCAAATAGAATTTGCCAAGCTTTTCTTGATTTATCCTACTC
GGTGTAAAAATGGCGACAATAGGAGCTTCCTCAGCTTGCTGCATGATCAGAAGCACACCCCAGAACAGTGGTAAAATTGC
...
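The formatting rules above can be illustrated with a small reader. This is only a minimal sketch, not a full FASTA validator: the function name `parse_fasta` and the two short records are hypothetical, and it assumes each definition line sits on a single line starting with `>`.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each definition line (without '>') to its joined sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]  # '>' introduces the definition line
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


# Hypothetical two-record example (not taken from a real database).
example = """>seq1 hypothetical first record
GAACGAAGGG
AGAGAACG
>seq2 hypothetical second record
CTTTCTCTAGG
"""
records = parse_fasta(example.splitlines())
for header, sequence in records.items():
    print(header.split()[0], len(sequence))
```

Keeping the whole definition line as the key is a simplification; real tools usually split the identifier from the description.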
FASTA Tools
• FASTA Viewer and DNA Translator
– http://www.biolnk.com/
• Some FASTA Tools
– http://bioinformatics.iyte.edu.tr/index.php?n=Softwares.FastaTools
• FASTA Validator/ Converter to CSV file
– http://mbg305.allmer.de/tools/
FASTA Usage
• Most programs that accept sequence input accept
FASTA format
– BLAST (partially)
– FastA (obviously)
– Multiple Sequence Alignment Tools
• Most
– MS-based Database Search Engines
• Some (only database, not queries)
– Most Online Forms
FASTA Definition Line Formats
• http://en.wikipedia.org/wiki/Fasta_format
– GenBank: gi|gi-number|gb|accession|locus
– EMBL Data Library: gi|gi-number|emb|accession|locus
– DDBJ, DNA Database of Japan: gi|gi-number|dbj|accession|locus
– NBRF PIR: pir||entry
– Protein Research Foundation: prf||name
– SWISS-PROT: sp|accession|name
– Brookhaven Protein Data Bank (1): pdb|entry|chain
– Brookhaven Protein Data Bank (2): entry:chain|PDBID|CHAIN|SEQUENCE
– Patents: pat|country|number
– GenInfo Backbone Id: bbs|number
– General database identifier: gnl|database|identifier
– NCBI Reference Sequence: ref|accession|locus
– Local Sequence identifier: lcl|identifier
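As an illustration of the GenBank pattern gi|gi-number|gb|accession|locus, the following sketch splits one definition line into its fields. The function name `parse_genbank_defline` and the returned dictionary keys are hypothetical choices; the definition line is the opium poppy example from the FASTA slide, and other database formats in the list would need their own handling.

```python
def parse_genbank_defline(defline: str) -> dict:
    """Split a 'gi|gi-number|gb|accession|locus description' line."""
    # The identifier ends at the first space; the rest is free text.
    identifier, _, description = defline.lstrip(">").partition(" ")
    fields = identifier.split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("not a gi|...|db|accession|locus definition line")
    return {
        "gi": fields[1],
        "database": fields[2],
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }


defline = (
    ">gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 "
    "Opium poppy root cDNA library Papaver somniferum cDNA, mRNA sequence"
)
parsed = parse_genbank_defline(defline)
print(parsed["accession"], parsed["locus"])
```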
GenBank Flat Text File
• GenBank
– Sample record and explanation:
• http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord
– FAQs
• http://www.ncbi.nlm.nih.gov/books/NBK49541/#NucProtFAQ.Section_A_GenBank_nucleotide
Structured Text Files
• Different ways to structure text files
– ASN.1
– XML
– JSON
– Wait for MBG403 for details
Structured Text Files
• ASN.1 Example
– http://www.ncbi.nlm.nih.gov/nuccore/NC_003622.1?report=asn1&log$=seqview
– http://www.ncbi.nlm.nih.gov/nuccore/NC_003622
• Select Display Settings, then ASN.1
Databases
• Unlike the previous formats, not easily readable
– Special tools and languages are used to add, edit, retrieve, and view data
• Advantages
– Secure
– Stable
– Distributed
– Fast Access
– Huge sizes supported
• http://www.freerepublic.com/focus/f-chat/2508670/posts
• Ever tried to search in 100 TB of text for something?
Scientific Data
Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf
Characteristics of Scientific Data
• Highly Complex
– Images, sequences, time series, ...
– Strong interdependence of data
• In Science
– Outliers are of interest
– Focus of interest changes rapidly
– Data is usually shared
– Data must be secure
• Never change data, only add
• Many viewers, few creators
• Collections
– Large collections must be shared via strong servers
– Small collections (e.g. SwissProt 63MB) can be shared more
easily
– New methodologies (MS, NGS, ...) have expanded size of
databases
Desired Features for Databases
• Efficiency
• Scalability
• Concurrency
• Security
• Integrity
• Stability
• Cross references to other databases
• Universally accessible
• Query Language
• Data mining
• Data Warehouse
How Many Bioinformatics Databases?
Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf
An Abundance of Databases
• Databases and Collections on http://www.hsls.pitt.edu/obrc/
(2012 -> 2014)
– DNA Sequence Databases and Analysis Tools (499) -> 463
– Enzymes and Pathways (281) -> 242
– Gene Mutations, Genetic Variations and Diseases (303) -> 257
– Genomics Databases and Analysis Tools (703) -> 636
– Immunological Databases and Tools (61) -> 49
– Microarray, SAGE, and other Gene Expression (215) -> 166
– Organelle Databases (29) -> 25
– Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and others) (179) -> 147
– Plant Databases (159) -> 146
– Protein Sequence Databases and Analysis Tools (492) -> 408
– Proteomics Resources (74) -> 58
– RNA Databases and Analysis Tools (257) -> 222
– Structure Databases and Analysis Tools (452) -> 384
• Sum: 3704 -> 3203
Data Warehouses
• Are resources like NCBI and EBI databases?
– No, they are larger than what is generally called a database
– They can be called data warehouses
– They consist of many interlinked databases
Need for Improvement
• Anyone can submit data to online resources
• Rigorous data checking is necessary
– Saçar and Allmer (http://journal.imbio.de/index.php?paper_id=215)
– Bağcı and Allmer (http://dx.doi.org/10.1109/HIBIT.2012.6209038)
• Data must be standardized
• Quality of data must be specified
How to Cite Data
• It is rarely necessary to present a sequence in any
writing
• In general it suffices to give
– Accession number of sequence
– Database where sequence is located
• If database is not given try
– Accession Parser (www.biolnk.com)
• In case you have a new sequence
– Generally required to deposit it in a database
– E.g.: http://www.ncbi.nlm.nih.gov/genbank/submit/
– Then cite the assigned accession number(s)
End of Theoretical Part 1
• Mind mapping
• 10 min break
Practical Part 1
Where is the data?
• Turn on your computers and let’s find out
• EBI (www.ebi.ac.uk/)
• Ensembl (www.ensembl.org)
• GenBank (www.ncbi.nlm.nih.gov/Genbank)
• SwissProt (www.tigr.org/tdb)
• Make these pages bookmarks
– Are your bookmarks where you are?
– Try: http://www.delicious.com
Retrieve Data
• You want the DNA sequence of a human hemoglobin
• How do you get it?
• Try to achieve this goal for a few minutes
Ctrl-F
No results
Where have we gone wrong?
Language!
Database!
GenBank
GenBank
• http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
GenBank
• Accession number
– Applies to full record
– X00000
– XX000000
– Never changes
GenBank
• Version
– Identifies a single sequence
– Adds version to accession number format
• X00000.0
– The version (e.g. .0 -> .1) changes if even a single nucleotide in the sequence is changed
– Other versions are referenced
• http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
GenBank
• GenInfo Identifier (GI)
– Any change to the sequence forces a new GI number
– Translations get separate gi numbers
– GI:00000
GenBank
GenBank
• Sequence?
GenBank
• Eukaryotic
Retrieving Sequences By Example
• Basic Local Alignment Search Tool
• BLAST
http://www.ebi.ac.uk/
What did we do?
• We wanted to find one of the human hemoglobins
– The nucleotide sequence in FASTA format
• We wanted to find similar sequences
– BLAST (ncbi)
– FASTA (ebi)
• Who got lost in the jungle of LINKS?
– That is normal
– Bioinformatics is a quickly growing field
– Consolidation not any time soon
End of Practical Part 1
• 15 min break
Theoretical Part 2
• And now for something completely different
– http://en.wikipedia.org/wiki/And_Now_for_Something_Completely_Different
• How can we find sequences?
• Can the algorithm we found last week be used?
Similarity Searching
• Search Algorithms
– BLAST
– FASTA
– ...
• This is at the heart of bioinformatics
• It demands a lot of attention
Similarity Searching
• Exact pattern matching
• Approximate pattern matching
String Matching Math
• Remember the string matching we did last week?
• Today we will look at the math of finding EXACT
matches between queries and databases
• If time allows we will look into substitution matrices
Probability for perfect matches
Query (Q): ATTGCC (length LQ)
Target (T): CGATTGCCCG (length LT)
LQ = length of query (number of nucleotides)
LT = length of target sequence (number of nucleotides)
Element Probability
Probability of finding a nucleotide
Very roughly 0.25
Given the sequence:
ATTTCCGGGGTAGCTAGCTAGTATATTATCGGCGCTAA
What are the probabilities for A, C, G, and T?
Nucleotide   Number   Frequency
A            9        0.24
C            7        0.18
G            10       0.26
T            12       0.32
N (total)    38       1.00
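The counts and frequencies in this table can be reproduced with a few lines of Python. This is a minimal sketch using the sequence from the slide; the variable names are arbitrary.

```python
from collections import Counter

sequence = "ATTTCCGGGGTAGCTAGCTAGTATATTATCGGCGCTAA"

counts = Counter(sequence)  # number of occurrences per nucleotide
total = len(sequence)       # N, the sequence length
frequencies = {base: counts[base] / total for base in "ACGT"}

for base in "ACGT":
    print(f"{base}\t{counts[base]}\t{frequencies[base]:.2f}")
print(f"N\t{total}\t{sum(frequencies.values()):.2f}")
```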
Sequence Probability
$p = P_A \cdot P_C^{2} \cdot P_G \cdot P_T^{2}$
What is p?
p = the probability of randomly generating the sequence
given the frequency and number of its elements (e.g.: PA).
There is no sequential dependency assumed in this model.
What is the probability of generating
AAAAGTTT given the probabilities that we just calculated?
p = 0.24^4 * 0.26 * 0.32^3
≈ 0.003 * 0.260 * 0.033
≈ 0.000026
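The same calculation can be sketched in code. The helper name `sequence_probability` is hypothetical, and the sketch assumes the independence model stated above (no sequential dependency). It uses the unrounded frequencies of the example sequence, so the result may differ slightly from a hand calculation with rounded values.

```python
from collections import Counter
from math import prod


def sequence_probability(query: str, frequencies: dict) -> float:
    """Probability of generating `query` if positions are independent."""
    return prod(frequencies[base] for base in query)


background = "ATTTCCGGGGTAGCTAGCTAGTATATTATCGGCGCTAA"
counts = Counter(background)
frequencies = {base: counts[base] / len(background) for base in "ACGT"}

p = sequence_probability("AAAAGTTT", frequencies)
print(f"p = {p:.6f}")
```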
How Often do we Expect to Find the Query
• The number of matches is restricted by the database size
• How often can we shift Q (Query) against T (Target)?
• This defines the number of possible matching operations
n = LT – LQ +1
Example:
LQ = 6
LT = 10
n = 10 – 6 + 1
n=5
Query: ATTGCC
Target: CGATTGCCCG
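To make the shifting concrete, here is a minimal sketch that slides the query along the target, counts the n = LT – LQ + 1 comparison positions, and reports how many of them match exactly. The function name `count_exact_matches` is a hypothetical choice.

```python
from typing import Tuple


def count_exact_matches(query: str, target: str) -> Tuple[int, int]:
    """Return (number of shift positions, number of exact matches)."""
    n = max(len(target) - len(query) + 1, 0)  # n = LT - LQ + 1
    matches = sum(
        1 for i in range(n) if target[i:i + len(query)] == query
    )
    return n, matches


n, matches = count_exact_matches("ATTGCC", "CGATTGCCCG")
print(f"positions compared: {n}, exact matches: {matches}")
```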
The probability distribution of
the number of matches
is approximately binomial:
(Figure: binomial distributions for n = 20 with p = 0.1, p = 0.5, and p = 0.8)
Definition: q = 1 - p
$p(x) = \frac{n!}{x!\,(n - x)!}\, p^{x} q^{n-x}$
http://en.wikipedia.org/wiki/Binomial_distribution
What are p, q, and n?
p: probability of being true
q: probability of being false
n: number of trials
x: number of successes
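A minimal sketch of the binomial formula follows, evaluated for n = 20 and the three values of p named on the slide. The function name `binomial_pmf` is hypothetical; `math.comb` computes n! / (x!(n – x)!) without building the large factorials explicitly.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    q = 1.0 - p
    return comb(n, x) * p**x * q ** (n - x)


n = 20
for p in (0.1, 0.5, 0.8):
    # The x with the highest probability is the peak of the distribution.
    most_likely = max(range(n + 1), key=lambda x: binomial_pmf(x, n, p))
    print(f"p = {p}: most likely number of successes = {most_likely}")
```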
Problem
• Factorials lead to overflows in computer programs
• With large n and n*p < 1, the distribution can be approximated by a Poisson distribution
– Much easier to calculate for a computer
Poisson vs. Binomial Distributions
Poisson
$p(x) = e^{-\lambda} \frac{\lambda^{x}}{x!}$
λ = n*p
Binomial
$p(x) = \frac{n!}{x!\,(n - x)!}\, p^{x} q^{n-x}$
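The two formulas can be compared numerically. In this sketch the values n = 1000 and p = 0.0005 are hypothetical, chosen only so that n is large and n*p < 1; the function names are likewise illustrative.

```python
from math import comb, exp, factorial


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Exact binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)


def poisson_pmf(x: int, lam: float) -> float:
    """Poisson probability of x events with mean lam = n * p."""
    return exp(-lam) * lam**x / factorial(x)


n, p = 1000, 0.0005  # hypothetical values with large n and n*p < 1
lam = n * p
for x in range(4):
    b = binomial_pmf(x, n, p)
    q = poisson_pmf(x, lam)
    print(f"x = {x}: binomial = {b:.5f}, Poisson = {q:.5f}")
```

The Poisson form only needs the factorial of x, which stays small when few matches are expected, whereas the binomial form involves n, which can be as large as the database.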
Partial matches
• So far we considered matching the complete query
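• Partial match of length L, with $L \le L_Q \land L \le L_T$
$p = 2^{-2L}$ (each of the L positions matches with probability 1/4 = 2^-2)
$m = L_Q - L + 1$
$n = L_D - L + 1$ (L_D: length of the database, the target length L_T above)
$E = m\,n\,2^{-2L}$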
BLAST E-Value
• $E = m\,n\,2^{-S}$
• With score S = 2L: $E = m\,n\,2^{-2L}$
• Describes the number of expected matches which are
equally good or better
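The E-value formula can be sketched as a small function. The name `e_value` and the input numbers are hypothetical; the sketch takes m and n as already computed and only evaluates E = m n 2^-S, with S = 2L for a perfect match of length L as on the slide.

```python
def e_value(m: int, n: int, score: float) -> float:
    """Expected number of chance matches: E = m * n * 2**(-S)."""
    return m * n * 2.0 ** (-score)


# Hypothetical inputs: m = 100, n = 1,000,000 and a match of length
# L = 20, which corresponds to S = 2L = 40.
match_length = 20
expected = e_value(100, 1_000_000, 2 * match_length)
print(f"E = {expected:.2e}")
```

An E-value far below 1 means a match this good is unlikely to occur by chance in a database of that size.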
End of Theoretical Part 2
• Mind mapping
• 10 min break
Practical Part 2
Practice Poisson vs Binomial
Q: ATG
D: CGATTGCCCG
Calculate p(0), p(1) and p(3)
Note: at least one match = 1 – p(0)
$E = m\,n\,2^{-2L}$
Assuming a database size of 10 000 000 and a query length of 10, calculate the number of matches that would happen by chance.
Practical Concerns
• Human genome 3 billion nucleotides
• Dogma: 14 nucleotides are enough to uniquely identify a
gene
• Verify this using Poisson distribution
Poisson
$p(x) = e^{-\lambda} \frac{\lambda^{x}}{x!}$
λ = n*p
BLAST Interface
• Setting a cutoff E-value
– Consider the calculation you just did
– If someone were to set the cutoff to 0.01 with the same assumptions
• How many results would you expect?
• What would you advise the user?
• Topic will be revisited later
Amino Acid Sequences
• What changes when instead of nucleotide sequences we
were to use amino acid sequences?
Practise this
• Determine how long a query must be so that it can uniquely identify a gene in the human genome
– p < 0.05
Assignments
• Go to GenBank and inspect all parameters
– Find their meaning (even if you think you know what it means)
– Sometimes definitions are surprising
• Collect information about parameters that pose problems
to you
– Submit this information to us so that we can discuss in the
following week
Homework 1
• Make a table showing the E-value against LQ(10..100)
with LD = 3 000 000 000
• Use Excel to do this
• Send the results to [email protected]