Document 256145

Bioinformatics
Lecture 1
• What is bioinformatics?
• Why bioinformatics?
• The major molecular biology facts
• Brief history of bioinformatics
• Typical problems of bioinformatics:
collection and retrieval of data
alignment and similarity search
prediction and classification
• Expectations and the level of requirements
What is Bioinformatics?
[Diagram labels: Computer Science; Mathematics and Statistics; Biology]
What is bioinformatics?
A working definition is that of the House of
Representatives Standing Committee on Primary
Industries and Regional Services Inquiry: "All aspects of gathering, storing, handling,
analyzing, interpreting and spreading vast amounts
of biological information in databases. The
information involved includes gene sequences,
biological activity/function, pharmacological activity,
biological structure, molecular structure, protein-protein
interactions, and gene expression.
Bioinformatics uses powerful computers and
statistical techniques to accomplish research
objectives, for example, to discover a new
pharmaceutical or herbicide."
Areas of current and future development of
bioinformatics
• Molecular biology and genetics
• Phylogenetic and evolutionary sciences
• Different aspects of biotechnology including
pharmaceutical and microbiological industries
• Medicine
• Agriculture
• Eco-management
Why bioinformatics?
• Exponential growth of investments
• Constant deficit of trained professionals
• Diversification of bioinformatics applications
• Need for different types of bioinformaticians
Central Dogma of Molecular Biology
GENE (DNA) -> transcription -> MESSENGER (RNA) -> translation -> PROTEIN
Replication: DNA -> DNA
Reverse transcription: RNA -> DNA
GENOTYPE (e.g. Aa) -> PHENOTYPE (pink), TRAIT
Example DNA: ATGCAAGTCCACTGTATTCCA
Example RNA: UACGUUCAGGUGACAUAAGGG
DNA
Double helix
5’ A C G T C A T G 3’
3’ T G C A G T A C 5’ template

Symbol    Meaning         Explanation
G         G               Guanine
A         A               Adenine
T         T               Thymine
C         C               Cytosine
R         A or G          puRine
Y         C or T          pYrimidine
N         A, C, G or T    Any base

RNA
5’ A C G U C A U G 3’

Symbol    Meaning         Explanation
U         U               Uracil
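The strand pairing and the symbol table above can be tried out in a few lines of Python. This is only a minimal sketch: the function names (complement, reverse_complement, to_rna, matches) are made up for illustration, and only the symbols listed in the table are handled.

```python
# Symbols from the table above, mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # puRine
    "Y": "CT",    # pYrimidine
    "N": "ACGT",  # any base
}

# Watson-Crick pairing in the DNA double helix.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(dna):
    """Base-by-base complement, written in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in dna)


def reverse_complement(dna):
    """The opposite strand, read in its own 5' to 3' direction."""
    return complement(dna)[::-1]


def to_rna(dna):
    """Write a DNA sequence in the RNA alphabet (T becomes U)."""
    return dna.replace("T", "U")


def matches(pattern, dna):
    """True if dna fits a pattern of the same length written with the symbols above."""
    return len(pattern) == len(dna) and all(
        base in IUPAC[symbol] for symbol, base in zip(pattern, dna)
    )


top = "ACGTCATG"                 # the 5' to 3' strand from the slide
print(complement(top))           # TGCAGTAC
print(reverse_complement(top))   # CATGACGT
print(to_rna(top))               # ACGUCAUG
print(matches("RYN", "GCA"))     # True
```

The first and third printed lines reproduce the lower DNA strand and the RNA sequence shown on the slide.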
Genetic Code
1. Amino acids are coded by codons – triplets of
nucleotides, e.g. |ACG|TAT|…
2. There are 4^3 = 64 codons for ~20 amino acids, so the
code is degenerate
3. Codons do not overlap
4. Deletions or insertions of one or a few nucleotides (not
a multiple of 3) usually destroy a message by shifting
the reading frame
5. Three specific codons (stop codons) do not code for any
amino acid and are always located at the very end of
the protein-coding part of a gene
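The five points above can be checked with a short Python sketch. It assumes the standard genetic code, written as a 64-letter string in T, C, A, G order, and uses the example DNA sequence from the Central Dogma slide with a TAA stop codon appended for illustration. The function name translate is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Read non-overlapping triplets from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(len(CODON_TABLE))            # 64 codons
print(len(set(AMINO) - {"*"}))     # 20 amino acids, so the code is degenerate
print(AMINO.count("*"))            # 3 stop codons

gene = "ATGCAAGTCCACTGTATTCCA" + "TAA"   # slide example plus an assumed stop
print(translate(gene))                   # MQVHCIP

# Deleting one nucleotide shifts the reading frame and changes every
# downstream amino acid; deleting three removes one amino acid only.
print(translate(gene[:3] + gene[4:]))    # MKSTVFH
print(translate(gene[:3] + gene[6:]))    # MVHCIP
```

The one-nucleotide deletion also loses the stop codon, which is why a frameshift usually destroys the message rather than just altering it.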
[Figure: The genetic code]
[Figure: The 20 amino acids common in living organisms]
PROTEINS
Green Fluorescent Protein (GFP)
1 mcgkkfelki dnvrfvghpt llqpphtiqa sktdpspkre lptmilfsvv falranadas
61 viscmhnlsr riaialqhee rrcqyltrea klmlamqdev ttiidsdgsp qspfrqilpk
121 cklardlkea ydslcttgvv rlhinnwlev sfclphkihr vggkhiplea lerslkairp
Genomic Hierarchy in Eukaryotes
Genome nuclear (1)
Chromosomes (23x2)
DNA molecules (23x2)
Genes (~30,000); only a small fraction of genome
Nucleotides (~3x10^9)
Eukaryotic genes are complex
Promoter | Exon 1 | Intron 1 | Exon 2 | Intron 2 | Exon 3 | Intron 3 | Exon 4
Labels: Start codon (Exon 1), Stop codon (Exon 4), Protein coding regions (exons)
Brief history of bioinformatics: Databases
• The first biological database - Protein Information Resource
was established in 1972 by Margaret Dayhoff
• Dayhoff and co-workers organized the proteins into families and
superfamilies based on degree of sequence similarity
• Idea of sequence alignment was introduced as well as special
tables that reflected the frequency of changes observed in the
sequences of a group of closely related proteins
• Currently there are several huge Protein Banks: SwissProt, PIR
International, etc.
• The first DNA database was established in 1979. Currently there
are several powerful databases: GenBank, EMBL, DDBJ, etc.
Brief history of bioinformatics:
evolutionary reconstructions
Brief history of bioinformatics: other
important steps
• Development of sequence retrieval methods (1970-80s)
• Development of principles of sequence alignment (1980s)
• Prediction of RNA secondary structure (1980s)
• Prediction of protein secondary structure and 3D (1980-90s)
• The FASTA and BLAST methods for DB search (1980-90s)
• Prediction of genes (1990s)
• Studies of complete genome sequences (late 1990s–2000s)
Collection and retrieval of data.
Alignment methods.
• Sequencing (DNA, proteins)
• Submission of sequences to the databases
• Computer storage of sequences
• Development of sequence formats
• Conversion of one sequence format to another
• Development of retrieval and alignment methods
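As an illustration of sequence formats and conversion between them, the sketch below reads records in the widely used FASTA format (a ">" header line followed by sequence lines) and rewrites them as one tab-separated line per record. The function names and the two example records are made up for illustration; real projects would normally rely on an established library instead.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_tab_delimited(records):
    """One record per line: header, a tab, then the full sequence."""
    return "\n".join(f"{header}\t{seq}" for header, seq in records)


fasta = """>seq1 example DNA
ATGCAAGTCC
ACTGTATTCCA
>seq2 example RNA
UACGUUCAGG
"""

records = list(parse_fasta(fasta))
print(to_tab_delimited(records))
```

Sequence lines that were wrapped in the FASTA input are joined, so each output line carries the complete sequence.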
Prediction, reconstruction and
classification
• Prediction of secondary and 3D structure of RNA and proteins
• Gene prediction in prokaryotes and eukaryotes
• Prediction of promoters and other functional sites
• Reconstruction of phylogeny
• Genome analysis
• Classification of proteins and genes
Prediction of RNA secondary structure:
an example
A. Single-stranded RNA, 5’ to 3’
B. Stem and loop or hairpin loop, 5’ to 3’
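The hairpin idea can be sketched in Python: fold the 5’ end back onto the 3’ end and count how many consecutive base pairs form before the loop. This is a toy check, not a real structure-prediction method. It assumes only A-U and G-C pairs and a minimum loop of three unpaired bases, and the function names are hypothetical.

```python
# Allowed base pairs in this sketch (Watson-Crick only).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def hairpin_stem_length(rna, min_loop=3):
    """Count base pairs formed when the 5' end folds back onto the 3' end."""
    i, j = 0, len(rna) - 1
    stem = 0
    # Pair inward from both ends while at least min_loop bases stay unpaired.
    while j - i - 1 >= min_loop and (rna[i], rna[j]) in PAIRS:
        stem += 1
        i += 1
        j -= 1
    return stem


def dot_bracket(rna, min_loop=3):
    """Describe the hairpin: '(' and ')' are paired bases, '.' is unpaired."""
    n = hairpin_stem_length(rna, min_loop)
    return "(" * n + "." * (len(rna) - 2 * n) + ")" * n


print(dot_bracket("GCGCUUUUGCGC"))   # ((((....))))
print(dot_bracket("AAAAAAAA"))       # ........
```

The first sequence folds into a four-pair stem with a four-base loop (case B above), while the second cannot pair at all and stays single stranded (case A).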
Expectations of students’ performance
• Basic understanding of general principles of molecular biology
• Some mathematical and computer science background
• Focus on using computational methods and understanding
general ideas of analysis used in bioinformatics
• Formal description of algorithms and complex methodology
will not be the core elements of this unit
• The core requirement is an understanding of the foundations of
bioinformatics and a “hands-on” approach