Biology 4900 - Clayton State University
Biology 4900
Biocomputing
Chapter 1
Introduction
Goals of the course
To provide an introduction to bioinformatics with a focus
on the National Center for Biotechnology Information
(NCBI) and European Bioinformatics Institute (EBI)
To focus on the analysis of DNA, RNA and proteins
To introduce you to the analysis of genomes
To combine theory and practice to help you solve
research problems
Websites to Bookmark
• Make a “Biocomputing” favorites folder in your normal internet browser
• Add these sites to the Biocomputing folder
– Galileo: http://www.galileo.usg.edu/scholar/clayton/subjects/
– Interlibrary Loan:
http://adminservices.clayton.edu/library/depts/articlerequestform.aspx
– Specialized search engines & tools
• NCBI (Entrez/Pubmed/etc): http://www.ncbi.nlm.nih.gov/
• Entrez: http://www.ncbi.nlm.nih.gov/Entrez/
• Tutorial for Entrez:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/entrez_tutorial_BIB.pdf
• Swiss Institute of Bioinformatics portal : http://www.expasy.org/
More Websites to Bookmark
• Biology WorkBench http://workbench.sdsc.edu/
• Pymol http://www.pymol.org/
• KEGG: http://www.genome.ad.jp/kegg/
• Swiss-Prot: http://us.expasy.org/sprot/
• PIR: http://pir.georgetown.edu/pirwww/search/textpsd.shtml
• GenBANK: http://www.ncbi.nlm.nih.gov/Genbank/index.html
• EMBL: http://www.ebi.ac.uk/embl/index.html
• DDBJ: http://www.ddbj.nig.ac.jp/
• Protein Data Bank: http://www.rcsb.org/pdb/
• ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html
• GENSCAN: http://genes.mit.edu/GENSCAN.html
Software to Download/Install
• You should have the following:
– Clayton State University e-mail account (CHECK
IT REGULARLY)
– Microsoft Office (from the Hub Software Center)
– Adobe Acrobat Reader: www.acrobat.com
– Biology WorkBench http://workbench.sdsc.edu/
– Pymol http://www.pymol.org/
Formal Definitions
• Bioinformatics: Research, development or application of
computational tools and approaches for expanding the use of
biological, medical, behavioral or health data, including those
to acquire, store, organize, analyze or visualize such data.
• Computational Biology: The development and application of
data-analytical and theoretical methods, mathematical modeling
and computational simulation techniques to the study of
biological, behavioral and social systems.
• Genomics: A discipline in genetics that applies recombinant
DNA, DNA sequencing methods, and bioinformatics to
sequence, assemble, and analyze the function and structure
of genomes (the complete set of DNA within a single cell of an
organism).
Pevsner J, Bioinformatics and Functional Genomics, 2nd Edition, 2009
What is biocomputing?
Interface of biology, biochemistry and computers.
Analysis of proteins, genes and genomes using
computer algorithms and computer databases.
Genomics is the analysis of genomes. The tools of
bioinformatics are used to make sense of the billions of
base pairs of DNA that are sequenced by genomics
projects.
Application of computer algorithms and databases to
store and analyze huge quantities of biological and
biochemical data.
Why we need to use this computer stuff…
Haploid human genome
(23 chromosomes)
Contains 20,000–30,000
distinct genes.
Is ~ 3.2 billion bp in length
Represented by ~800 MB of
data.
http://www.214bio.com/BOOK/ch_11_genes.html; http://en.wikipedia.org/wiki/Human_genome
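The ~800 MB figure follows from a simple encoding argument: with four possible bases, each base pair can be stored in 2 bits. A minimal sketch of that arithmetic, assuming 2 bits per base and decimal megabytes (the function name is illustrative and not part of the course material):

```python
def genome_size_megabytes(base_pairs: int, bits_per_base: int = 2) -> float:
    """Storage needed for a sequence, assuming a fixed-width encoding per base."""
    total_bits = base_pairs * bits_per_base
    total_bytes = total_bits / 8        # 8 bits per byte
    return total_bytes / 1_000_000      # decimal megabytes (10^6 bytes)

haploid_bp = 3_200_000_000              # ~3.2 billion bp, from the slide
size_mb = genome_size_megabytes(haploid_bp)
print(f"{size_mb:.0f} MB")              # 800 MB
```

Storing each base as a one-byte text character instead (bits_per_base=8) would take about four times as much space.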
Many roles of biocomputing
Allows us to ask a number of diverse questions, and use known data to provide full or
partial answers to those questions
Explore evolutionary origins of genes/proteins and determine phylogeny
(MSA and construction of phylogenetic trees).
Predict gene locations (ORF Finder, pattern searching)
Predict gene product function (Blast or FastA searches)
Predict protein structure and function (Protein Explorer)
Identify genes that are expressed before the onset of cancer through
genome sequencing
Identify drugs that can be used to treat specific diseases
Determine who was responsible for publishing information, data,
results, related studies, etc., through literature searches (e.g., PubMed).
Why should we care?
• Locate mutations responsible for genetic
diseases
– Aids in the treatment and diagnosis of
those diseases
– Pharmacogenomics (human genetic
variability in relation to drug action)
– Targeted drugs and therapies (e.g.,
design receptor targeting moieties)
• Discover and exploit new proteins
– Environmental clean-up (e.g.,
enzymatic bioremediation,
Chakrabarty’s oil-eating microbes)
– Antibiotics and other
chemotherapeutic agents
– Useful products
Where did these data come from?
• History
– First scientific journal published in France in the 1600s.
– From the discovery of DNA in the 1860s to our modern understanding of
the genetic code, protein synthesis, etc.
– 1981 IBM releases first PC
– 1996 First release of PubMed
• Computers have made a dramatic impact in these areas
– It would be impossible to analyze data on a large scale without
computer databases to organize information, and computer
programs to facilitate inquiries
Managing Data: The Database
Database: Organized collection of data
Relational model: Collection of tables storing
different information, but linked with a common “key”.
Database Management System (DBMS): System to
control creation, use and maintenance of database.
Accessible via query languages (ex. SQL)
Database System: DBMS and database combined
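The relational model can be sketched with Python's built-in sqlite3 module, which acts as a small DBMS queried through SQL. The table names, column names and rows below are hypothetical placeholders chosen for illustration; they are not the actual PDB schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")      # throwaway in-memory database
conn.executescript("""
    CREATE TABLE structure (
        pdb_id TEXT PRIMARY KEY,        -- the common key
        note   TEXT
    );
    CREATE TABLE chain (
        pdb_id     TEXT REFERENCES structure(pdb_id),
        chain_id   TEXT,
        n_residues INTEGER
    );
""")

# Placeholder rows, for illustration only
conn.executemany("INSERT INTO structure VALUES (?, ?)",
                 [("1EXR", "illustrative row"), ("1N0Y", "illustrative row")])
conn.executemany("INSERT INTO chain VALUES (?, ?, ?)",
                 [("1EXR", "A", 146), ("1N0Y", "A", 148)])

# The query language (SQL) links the two tables through the shared key
rows = conn.execute("""
    SELECT s.pdb_id, c.chain_id, c.n_residues
    FROM structure AS s
    JOIN chain AS c ON c.pdb_id = s.pdb_id
    ORDER BY s.pdb_id
""").fetchall()
print(rows)
conn.close()
```

Each table stores a different kind of information, and the JOIN on pdb_id is what makes the collection relational.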
Database Example
Example: PDB data
http://www.qbyv.com/en/network_capabilities; http://www.museumsandtheweb.com/mw2001/papers/stuer/stuer.html
How can we exploit the available data?
• Development of algorithms or databases to:
– Compare sequences (DNA, RNA, proteins)
– Predict structure
• secondary structure
• homology modelling, threading
• ab initio 3D prediction
– Analyze 3D structure
• structure comparison/ alignment
• prediction of function from structure
• molecular mechanics/ molecular dynamics
• prediction of molecular interactions, docking
– Perform energy minimization calculations
– Predict useful mutations for protein engineering
– Statistical Analyses
Extract and analyze meaningful information that can be applied toward
some end
Three perspectives on bioinformatics
The cell
The organism
The tree of life
The Cell
Central Dogma of Molecular Biology
DNA → RNA (transcription) → protein (translation in ribosome) → phenotype
[Figure labels: CELL, ORGANISM]
Phenotype: An organism's traits
Morphology, development, biochemistry, physiology, phenology (biological
cycles), behavior (from both genes and environment)
Central Dogma of Molecular Biology
DNA → RNA → protein → phenotype
Central Dogma of Genomics
genome → transcriptome → proteome → phenotype
The “ome”: a collection of specified units
DNA is a polymer of deoxyribonucleotides
RNA is a polymer of ribonucleotides
Protein is a polymer of amino acids
Polymers
Because these are all polymers, or sequences of repeating units, we can devise
algorithms to study these sequences for trends or to compare the sequences
Nucleotides
http://en.wikipedia.org/wiki/File:RNA-comparedto-DNA_thymineAndUracilCorrected.png
Nucleic Acids
Base pairing: A-T, G-C
H-bonds: ~7 kJ/mole
http://en.wikipedia.org/wiki/File:RNA-comparedto-DNA_thymineAndUracilCorrected.png; http://mcat-review.org/molecular-biology-dna.php
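Because A pairs with T and G pairs with C, the sequence of one DNA strand determines the other. A minimal sketch of computing the reverse complement (the function name and the short example sequence are illustrative):

```python
# Watson-Crick pairing rules for DNA: A-T and G-C
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("ATCTTCAGTGTT"))   # AACACTGAAGAT
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.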
Structures of Amino Acids
• Proteins and polypeptides are biochemical compounds
consisting of amino acids
– Chains of amino acids bonded together by peptide bonds between the
carboxyl and amino groups of adjacent amino acid residues
• Proteins
– Longer and more complex than polypeptides
– Typically folded into a globular or fibrous form
– Structure facilitates a biological function
Peptide linkages
[Figure: amino acids (H3N+, R side chain, COO-) joined through C(=O)-NH peptide bonds into a polypeptide, and then a protein]
Proteins have different levels of structure
• Primary (1°): Sequence of amino acids
– Determines 3D structure
• Secondary (2°): H-bonding interactions
between AA residues begin to produce
regular, identifiable structures
– Alpha (α) helices
– Beta (β) strands
– Random coil
• Tertiary (3°): Overall structure of single
protein in 3 dimensions
• Quaternary (4°): Assemblies of multiple
polypeptides and/or proteins
http://protein-pdb.com/2011/10/04/primary-protein-structure/
Amino Acid Codes
Name            3-Letter  1-Letter
Alanine         Ala       A
Arginine        Arg       R
Asparagine      Asn       N
Aspartic acid   Asp       D
Cysteine        Cys       C
Glutamic acid   Glu       E
Glutamine       Gln       Q
Glycine         Gly       G
Histidine       His       H
Isoleucine      Ile       I
Leucine         Leu       L
Lysine          Lys       K
Methionine      Met       M
Phenylalanine   Phe       F
Proline         Pro       P
Serine          Ser       S
Threonine       Thr       T
Tryptophan      Trp       W
Tyrosine        Tyr       Y
Valine          Val       V
Know these 1 letter AA codes, or you will know what it means to be roasted in
the depths of the Slor…
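The 3-letter and 1-letter codes map one-to-one, so converting between them is a dictionary lookup. A small sketch for practicing the codes (the function name and the hyphen-separated input format are assumptions made for this example):

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def three_to_one(peptide: str, sep: str = "-") -> str:
    """Convert e.g. 'Ala-Glu-Gln' to 'AEQ'; raises KeyError on unknown codes."""
    return "".join(THREE_TO_ONE[res.capitalize()] for res in peptide.split(sep))

print(three_to_one("Ala-Glu-Gln-Leu-Thr"))   # AEQLT
```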
Sequence Analyses
DNA Sequence (Genomics)
ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC
||||||||||||||||||||||||||| |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC
Protein Sequence (Proteomics)
AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
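The row of | characters between two aligned sequences marks the positions where they agree. A minimal sketch that rebuilds that row for the DNA pair on this slide, assuming the sequences are already aligned and of equal length (the function name is illustrative):

```python
def match_line(top: str, bottom: str) -> str:
    """Return '|' under identical positions and ' ' elsewhere."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return "".join("|" if a == b else " " for a, b in zip(top, bottom))

top    = "ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC"
bottom = "ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC"
line = match_line(top, bottom)

print(top)
print(line)
print(bottom)
print(line.count("|"), "of", len(top), "positions match")
```

The single blank in the match row falls under the "." in the top sequence, the one position where the two sequences differ.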
Sequence Analysis Using ClustalW
[Figure: workflow. Input: 2 or more sequences for analysis. ClustalW params: default or custom for different scoring matrices, gap penalties, etc. Output: alignment (below), Phylogram, Cladogram.]

1exr_A  -EQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 59
1N0Y_A  AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 60
3cln_   ----TEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 56
           :************:*******************************************

1exr_A  GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE 119
1N0Y_A  GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE 120
3cln_   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE 116
        *********::******: *****: ***:***:**** *******************:*

1exr_A  VDEMIREADIDGDGHINYEEFVRMMVS- 146
1N0Y_A  VDEMIREADIDGDGHINYEEFVRMMVSK 148
3cln_   VDEMIREANIDGDGQVNYEEFVQMMTA- 143
        ********:*****::******:**.:
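One common way to summarize an alignment like this is pairwise percent identity. The sketch below applies it to the last alignment block for 1exr_A and 3cln_. Conventions for handling gaps vary; this version assumes that columns with a gap in either sequence are skipped, and the function name is illustrative.

```python
def percent_identity(a: str, b: str, gap: str = "-"):
    """Return (identical, compared, percent) for two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue                    # skip columns containing a gap
        compared += 1
        if x == y:
            identical += 1
    return identical, compared, 100.0 * identical / compared

seq_1exr = "VDEMIREADIDGDGHINYEEFVRMMVS-"
seq_3cln = "VDEMIREANIDGDGQVNYEEFVQMMTA-"
ident, comp, pct = percent_identity(seq_1exr, seq_3cln)
print(f"{ident}/{comp} identical ({pct:.1f}%)")   # 21/27 identical (77.8%)
```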
Protein Secondary Structure: PDBSum (EMBL-EBI)
• http://www.ebi.ac.uk/pdbsum/
• Either enter a PDB file or load a new/existing sequence
Applications of Sequence Analyses
• Each codon (3 consecutive RNA bases) specifies one amino acid
of the expressed protein
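The codon rule can be sketched in a few lines of Python. This example assumes the standard genetic code, an input DNA coding strand that starts in frame, and translation that stops at the first stop codon; the function names and the short test sequence are illustrative.

```python
from itertools import product

# Standard genetic code, with codons ordered by base U, C, A, G at each position
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_strand: str) -> str:
    """DNA coding strand to mRNA: same sequence with T replaced by U."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read codons in frame from position 0; stop at the first stop codon (*)."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("ATGGCTGAACAGTTGACTTAA")
print(mrna)              # AUGGCUGAACAGUUGACUUAA
print(translate(mrna))   # MAEQLT
```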
Statistical Analysis of PDB Data: Ca2+ vs. Pb2+
[Figure: coordination diagrams (M = metal ion, L = ligand). Ca2+: pentagonal bipyramidal geometry. Pb2+: holo- and hemi-directed geometries.]
Ligand distribution (%) from the pie charts; SC = side chain, MC = main chain
Pb: SC O 61.0 (Glu 38.4, Asp 20.3, Asn 1.1, Gln 0.6, Thr 0.6); HOH 20.3; S 7.3; Carbonyl 5.6; SC N 5.1; MC N 0.6
Ca, EF-Hand: SC 65.3 (Asp 29.7, Glu 26.6, Asn 6.1, Ser 2.6, Thr 0.3, Gln 0.0); Carbonyl 21.4; HOH 13.3
Ca, Non-EF-Hand: SC 42.9 (Asp 24.5, Glu 10.4, Asn 4.3, Gln 1.3, Ser 1.3, Thr 1.0, Tyr 0.1); Carbonyl 23.9; HOH 33.1
(Kirberger, Wang et al. 2008; Kirberger and Yang 2008; Glusker et al. 1998)
Develop Algorithms/Programs to Address Specific
Problems
• Identify calcium-binding proteins by matching patterns of
known calcium-binding sites in sequences.
Descriptive ID: Sequence Pattern
Prosite
PS00018 (EF-Hand): D-X-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-X(2)-[DE]-[LIVMFYW]
Yang (Pattern 1)
EFH Helix E: X-{DNQ}-X-X-{GP}-{ENSPQ}-X-X-{DQRP}
EFH Loop: [DNS]-X-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-X(2)-[ED]
EFH Helix F: [FLMYVIW]-X-X-{NPS}-{DNEQ}-X(3)
Yang (Pattern 2)
YY00018: X(1)-{DNQ}-X(2)-{GP}-{ENSPQ}-X(2)-{DQRP}-[DNS]-X(1)-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-X(2)-[ED]-[FLMYVIW]-X(2)-{NPS}-{DNEQ}-X(3)
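Patterns in this notation translate almost directly into regular expressions: X is any residue, [..] lists allowed residues, {..} lists forbidden residues, and (n) is a repeat count. The sketch below is a simplified converter written for illustration (it is not the PROSITE scanning tool, and it does not handle range repeats such as X(2,4) or the anchors < and >). It searches the calmodulin fragment from the sequence slides with PS00018.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern (elements joined by '-') to a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count, e.g. X(2) -> core "X", count "2"
        m = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        core, count = m.group(1), m.group(2)
        if core.upper() == "X":
            regex = "."                         # any residue
        elif core.startswith("[") and core.endswith("]"):
            regex = core                        # any one of the listed residues
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"     # any residue except those listed
        else:
            regex = re.escape(core)             # one specific residue
        if count:
            regex += "{" + count + "}"
        parts.append(regex)
    return "".join(parts)

PS00018 = ("D-X-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-"
           "[DENQSTAGC]-X(2)-[DE]-[LIVMFYW]")
sequence = "AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN"

ef_hand = re.compile(prosite_to_regex(PS00018))
hit = ef_hand.search(sequence)
print(hit.start() + 1, hit.group())   # 20 DKDGDGTITTKEL
```

The reported position is 1-based, so the match begins at residue 20 of the fragment.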
The Organism
Time of development
Genes: Segments of DNA or RNA that code for a polypeptide or for a functional RNA.
Genes of an individual organism can change over time.
http://en.wikipedia.org/wiki/File:Gene.png; http://www.mardianinmotion.com/2009/11/anti-aging-medicine-%E2%80%93-hope-hype-or-hucksters
The Tree of Life
Tree of Life
http://www.allvoices.com/contributed-news/4553607-is-chimps-as-smart-as-human; After Pace NR (1997) Science 276:734;
http://en.wikipedia.org/wiki/File:E_coli_at_10000x,_original.jpg