BIOINFORMATICS
REMINDERS
2nd Exam on
Coverage:
Central Dogma of DNA
• Replication
• Transcription
• Translation
Recombinant DNA technology and
molecular biology
Protein analysis
BIOINFORMATICS
Study of the structure of biological
information and biological systems
Integrates theories and tools of
mathematics/statistics, computer
science and information technology
Involves the use of hardware and
software to study vast amounts of
biological data
What is Bioinformatics?
the field of science in which biology,
computer science, and information
technology merge to form a single
discipline
application of information technology
to the storage, management and
analysis of biological information
facilitated by the use of computers
FUNCTIONS
Data Management
Storage
Retrieval
Data Analysis
*Literature/Bibliography, Sequence,
Structure, Taxonomy, Expression, etc.
BIOLOGICAL DATABASES
Systematic data storage/retrieval
Maintained on a regular basis
Can contain various types of data
(integration)
Sequence
Structure
Other pertinent information
Nucleic acids and proteins are most
common
DATABASES
a large, organized body of persistent data,
usually associated with computerized
software designed to update, query, and
retrieve components of the data stored
within the system
Biological databases usually consist of the
nucleic acid sequences of the genetic
material of various organisms as well as
protein sequences and structures
DATABASES
e.g. a nucleotide sequence database typically
contains information such as
contact name
the input sequence with a description of the
type of molecule
the scientific name of the source organism
from which it was isolated
additional requirements
easy access to the information
a method for extracting only that information
needed to answer a specific biological question
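The record fields and the "extract only what is needed" requirement above can be sketched in Python. This is a minimal illustration, not the schema of any real database: the class `SequenceRecord`, the function `find_by_organism`, the `accession` field, and the two sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    accession: str      # hypothetical identifier, not named in the slides
    contact_name: str   # who submitted the sequence
    molecule_type: str  # description of the type of molecule, e.g. "DNA"
    organism: str       # scientific name of the source organism
    sequence: str       # the input sequence itself


def find_by_organism(records, organism):
    """Return only the records whose source organism matches the query."""
    return [r for r in records if r.organism == organism]


# Made-up records, for illustration only.
records = [
    SequenceRecord("X0001", "A. Researcher", "DNA",
                   "Saccharomyces cerevisiae", "ATGGCC"),
    SequenceRecord("X0002", "B. Researcher", "mRNA",
                   "Homo sapiens", "AUGGCU"),
]

hits = find_by_organism(records, "Homo sapiens")
print([r.accession for r in hits])
```

A real database adds indexing and a query language on top of this idea, but the principle is the same: structured fields make selective retrieval possible.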
DATABASES
• Sequence
– GenBank, European Nucleotide Archive
(ENA) and DNA Data Bank of Japan
(DDBJ); managed by the International
Nucleotide Sequence Database
Collaboration (INSDC)
– UniGene
– Saccharomyces Genome Database
(SGD)
– UniProtKB (UniProtKB/Swiss-Prot or
UniProtKB/TrEMBL)
– ExPASy
DATABASES
Structure
Nucleic Acid Database (NDB)
Protein Data Bank (PDB)
Worldwide Protein Data Bank (wwPDB)
ExPASy
DATA MINING
Process by which testable hypotheses
are created regarding function/structure
of gene/protein of interest through
identifying similar sequences in “more
established” organisms
Tools:
Text-term search
Sequence similarity search
Machine Learning
Studies methods and the design of
computer programs that improve
based on past experience
Why?
New methods are being introduced
Old ones should be improved
“Units” of Information
DNA (genome)
RNA (transcriptome)
Protein (proteome)
What is Being Analyzed?
Sequence
Structure
Interactions
Pathways
Mutations/Evolution
Why?
Increasing amount of biological
information entails
Organization
Archiving
Global unification/harmonization
More biological discoveries
Functional/Structural similarities
Phylogenetic/Evolutionary patterns
Applications
Medicine
Pharmaceuticals
Biotechnology
Agriculture
STRUCTURE
DATABASES
Molecular Data
• When you draw a molecule,
– You start with atoms
– Then proceed with the structure
– And the three-dimensional data
• What can be stored?
– Coordinates
– Sequences
– Chemical graphs
• Atoms and bonds
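A chemical graph (atoms as nodes, bonds as edges) can be sketched with plain Python containers. This is only an illustration using a water molecule; the names `atoms`, `bonds`, and `neighbors` are hypothetical and do not reflect how PDB or MMDB store their data.

```python
# Water as a chemical graph: atoms are nodes, bonds are edges.
atoms = {0: "O", 1: "H", 2: "H"}   # atom index -> element symbol
bonds = [(0, 1), (0, 2)]           # each pair is one bond between two atoms


def neighbors(atom_index, bonds):
    """Return the indices of all atoms bonded to the given atom."""
    result = []
    for a, b in bonds:
        if a == atom_index:
            result.append(b)
        elif b == atom_index:
            result.append(a)
    return result


# The oxygen is bonded to both hydrogens.
print([atoms[i] for i in neighbors(0, bonds)])
```

Coordinates would be a third table keyed by the same atom indices, which is how the graph and the three-dimensional data stay linked.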
Databases
Protein Data Bank (PDB)
Molecular Modeling Database (MMDB)
Techniques in the
Laboratory
X-ray Crystallography
Nuclear Magnetic Resonance
Formats
PDB
mmCIF
MMDB
Structure Viewers
Cn3D
RasMol
WebMol
Mage
VRML
CAD
Swiss PDB Viewer
Promises of bioinformatics
Medicine
Knowledge of protein structure facilitates drug
design
Understanding of genomic variation allows the
tailoring of medical treatment to the individual’s
genetic make-up
Genome analysis allows the targeting of
genetic diseases
The effect of a disease or of a therapeutic on
RNA and protein levels can be elucidated
The same techniques can be applied to
biotechnology, crop and livestock improvement,
etc...
Challenges in bioinformatics
Explosion of information
Need for faster, automated analysis to process
large amounts of data
Need for integration between different types of
information (sequences, literature,
annotations, protein levels, RNA levels etc…)
Need for “smarter” software to identify
interesting relationships in very large data sets
Lack of “bioinformaticians”
Software needs to be easier to access, use
and understand
Biologists need to learn about the software, its
limitations, and how to interpret its results
SEQUENCE
ALIGNMENT
Two or More Sequences
Measure similarity
Determine correspondences between
residues
Find patterns of conservation
Derive evolutionary relationships
Alignment
Correspondences of nucleotides/amino
acids in two sequences or more are
assigned
An assignment of correspondences that
preserves the order of the residues
within the sequences is an alignment
Gaps are used to achieve this
Sequence alignment refers to the
identification of residue-residue
correspondences
Uses
Homology
Similarities
“Ancestry”
Genome annotation
Assigning structure and function to
genes
Database queries
For newly-discovered/unknown
sequences
Tools
• Dot Plots
– Diagonal lines of dots showing similarities
between two sequences
• Scoring Matrices
– Score reflects quality of each possible
alignment; best possible score is identified
– Scoring scheme is crucial
– PAM (Point Accepted Mutations) and
BLOSUM (BLOCKS Substitution Matrix)
• Dynamic Programming
– Algorithmic technique that reuses previous
computations
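A dot plot is simple enough to sketch in a few lines of Python. The function name `dot_plot` and the two example sequences are hypothetical; the sketch marks every exact residue match with no windowing or filtering, which real tools usually add to reduce noise.

```python
def dot_plot(seq_a, seq_b):
    """Return one text row per residue of seq_a.

    A '*' marks positions where the residue of seq_a equals the
    residue of seq_b; '.' marks everything else.
    """
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


# Similar regions show up as diagonal runs of '*'.
for row in dot_plot("GATTACA", "GATCACA"):
    print(row)
```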
Scoring
Penalties/Scores
Match (e.g. A – A)
Mismatch (e.g. A – C)
Gap (e.g. A – _)
• Linear Gap Penalty: Uniform
• Affine Gap Penalty: Gap Existence vs. Gap
Extension
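The difference between the two gap penalties can be made concrete with a short Python sketch. The penalty values and function names are assumptions chosen for illustration, and the affine formula shown (opening cost for the first gap position, extension cost for each further position) is one common convention; some programs charge the extension cost for every position instead.

```python
def linear_gap_penalty(length, per_position=-2):
    """Uniform penalty: every gap position costs the same."""
    return per_position * length


def affine_gap_penalty(length, gap_open=-5, gap_extend=-1):
    """Opening a gap is expensive; extending it is cheaper."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for n in (1, 3, 10):
    print(n, linear_gap_penalty(n), affine_gap_penalty(n))
```

With these example values, one long gap is penalized less under the affine scheme than under the linear one, while many separate short gaps are penalized more.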
Local vs. Global Alignments
Global Alignment
Similarities across most or all of the
length of two sequences
Local Alignment
Similarities between specific parts of
two sequences
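The contrast between the two can be sketched with one scoring function in Python. This is an assumption-laden illustration: the function name `alignment_score` and the scores (match +1, mismatch -1, linear gap -2) are hypothetical. The global branch follows Needleman-Wunsch-style scoring and the local branch follows Smith-Waterman-style scoring, where cell values are never allowed to drop below zero.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Return the best global or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalized along the first row and column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local: a poor region resets to zero instead of going negative.
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)
    # Global score sits in the bottom-right cell; local score is the maximum cell.
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "GGACGTCC"))
print(alignment_score("ACGT", "GGACGTCC", local=True))
```

The short sequence is an exact substring of the long one, so the local score rewards the shared part only, while the global score also pays for the unmatched flanking residues.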
Programs
Pairwise Sequence Alignment
BLAST
VAST
FASTA
Multiple Sequence Alignment
MAFFT
Needleman-Wunsch
Algorithm
• Can be used for global alignments (and,
with modification, local alignments)
• Maximum-value function
• A simple scoring scheme is assumed
• Three steps
– Initialization
– Matrix fill (scoring)
– Traceback (alignment)
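The three steps can be sketched in Python as follows. This is a minimal illustration for global alignment under an assumed simple scoring scheme (match +1, mismatch -1, linear gap -1); the function name `needleman_wunsch` and its parameters are hypothetical. When several alignments tie for the best score, this sketch returns only one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1

    # Step 1: initialization. First row and column hold cumulative gap costs.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Step 2: matrix fill. Each cell takes the maximum of three moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Step 3: traceback. Walk from the bottom-right cell back to the origin.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


total, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
print(total)
print(aligned_a)
print(aligned_b)
```

The matrix fill is where dynamic programming pays off: each cell reuses three previously computed cells instead of re-scoring every possible alignment.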