Bioinformatics
Introduction
Acknowledgments
• These slides and exercises were prepared as
Bioinformatics Teaching Modules developed
by Elizabeth Murray, Ph.D. and Andrew
Rieser, Integrated Science and Technology,
Marshall University.
• Development of the slide shows and exercises
was funded in part by the National Center for
Research Resources (NCRR) of the NIH, Grant
#P20 RR16477.
Acknowledgments
• These slides have been inspired by many
sources available on the Internet. We have
tried to acknowledge the contributions of
others and the source of the images in the
notes.
• If we have overlooked an acknowledgment,
please let us know and we will correct this.
• We are basing some of the examples on the
excellent new text “Bioinformatics and
Functional Genomics” by Jonathan Pevsner
(Wiley-Liss, 2004).
Computational Biology vs.
Bioinformatics
What is Computational Biology? The development and
application of data-analytical and theoretical methods,
mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and
social systems.
What is Bioinformatics? Research, development, or
application of computational tools and approaches for
expanding the use of biological, medical, behavioral or
health data, including those to acquire, store, organize,
archive, analyze, or visualize such data.
-NIH Biomedical Information Science and Technology Initiative Consortium
What is Informatics?
• The term informatics is widely used in both
health care and computer science.
• Computer specialists use the term informatics
for computer hardware, software, and
information theory.
• Medical informatics includes all data
management in a hospital, from patient
records, billing, and images to medical
literature.
A Good Definition
• Bioinformatics is the use of computers
for the acquisition, management, and
analysis of biological information.
• It incorporates elements of molecular
biology, computational biology, database
computing, and the Internet
• The key element of the definition is
information management
What Kinds of Information?
• Bioinformatics deals with any type of data that is of
interest to biologists
– DNA and protein sequences
– Gene expression (microarray)
– Articles from the literature and databases of
citations
– Images of microarrays or 2-D protein gels
– Raw data collected from any type of field or
laboratory experiment
– Software
• The analysis of DNA sequence data dominates the
field of bioinformatics, but the term can be used to
describe any type of biological data that can be
recorded as numbers or images and handled by
computers.
Who Works in Bioinformatics?
• Bioinformatics is clearly a multidisciplinary field including:
– computer systems management
– networking, database design
– computer programming
– molecular biology
How to Get a Job in
Bioinformatics
• Few scientists describe themselves as
specialists in bioinformatics.
• It is difficult to train people to specialize
in this field, since using computer tools to
analyze data requires different skills than
designing those tools.
• Other specialists create the mathematical
algorithms used to build the tools.
• Strong knowledge of molecular biology is
also needed to frame meaningful questions
and problems for software development
and analysis.
Every Molecular Biologist must be
“Bioinformatics Literate”
• Most biologists are “users” not “developers”
of software and algorithms.
• This series of presentations and exercises
is intended to help you be a knowledgeable
user of software packages and be able to
frame interesting questions and interpret
your results.
A Good Day Using Bioinformatics
• A scientist studying a
model organism,
Arabidopsis, finds a
T-DNA insertional
mutation in a gene they
are studying.
• They use the T-DNA
sequence as a probe to
hybridize to a genomic
DNA library and identify
a clone of the genomic
DNA.
Now they can determine the sequence of the gene they
mutated with T-DNA.
A Good Day Using Bioinformatics
• The scientist enters the sequence into a
search tool (BLAST, FASTA) and
compares their DNA sequence with all
the DNA sequences in all the databases.
• The scientist finds a group of sequences
related to the gene with the T-DNA insertion.
• BUT the scientist doesn’t know anything
about those related genes.
A Good Day Using Bioinformatics
• The scientist can:
– Search publications on the related gene to
determine the gene function in other organisms (or
even their own organism).
– Look at the structure of the domains in related
genes to analyze function.
– Compare sequences with those of other organisms
to develop “trees” of sequence relationships.
– Analyze the promoter sequence of the gene to see
what transcription factor binding sites are there.
– Analyze expression data if the gene was included
on microarrays.
• All without lifting a pipette or thawing a tube!
A Good Day Using Bioinformatics
• Now the scientist can use bioinformatics to
guide their next experiments in the lab.
– Design PCR primers to amplify the DNA
– Search the sequence for restriction enzyme cut
sites for cloning the DNA for additional
experiments.
– Test hypotheses about structure and function of
the protein suggested by the sequence similarities.
If the protein looks like it has a kinase domain,
clone, express and purify it to see if it actually is a
kinase!
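The restriction-site search mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular package: the function name and the example sequence are hypothetical, and only three well-known recognition sequences are included.

```python
# Recognition sequences for three common restriction enzymes.
# The table is deliberately tiny; real tools use much larger enzyme lists.
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_restriction_sites(sequence, sites=RECOGNITION_SITES):
    """Return {enzyme: [0-based start positions]} for every site found."""
    sequence = sequence.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = sequence.find(site)
        while start != -1:
            positions.append(start)
            # Continue one base further so overlapping sites are not missed.
            start = sequence.find(site, start + 1)
        hits[enzyme] = positions
    return hits


# Hypothetical example sequence, made up for illustration only.
example = "ttgaattcgcggatccatgaattcaa"
print(find_restriction_sites(example))
```

Because these three recognition sequences are palindromic, scanning one strand is enough; an enzyme with a non-palindromic site would also require a scan of the reverse complement.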
Introduction to Molecular
Genetics
• Using these slides requires some familiarity
with the principles of molecular biology and
genetics.
• If you are from a mathematics or computer
science background, the information in these
slides may be too jargon-filled and detailed
for you.
• There are many excellent resources on the
Internet to help you learn some of the basic
terminology of molecular biology.
Excellent Introductory
Resources
• The US Department of Energy has created a
useful Primer on Molecular Genetics.
– http://www.ornl.gov/sci/techresources/Human_Genome/publicat/primer/toc.html
• On Line Biology Textbook
– http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html
• NCBI’s Science Primer
– http://www.ncbi.nlm.nih.gov/About/primer/index.html
The Challenges of Molecular
Biology Computing
• The big dataset problem
• DNA sequencing
• Pairwise and Multiple Alignments
• Similarity searching the databanks
• Structure-function relationships: Can
sequence patterns predict function?
• Phylogenetic analysis: Sequence
conservation across evolution
• Genomics
The Big Dataset Problem
• Biologists have been very successful in
finding the sequences of DNA and
protein molecules
– Automated DNA sequencers
– The Human Genome Project
– High throughput sequencing of cDNAs
(ESTs)
• Information scientists have to develop
tools to keep up with the data
The Big Dataset Problem
• Information is being collected, organized, and
made available in databases:
– GenBank is the central sequence information
database in the United States
– Data is shared between GenBank and European
Molecular Biology Laboratory (EMBL) and the DNA
Database of Japan (DDBJ)
– All sequence data submitted to any of these
databases is automatically integrated into the
others.
– Sequence data is also incorporated from the
Genome Sequence Data Base (GSDB) and from
patent applications.
The Big Dataset Problem
• These presentations will familiarize students
with these databases and their organization.
• Students will learn to enter data into the
databases and search for and download data
from the databases.
• Students will learn to use some of the
additional bioinformatics tools used to
organize the databases (LocusLink, COGs,
OMIM, SNP, UniGene and others).
DNA Sequencing
• One technician with an automated DNA
sequencer can produce over 20 kb (kilobases)
of raw sequence data per day.
• The real challenge of DNA sequencing is in
the analysis of the data
• DNA sequence reads of ~500 base pairs
must be assembled into complete genes and
chromosomes
• These 500 bp reads have errors of both
incorrect bases and insertion/deletions.
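The assembly step can be pictured as merging reads that overlap at their ends. The sketch below is a toy illustration under strong assumptions: it requires an exact suffix-prefix match, so it ignores the base-call and insertion/deletion errors noted above, and the function names and `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first and stop at the first exact match.
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left, right, min_overlap=3):
    """Merge two reads into one contig, or return None if they do not overlap."""
    length = overlap_length(left, right, min_overlap)
    if length == 0:
        return None
    return left + right[length:]


# Two made-up reads that share the four bases "GTAC".
print(merge_reads("ATGCGTAC", "GTACCTGA"))
```

Real assemblers must tolerate mismatches and gaps in the overlap, which is why alignment rather than exact string matching is used in practice.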
DNA Sequencing
• These presentations allow students to become
familiar with different strategies for genome
sequencing projects.
• Students will learn to analyze electronic DNA
sequence files and to use the Accelrys
Wisconsin Package Software to assemble
DNA sequences from such projects.
Pairwise and Multiple Alignments
• Pairwise alignment is the basis of similarity
searching
• Pairwise alignment has been "solved" as a
computational problem through dynamic
programming
• However, the "optimal" alignment calculated
by the computer may not be the best
representation of the true biological
alignment.
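As a minimal sketch of how dynamic programming yields an optimal global alignment score, the following Python function fills the standard scoring table one cell at a time. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; they are not taken from these slides or from any particular package.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

The score is "optimal" only with respect to the chosen scoring values; changing the match, mismatch, or gap parameters can change which alignment wins, which is one reason the computed optimum may differ from the biologically correct alignment.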
Pairwise and Multiple Alignments
• Multiple Alignment is the basis for the
analysis of protein families and functional
domains.
• When pairwise alignment is expanded to
compare multiple sequences, it becomes a
computationally huge problem.
• To reduce the enormous number of possible
alignments, a simplified heuristic (approximate)
algorithm known as progressive pairwise
alignment is used.
• Since this problem is so complex, it is not
possible to mathematically define a truly
optimal alignment of multiple sequences.
Pairwise and Multiple Alignments
• These presentations will explain the dynamic
programming algorithm and its application.
• These presentations will allow the student to
distinguish between global and local alignment
algorithms and apply them appropriately.
• Students will learn the significance of the
Needleman/Wunsch and Smith/Waterman
algorithms and their application.
• These slides will allow the student to
understand the role of scoring matrices (PAM
and BLOSUM) and gaps in sequence alignment.
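The practical difference between global and local alignment can be seen in a small local-alignment sketch. In the local version, every cell of the dynamic programming table is floored at zero and the answer is the best cell anywhere in the table, so only the best-matching region is scored. The simple match/mismatch values used here stand in for a real scoring matrix such as PAM or BLOSUM; they and the function name are assumptions for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # extend along the diagonal
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# The two made-up sequences share only the short region "ACG".
print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```

Only two rows of the table are kept because this sketch reports the score alone; recovering the aligned region itself would require storing the full table and tracing back from the best cell.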
Pairwise and Multiple Alignments
• These exercises will allow the student to use,
display and interpret data generated by the
Pairwise and Multiple alignment programs
included in the GCG Wisconsin Package.
These programs include:
• Pairwise Comparison
– Gap; FrameAlign; Compare; DotPlot; GapShow;
ProfileGap
• Multiple Comparison
– PileUp; HmmerAlign; SeqLab®; PlotSimilarity;
Pretty; PrettyBox; MEME; HmmerCalibrate;
ProfileMake; ProfileGap; Overlap; NoOverlap;
OldDistances.
Similarity Searching the
Databanks
• "Are there any sequences in the databanks
similar to my sequence?"
• Directly searching the databanks by
comparing sequences is computationally too
time-consuming.
• The scientist uses timesaving heuristic tools:
FASTA and BLAST
• Meaningful interpretation relies on the
informed judgment of the biologist and
interpretation of the statistics.
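Heuristic search tools save time by first looking for short exact word matches and examining only the regions around them. The sketch below shows that seeding idea alone, under stated assumptions: it is not the actual BLAST or FASTA algorithm, and the function names and the word size of 3 are hypothetical choices.

```python
from collections import defaultdict


def build_word_index(subject, word_size=3):
    """Map every word of length word_size to its start positions in subject."""
    index = defaultdict(list)
    for position in range(len(subject) - word_size + 1):
        index[subject[position:position + word_size]].append(position)
    return index


def find_seed_hits(query, subject, word_size=3):
    """Return (query_position, subject_position) pairs of exact word matches."""
    index = build_word_index(subject, word_size)
    hits = []
    for query_position in range(len(query) - word_size + 1):
        word = query[query_position:query_position + word_size]
        for subject_position in index.get(word, []):
            hits.append((query_position, subject_position))
    return hits


print(find_seed_hits("ACGT", "TTACGTT"))
```

Building the index once and looking words up in it avoids comparing the query against every position of every database sequence, which is where the time saving comes from. The price is that a true similarity containing no exact word match is missed.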
Similarity Searching the
Databanks
• Students will master the popular search tools
BLAST and FASTA (in their many versions)
used to search the databanks and learn to
interpret the significance of the statistics
and output from these programs.
• Students will learn additional sequence
searching and retrieval programs within the
GCG Wisconsin Package (FrameSearch;
MotifSearch; ProfileMake; ProfileSegments;
FindPatterns; Motifs; WordSearch;
Segments; Fetch and NetFetch).
Structure-function relationships:
• Sequence patterns that predict function.
– The prediction of the function of protein
molecules from their sequence is one of the most
challenging areas of computational molecular
biology.
• Sequence determines 3-D structure,
structure determines function
– Currently, we can’t predict a 3-D protein structure
from amino acid sequence alone. The best current
approach is based on comparing sequence similarity
to proteins of known structure = "threading"
Structure-function relationships:
• Can predict some aspects of 3-D structure
from sequence:
– α-helix vs. β-sheet
– membrane spanning region
– helix-turn-helix
– signal peptide
• Identifying conserved regions (domains or
motifs).
• Functions of these conserved domains are
defined by laboratory research.
• Domain databases can be used to scan any
unknown protein sequence for the presence of
over a thousand known domains.
Structure-function relationships:
• Databases of important conserved elements
within DNA sequences have been developed:
– transcription factor binding sites
– restriction enzyme recognition sites
• Some 3D RNA structures can also be
predicted based strictly on sequence
– by sequence comparison with other known
sequences (such as tRNA)
– by simple detection of stem-loop structures as
inverted repeats
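Detecting a candidate stem-loop as an inverted repeat can be sketched as follows: look for a short stretch that is followed, after a small loop, by its own reverse complement. This is an illustrative sketch only. It considers just Watson-Crick pairs (no G-U pairs), requires a perfect stem, and its function names and default lengths are assumptions.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string containing only A, C, G, U."""
    return rna.translate(COMPLEMENT)[::-1]


def find_stem_loops(rna, stem_len=4, min_loop=3, max_loop=8):
    """Return (stem_start, partner_start, loop_length) for perfect stems."""
    rna = rna.upper()
    found = []
    for start in range(len(rna) - stem_len + 1):
        partner = reverse_complement(rna[start:start + stem_len])
        for loop in range(min_loop, max_loop + 1):
            partner_start = start + stem_len + loop
            # A slice running past the end is shorter than the stem, so it cannot match.
            if rna[partner_start:partner_start + stem_len] == partner:
                found.append((start, partner_start, loop))
    return found


# Made-up sequence: stem GGGC, loop AAAA, complementary stem GCCC.
print(find_stem_loops("GGGCAAAAGCCC"))
```

A hit from such a scan is only a candidate; energy-based folding programs are needed to judge whether the structure is likely to form.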
Structure-function relationships:
• Students will learn to use PubMed and other
literature databases to obtain on-line journal
articles, abstracts and texts.
• Students will analyze proteins using software
which identifies sequence motifs, predicts
peptide properties, looks at secondary
structure, hydrophobicity, and antigenicity,
and identifies repeats and regions of low
complexity.
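One of the peptide properties listed above, hydrophobicity, is commonly profiled with a sliding-window average. The sketch below assumes the Kyte-Doolittle hydropathy scale and a window of 7 residues; the slides do not say which scale or window the course software uses, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Average hydropathy of each window of residues along the protein."""
    # A non-standard residue letter raises KeyError rather than being guessed.
    values = [KYTE_DOOLITTLE[residue] for residue in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up peptide: a hydrophobic stretch flanked by charged residues.
print(hydropathy_profile("KKDEILVLLAVIGKKDE"))
```

Sustained high averages over a long window are the usual hint of a membrane-spanning region, though the threshold and window length are choices the analyst has to make.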
Structure-function relationships:
• Students will analyze sequences to
predict RNA or DNA structure. GCG
Wisconsin Package programs include
MFold, PlotFold; StemLoop.
• Students will use Gene Prediction
software packages available on the
internet, including Genefinder, Genscan
and GrailII.
Structure-function relationships:
• Students will learn to view 3-D protein
structures using Chime, Cn3D, Mage,
RasMol, and Swiss-PdbViewer (Spdbv).
• Students will design primer pairs using
Oligo3, Prime, PrimePair and TempMelt
Phylogenetic Analysis
• There are evolutionary assumptions
underlying the science of molecular
sequence analysis.
– evolution = mutation of DNA sequences
– two species that have genes that are
similar in sequence are more closely related
than are two species that have less
sequence similarity.
• It is possible to collect sequence data
from several different organisms, add
up the differences, and estimate their
relationships.
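"Adding up the differences" can be made concrete with the simplest distance measure: the fraction of aligned positions at which two sequences differ. The sketch below assumes the sequences are already aligned to equal length; the function names and the example sequences are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty, and equal in length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def distance_matrix(aligned):
    """Pairwise distances for a dict mapping names to aligned sequences."""
    names = list(aligned)
    return {
        (first, second): p_distance(aligned[first], aligned[second])
        for first in names
        for second in names
    }


# Made-up aligned sequences for three hypothetical species.
example = {"species_1": "ACGTACGT", "species_2": "ACGTACGA", "species_3": "TCGAACGA"}
for pair, distance in distance_matrix(example).items():
    print(pair, distance)
```

This raw count treats every position and every kind of change alike, so it underestimates divergence when the same site has mutated more than once.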
A Phylogenetic Tree
Phylogenetic Analysis
• There are many controversies about and
objections to such simplistic analyses.
– Not all DNA sequences mutate at the same
rate: protein coding regions mutate more
slowly than non-coding regions.
– Some positions in protein coding DNA
sequences are more free to mutate than
others
– Parsimony vs. maximum likelihood methods
of measuring distance.
Phylogenetic Analysis
• Students will investigate the relationships
within an aligned set of sequences through
computation of the pairwise distance between
sequences, construction of phylogenetic
trees, or calculation of the degree of divergence
of two protein coding regions.
• The student will be able to collect a set of
related DNA sequences and calculate
phylogenetic distances and create a tree
using software programs in GCG Wisconsin
package (PAUPSearch; PAUPDisplay;
GrowTree; Diverge).
Genomics
• What is genomics?
– An operational definition: The application
of high throughput automated technologies
to molecular biology.
– A philosophical definition: A holistic or
systems approach to the study of
information flow within a cell.
Genomics
• Genomics Technologies include:
– Automated DNA sequencing and annotation
of sequences
– Gene Finding and Pattern Recognition
– DNA microarrays
– gene expression (measuring RNA levels)
– single nucleotide polymorphisms (SNPs)
– Protein chips
– Protein-protein interactions
Genomics
• The student will learn to use microarray
software available from NCBI and Marshall
University’s microarray facility to analyze
gene expression data.
• The student will learn to use GCG Wisconsin
package programs designed for genome
analysis, including TestCode, Codon
Preferences, Frame, Repeat, FindPatterns,
Composition, CodonFrequency, Window,
StatPlot, Consensus, FitConsensus, Xnu and
Seg.