Transcript Genomics
Bioinformatics
Joshua Gilkerson
Albert Kalim
Ka-him Leung
David Owen
1
What is Bioinformatics?
Bioinformatics: “The collection,
classification, storage, and analysis of
biochemical and biological information using
computers especially as applied in molecular
genetics and genomics.” (Dictionary.com)
Molecular genetics: “The branch of genetics
that deals with the expression of genes by
studying the DNA sequences of
chromosomes.” (Dictionary.com)
2
What is Bioinformatics? (cont.)
Another definition of molecular genetics: “The
branch of genetics that deals with hereditary
transmission and variation on the molecular
level.” (Dictionary.com)
Genomics: “A branch of biotechnology concerned
with applying the techniques of genetics and
molecular biology to the genetic mapping and
DNA sequencing of sets of genes or the complete
genomes of selected organisms using high-speed
methods, with organizing the results in databases,
and with applications of the data (as in medicine
or biology).” (Dictionary.com)
3
How old is the discipline?
The answer to this one depends on
which source you choose to read.
From T K Attwood and D J Parry-Smith's
"Introduction to Bioinformatics",
Prentice-Hall 1999 [Longman Higher
Education; ISBN 0582327881]: "The term
bioinformatics is used to encompass
almost all computer applications in
biological sciences, but was originally
coined in the mid-1980s for the analysis
of biological sequence data."
4
How old is the discipline? (cont.)
From Mark S. Boguski's article in
the "Trends Guide to
Bioinformatics" Elsevier, Trends
Supplement 1998 p1:
"The term 'bioinformatics' is a
relatively recent invention, not
appearing in the literature until
1991 and then only in the context of
the emergence of electronic
publishing..."
5
Bioinformatic Research
up to 2005
DNA sequence
Gene expression
Protein expression
Protein structure
Genome mapping
Metabolic networks
Regulatory networks
Trait mapping
Gene function analysis
Scientific literature
6
What remains to be done?
Comparative Genomics
Description of mRNAs, proteins (identity and structure)
Functional analyses
Detailed understanding of development, regulation, variation
7
The Human Genetic Code
8
Bioinformatics Activity: Where Is
Bioinformatics Done?
The biggest and best source of
bioinformatics links is the Genome
Web at the Rosalind Franklin Centre
for Genomics Research at the
Genome Campus near Cambridge,
United Kingdom.
Others: Research Centers,
Sequencing Centers, and "Virtual"
Centers (for example consortia and
communities).
9
Research Centers
Centro Nacional de Biotecnologia (CNB), Madrid, Spain.
Computational Biology and Informatics Laboratory at the
University of Pennsylvania, Philadelphia, USA
CIRB: Centro Interdipartimentale di Ricerche
Biotecnologiche, Bologna, Italy
Cold Spring Harbor Labs, New York, USA
European Molecular Biology Laboratory (EMBL),
Heidelberg, Germany.
Généthon, France.
GIRI: Genetic Information Research Institute, California,
USA.
MRC Human Genetics Unit, Edinburgh, United Kingdom.
MRC Rosalind Franklin Centre for Genomics
Research (RFCGR), Hinxton, United Kingdom.
10
Sequencing Centers
The Department of Genome
Analysis at the Institute of
Molecular Biotechnology, Jena,
Germany.
The Australian Genome Research
Facility, Australia.
Baylor College of Medicine, USA.
Michael Smith Genome Sciences
Centre, Canada.
11
Virtual Centers
International Center for
Cooperation in Bioinformatics
network (ICCBnet):
http://www.iccbnet.org/
Belgian EMBnet node:
http://www.be.embnet.org/
12
Online Resources: What
Bioinformatics Websites Are
There?
Blogs
Information
Directories
Portals
Societies
Tools
Tutorials
13
Blogs
Bioinformatics.Org is a bioinformatics
blog.
The Bio-Web (http://cellbiol.com/) links
to resources online for molecular and
cell biologists and covers current news
in various biological/computational
fields.
Genehack (http://genehack.org/)
is one of the first bioinformatics blogs.
14
Information
The Australian National Genomic Information
Service (ANGIS) is operated by the Australian
Genomic Information Centre
(http://www.angis.org.au/new/about/generalinfo.html#AGIC,
currently at the University of Sydney) to offer
software, databases, documentation, training and
support for biologists.
"The University of Maryland AgNIC gateway
(http://agnic.umd.edu/) is a guide to quality
agricultural biotechnology information on the
Internet."
15
Directories
Christy Hightower, Engineering Librarian
at the Science and Engineering Library,
University of California Santa Cruz has
already done this better than me.
Visit her excellent article
(http://www.istl.org/istl/02winter/internet.html) about
bioinformatics Net resources in Issues
in Science and Technology
Librarianship.
16
Societies
Humberto Ortiz Zuazaga kindly
introduced The International
Society for Computational Biology
(http://www.iscb.org/) which he
points out "has links to programs
of study and online courses in
computational biology and to job
postings".
17
Collection of Tools
Bioinformatics.Org hosts a collection of
bioinformatics tools.
The Rosalind Franklin Centre's
"GenomeWeb"
(http://www.rfcgr.mrc.ac.uk/GenomeWeb/).
Now of historical interest only is the
legendary "Pedro's Molecular Biology
Search and Analysis Tools"
(http://www.public.iastate.edu/~pedro/research_tools.html),
which provides a collection of
WWW Links to Information
and Services Useful to Molecular Biologists.
18
Portals
Bioinformatics.Org is an international organization which
promotes freedom and openness in the field of bioinformatics
and is the root domain of a damned fine Website.
CCP11 (Collaborative Computational Project 11,
http://www.rfcgr.mrc.ac.uk/CCP11/index.jsp) is another product
of the UK's Genome Campus. CCP11 is funded by the BBSRC
and is hosted at the MRC Rosalind Franklin Center for
Genomics Research RFCGR located on the Wellcome Trust
Genome Campus, Cambridge.
Jennifer Steinbachs runs compbiology.org which is a general
computational biology site as well as being a portal to her own
work.
BioPlanet (http://www.bioplanet.com/index.php) is well worth
visiting. It describes itself as "a not-for-profit site, funded with
our resources, for [its users'] benefit."
ColorBasePair (http://www.colorbasepair.com/) is a densely
packed portal with lots of bioinformatics links.
19
Genome Project
Ka-Him Leung
20
Genomics
Genome
– complete set of genetic instructions
for making an organism
Genomics
– attempts to analyze or compare the
entire genetic complement of a
species
21
Genomic Issues
Genomic DNA is a linear sequence of 4
nucleotides (A, C, G, T)
DNA forms the double helix by pairing with its
reverse complement (A-T, G-C)
Genomic DNA contains many genes, each of
which is formed from one or more exons
(stretches of genomic DNA), separated by
introns
A gene is copied into complementary RNA in a
process called transcription (U substitutes for T)
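The pairing and transcription rules above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and sequences are assumed to be plain uppercase strings over A, C, G, T. The RNA is complementary to the template strand, so it reads like the coding strand with U in place of T.

```python
# Illustrative helpers; the names are assumptions, not part of the original slides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the strand that pairs with `dna` (A-T, G-C), read in the opposite direction."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(coding_strand: str) -> str:
    """Return the RNA copy of a gene: the coding strand with U substituted for T."""
    return coding_strand.replace("T", "U")


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(transcribe("ATGGCT"))
```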
22
Genomic Issues (cont.)
DNA sequencing is the process of determining the exact
order of the 3 billion chemical building blocks (called
bases and abbreviated A, T, C, and G) that make up the
DNA of the 24 different human chromosomes.
In the human genome, about 3 billion bases are arranged
along the chromosomes in a particular order for each
unique individual.
One million bases (called a megabase and abbreviated
Mb) of DNA sequence data is roughly equivalent to 1
megabyte of computer data storage space. Since the
human genome is 3 billion base pairs long, 3 gigabytes
of computer data storage space are needed to store the
entire genome.
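The storage estimate above is simple arithmetic and can be checked with a short sketch. The one-byte-per-base figure is the slide's; the 2-bit packing is an extra assumption added here to show that a four-letter alphabet could in principle be stored more compactly. The function name is hypothetical.

```python
def storage_bytes(n_bases: int, bits_per_base: int = 8) -> int:
    """Bytes needed to hold n_bases, rounded up to a whole byte."""
    return (n_bases * bits_per_base + 7) // 8


HUMAN_GENOME_BASES = 3_000_000_000  # figure quoted on the slide

# One text character (one byte) per base, as the slide assumes.
one_byte_each = storage_bytes(HUMAN_GENOME_BASES)
# Assumption, not from the slide: A, C, G, T packed into 2 bits each.
two_bits_each = storage_bytes(HUMAN_GENOME_BASES, bits_per_base=2)

if __name__ == "__main__":
    print(one_byte_each / 1e9, "gigabytes at one byte per base")
    print(two_bits_each / 1e6, "megabytes at two bits per base")
```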
23
Different Genomics
Comparative Genomics: the management
and analysis of the millions of data
points that result from Genomics
Functional Genomics: ways of identifying
gene functions and associations
Structural Genomics: emphasizes high-throughput, whole-genome analysis.
24
History of Genome
1977
– First complete DNA genome sequence is published
• ΦX174 (a bacteriophage) - 5,386 base pairs coding nine proteins. (~5Kb)
1995
– First bacterial genome (Haemophilus influenzae) sequenced (1.8 Mb)
1996
– Saccharomyces cerevisiae genome sequenced (baker's yeast, 12.1 Mb)
1997
– E. coli genome sequenced (4.7 Mbp)
1999
– Sequence of first human chromosome (chromosome 22) completed
2000
– A. thaliana genome (flowering plant) (100 Mb)
– D. melanogaster genome (fruit fly) (180 Mb)
2001
– 10,000 full-length human cDNAs sequenced
2003
– Human genome sequence completed
25
Human Genome Project
The U.S. Human Genome Project was a 13-year
effort coordinated by the Department of Energy
and the National Institutes of Health.
It started in 1990, with the aim of mapping and
understanding all the genes of human beings.
In June 2000, scientists completed the first
working draft of the human genome.
A high-quality, "finished" full sequence was
completed in April 2003.
26
Goals of HGP
– identify all the approximately 20,000-25,000 genes in
human DNA,
– determine the sequences of the 3 billion chemical
base pairs that make up human DNA,
– store this information in databases,
– improve tools for data analysis,
– transfer related technologies to the private sector, and
– address the ethical, legal, and social issues (ELSI)
that may arise from the project.
27
DNA Sequencing Process
Mapping
– Identify set of clones that span region of genome to be
sequenced
Library Creation
– Make sets of smaller clones from mapped clones
Template Preparation
– Purify DNA from smaller clones.
– Set up and perform sequencing chemistries
Gel Electrophoresis
– Determine sequences from smaller clones
Pre-finishing and Finishing
– Specialty techniques to produce high quality sequences
Data Editing and Annotation
– Quality assurance; Verification; Biological annotation;
– Submission to public database
28
29
Future of HGP
HGP is the first step in understanding humans at the molecular
level. Work is still ongoing to determine the function of many of
the human genes.
What still needs to be done:
– Gene number, exact locations, and functions
– Gene regulation
– DNA sequence organization
– Chromosomal structure and organization
– Noncoding DNA types, amount, distribution, information content, and
functions
– Coordination of gene expression, protein synthesis, and post-translational events
– Interaction of proteins in complex molecular machines
– Predicted vs. experimentally determined gene function
– Evolutionary conservation among organisms
– Protein conservation (structure and function)
– Proteomes (total protein content and function) in organisms
30
31
Sequence Alignment
Joshua Gilkerson
32
Sequence Alignment
In genomics, many situations arise
when sequences need to be
compared or searched for similar
sub-sequences.
Both of these tasks are aided by
aligning the sequences to one
another.
The two sequences are called the
subject and the query.
33
Local vs. Global
Global alignment aligns the entire query
to the entire subject.
Local alignment aligns a piece of one
sequence to a piece of the other.
Which is used depends on the
application.
Surprisingly, the two problems are computationally
equivalent: both are solved by the same kind of
dynamic programming at the same cost.
Sometimes a mixed local-global alignment is used,
aligning the entire query sequence
against any one part of the subject.
34
Example Alignments
Global Alignment
AGCTCGA--GATTGCTGGACATGCTGCTGCT
|  ||||  ||||||     |||| ||||||
A--TCGAGCGATTGC-----ATGCAGCTGCT
Local Alignment
– Same subject as above
– Query Sequence: GAGAT
AGCTCGAGATTGCTGGACATGCTGCTGCT
|| | |||||     || ||
AGAT GAGAT     GAGAT
35
Model for Alignment
The best alignment is the one
chosen from all possible
alignments that minimizes the
score.
Scoring is done pairwise at each
position along the alignment.
Introducing a gap is more
expensive than extending one
already introduced (affine gap
penalty).
36
Model for Alignment
Score = ∑ gap penalties + ∑ similarity
weights
Gap penalty = open penalty + size * size
penalty
Open penalty and size penalty are constants
>=0.
Similarity weight is zero for same base, >=0
for disparate bases.
For protein sequences, BLOSUM similarity
weights are most commonly used.
37
Scoring Example
Same example as earlier
Using:
– Gap opening penalty of 1
– Gap size penalty of 1
– Similarity scores all 1
AGCTCGA--GATTGCTGGACATGCTGCTGCT
|  ||||  ||||||     |||| ||||||
A--TCGAGCGATTGC-----ATGCAGCTGCT
0210000210000002111100001000000 = 13
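The per-column scoring above can be reproduced with a small sketch. It assumes the model from the previous slides (gap penalty = open penalty + size * size penalty, zero cost for identical bases, a flat cost for any mismatch); the function name and its parameter names are illustrative only.

```python
def alignment_score(top: str, bottom: str,
                    open_penalty: int = 1,
                    size_penalty: int = 1,
                    mismatch: int = 1) -> int:
    """Score a gapped pairwise alignment column by column (lower is better)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap = None  # which row the previous column's gap was in, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            if side != prev_gap:
                score += open_penalty  # a new gap starts here
            score += size_penalty      # every gap column pays the size penalty
            prev_gap = side
        else:
            if a != b:
                score += mismatch      # identical bases cost nothing
            prev_gap = None
    return score


if __name__ == "__main__":
    print(alignment_score("AGCTCGA--GATTGCTGGACATGCTGCTGCT",
                          "A--TCGAGCGATTGC-----ATGCAGCTGCT"))
```

The first column of each gap costs 2 (open plus size) and each further column costs 1, which matches the digit row under the alignment.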
38
Needleman-Wunsch Algorithm
Sequences Q and S
Scoring matrix M, (len(Q)+1) x (len(S)+1)
Similarity matrix s
Gap length penalty g; gap opening penalty 0
M(i,j) - score for best alignment of first i
elements of Q and first j elements of S.
M(0,0) = 0, M(i,0) = i*g, M(0,j) = j*g
M(i,j) = minimum of
– M(i-1,j)+g,
– M(i,j-1)+g,
– M(i-1,j-1)+s(Q(i),S(j))
39
Needleman-Wunsch Example
CAT vs TAG, g=1
s (upper triangle shown):
  A C T G
A 0 1 1 1
C   0 3 1
T     0 0
G       0
M (first row and column initialized):
    C A T
  0 1 2 3
T 1
A 2
G 3
40
Needleman-Wunsch Example
CAT vs TAG, g=1
s (upper triangle shown):
  A C T G
A 0 1 1 1
C   0 3 1
T     0 0
G       0
M (filled in):
    C A T
  0 1 2 3
T 1 2 2 2
A 2 2 2 3
G 3 3 3 2
41
42
Needleman-Wunsch Example
Two equally good alignments:
-CAT       C-AT
  |   and    |
T-AG       -TAG
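The recurrence and the CAT vs TAG example can be checked with a minimal sketch of the matrix fill. The weights are copied from the example's s table and assumed to be symmetric; the names `WEIGHTS`, `s` and `needleman_wunsch` are chosen for this illustration. Here rows follow Q = CAT and columns follow S = TAG, so the result is the transpose of the layout on the slide.

```python
# Pairwise weights from the CAT vs TAG example (upper triangle, assumed symmetric).
WEIGHTS = {
    frozenset("AC"): 1, frozenset("AT"): 1, frozenset("AG"): 1,
    frozenset("CT"): 3, frozenset("CG"): 1, frozenset("TG"): 0,
}


def s(a: str, b: str) -> int:
    """Similarity weight: zero for the same base, table lookup otherwise."""
    return 0 if a == b else WEIGHTS[frozenset((a, b))]


def needleman_wunsch(q: str, subj: str, g: int = 1) -> list:
    """Fill the global-alignment matrix M; M[i][j] scores q[:i] against subj[:j]."""
    rows, cols = len(q) + 1, len(subj) + 1
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        M[i][0] = i * g  # q[:i] aligned against nothing: i gap columns
    for j in range(1, cols):
        M[0][j] = j * g
    for i in range(1, rows):
        for j in range(1, cols):
            M[i][j] = min(
                M[i - 1][j] + g,                                # gap in subj
                M[i][j - 1] + g,                                # gap in q
                M[i - 1][j - 1] + s(q[i - 1], subj[j - 1]),     # pair the two bases
            )
    return M


if __name__ == "__main__":
    for row in needleman_wunsch("CAT", "TAG"):
        print(row)
```

The bottom-right entry is the score of the best global alignment; following the choices that produced each minimum back to the top-left corner recovers the alignments shown above.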
43
Needleman-Wunsch
Runs in O(n²) time for sequences of length n.
Easily generalized to allow gap opening penalty
by using 3 copies of M, one for prefixes ending
with a match, one ending with a gap in each
sequence.
Easily generalized to local alignment by saying
M(i,j) is the best score for an alignment of some
substrings of the sequences ending at i and j. In
practice, this means:
– The first row and column are filled with all zeroes
instead of just the top-left-most position, and no
entry is allowed to rise above zero (so matches must
carry a negative weight for any alignment to score).
– The end of the alignment is at the globally minimal
position, not the lower-right corner.
– The beginning is at the location where backtracking
cannot continue.
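A minimal sketch of the local variant follows. It is written in the more common maximizing convention (reward for a match, penalty for a mismatch or gap, entries floored at zero), which is the description above with the signs flipped. The scoring values and the function name are assumptions for illustration, not taken from the slides.

```python
def local_align(query: str, subject: str,
                match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned query, aligned subject, start index in subject)."""
    rows, cols = len(query) + 1, len(subject) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + pair,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Backtrack from the best cell until a zero entry (or an edge) is reached.
    i, j = best_pos
    q_aln, s_aln = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if query[i - 1] == subject[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            q_aln.append(query[i - 1])
            s_aln.append(subject[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            q_aln.append(query[i - 1])
            s_aln.append("-")
            i -= 1
        else:
            q_aln.append("-")
            s_aln.append(subject[j - 1])
            j -= 1
    return best, "".join(reversed(q_aln)), "".join(reversed(s_aln)), j


if __name__ == "__main__":
    print(local_align("GAGAT", "AGCTCGAGATTGCTGGACATGCTGCTGCT"))
```

With the subject and query from the earlier local alignment example, the best local alignment is the exact occurrence of GAGAT that starts at the sixth base of the subject.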
44
Other Alignment Tools
The Basic Local Alignment Search Tool
(BLAST) is probably the most widely
used tool in genomics.
– Finds local alignments.
– Used on very large sequences (entire
genomes)
Smith-Waterman Algorithm - Adaptation
of Needleman-Wunsch for local
alignments.
FASTA package
45
The Importance of
Bioinformatics and Summary
David Owen
46
The importance of bioinformatics
Traditionally, molecular biology
research was done entirely in a
laboratory.
But the genome projects have
increased the amount of data
enormously. Researchers therefore
need computers to make sense of
the vast amount of data.
47
Challenges
Intelligent and efficient storage of the massive
amounts of data.
Easy and reliable access to the data.
Development of tools which allow the
extraction of meaningful information.
The developer of the tool must also consider the
following:
The user (biologist) might not be an expert with
computers.
The tool must be able to provide access across
the internet.
48
Processes
Three main processes a bioinformatics tool
must handle:
DNA sequence determines protein sequence
Protein sequence determines protein structure
Protein structure determines protein function
The information obtained from these processes
allows us to better understand the biology of
organisms.
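Only the first of these steps, DNA sequence to protein sequence, is simple enough to sketch directly. The example below assumes the standard genetic code, written as a 64-letter string in TCAG codon order, and an input that is an uppercase coding sequence starting in frame. The function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```

The later steps (structure from sequence, function from structure) have no comparable lookup table, which is why they appear among the research areas rather than as solved problems.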
49
Computer Scientist vs. Biologist
Computer scientist:
– Logic
– Problem-solving
– Process-oriented
– Algorithmic
– Optimizing
Biologist:
– Knowledge gathering
– Experimentally-focused
– Exceptions are as common as rules
– Describe work as a story
– Develop conclusions and models
Hence the need for communication between computer scientists and
biologists.
50
Research Areas
Further research areas include:
Sequence alignment
Protein structure prediction
Prediction of gene expression
Protein-protein interactions
Modeling of evolution
51
Future of Bioinformatics
- Integration of a wide variety of data sources.
E.g. Combining the GIS data (maps) and
weather systems, with crop health and
genotype data, allows us to predict successful
outcomes of agricultural experiments.
- Large-scale comparative genomics. E.g. the
development of tools that can do 10-way
comparisons of genomes.
- Modeling and visualization of full networks of
complex systems.
52
Ultimate Goal
Obtain a better understanding of the
biology of organisms through the
examination of biological
information hidden in the vast
amount of data we have.
This knowledge will allow us to
improve our standard of living.
53
References
http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml
http://www.genome.gov/
http://bioinfo.mbb.yale.edu/course/projects/final-4/
http://www.dictionary.com
http://www.ebi.ac.uk/2can/bioinformatics/index.html
http://bioinformatics.ca/workshop_pages/bioinformatics/day1files/1.0_intro_bffo_2005.pdf
54
References (cont.)
http://elegans.uky.edu/520/Lecture/index.html
http://bioinformatics.org/
55