UGA Institute of Bioinformatics

An Introduction to Bioinformatics
(high-school version)
Ying Xu
Institute of Bioinformatics, and Biochemistry and
Molecular Biology Department
University of Georgia
The Basics
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatc
gtgtgggtagtagctgatatgatgcgaggtaggggataggatagc
aacagatgagcggatgctgagtgcagtggcatgcgatgtcgatga
tagcggtaggtagacttcgcgcataaagctgcgcgagatgattgc
aaagragttagatgagctgatgctagaggtcagtgactgatgatcg
atgcatgcatggatgatgcagctgatcgatgtagatgcaataagtc
gatgatcgatgatgatgctagatgatagctagatgtgatcgatggta
ggtaggatggtaggtaaattgatagatgctagatcgtaggta……
……………………………
cell
genes
chromosome
protein
genome and sequencing
metabolic pathway/network
Bioinformatics
(or computational biology)
• This interdisciplinary science … is about
providing computational support to studies
on linking the behavior of cells, organisms
and populations to the information encoded
in the genomes
– Temple Smith
Information Encoded in Genomes
• What information? And how to find and interpret it?
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgag
gtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtag
gtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgac
tgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgcta
gatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggta……
……………………………
• Working molecules (proteins, RNAs) in our cells
bacterial cell
Information Encoded in Genomes
• How to find where protein-encoding genes are in a genome?
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcg
atgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatc
gatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggta…………………………
• A genome is like a book written in “words” consisting of 4
letters (A, C, G, T), and each protein-encoding gene is like an
instruction about how the protein is made
• People have found that six-letter words (e.g., AAGTGC)
occur at different frequencies in genes than in non-gene regions
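A minimal sketch of how such word frequencies could be counted, assuming the region is read as consecutive six-letter words. The function name `word_frequencies`, the `step` parameter, and the toy sequence are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def word_frequencies(sequence, word_length=6, step=6):
    """Return the relative frequency of each word of `word_length` letters.

    `step` is an assumption of this sketch: step=6 reads the sequence as
    consecutive non-overlapping words, step=1 slides one letter at a time.
    """
    sequence = sequence.upper()
    words = [
        sequence[i:i + word_length]
        for i in range(0, len(sequence) - word_length + 1, step)
    ]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy sequence made up for illustration, not real genomic data.
toy_region = "AAAATTAAAGACAAAATTAAAGACAAAATTAAAATT"
frequencies = word_frequencies(toy_region)
for word, frequency in sorted(frequencies.items()):
    print(f"{word}: {frequency:.1%}")
```

Running the same count separately over known gene regions and known non-gene regions would give the two frequency tables that the comparison relies on.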
Information Encoded in Genomes
Frequency in genes (AAA ATT) = 1.4%;
Frequency in genes (AAA GAC) = 1.9%;
Frequency in genes (AAA TAG) = 0.0%;
….
Frequency in non-genes (AAA ATT) = 5.2%
Frequency in non-genes (AAA GAC) = 4.8%
Frequency in non-genes (AAA TAG) = 6.3%
AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT …..
Is this a gene or non-gene region if you have to make a bet?
Information Encoded in Genomes
• Preference model:
– for each 6-letter word X (e.g., AAA AAA), calculate its frequencies in
gene and non-gene regions, FC(X), FN(X)
– calculate X’s preference value P(X) = log (FC(X)/FN(X))
• Properties:
– P(X) is 0 if X has the same frequencies in gene and non-gene regions
– P(X) is positive if X has a higher frequency in gene than in non-gene regions; the larger the difference, the more positive the score
– P(X) is negative if X has a higher frequency in non-gene than in
gene regions; the larger the difference, the more negative the score
• Gene prediction: given a DNA region, calculate the sum of P(X)
values for all 6-letter words X in the region;
– if the sum is larger than zero, predict “gene”
– otherwise predict non-gene
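The preference model above can be sketched in a few lines, using the three example frequencies from the earlier slide. Several choices here are assumptions of the sketch rather than part of the slides: the natural logarithm (the sign of the sum does not depend on the base), a small floor `PSEUDO_FREQ` so that a 0.0% frequency does not break the logarithm, and skipping words that have no table entry.

```python
import math

# Frequencies (as fractions) taken from the example table on the slide.
GENE_FREQ = {"AAAATT": 0.014, "AAAGAC": 0.019, "AAATAG": 0.000}
NON_GENE_FREQ = {"AAAATT": 0.052, "AAAGAC": 0.048, "AAATAG": 0.063}

# Assumption: a small floor replaces zero frequencies so the log is defined.
PSEUDO_FREQ = 1e-6

def preference(word):
    """P(X) = log(FC(X) / FN(X)); None if the word is not in the tables."""
    if word not in GENE_FREQ or word not in NON_GENE_FREQ:
        return None
    fc = max(GENE_FREQ[word], PSEUDO_FREQ)
    fn = max(NON_GENE_FREQ[word], PSEUDO_FREQ)
    return math.log(fc / fn)

def predict_region(sequence, word_length=6):
    """Sum P(X) over consecutive 6-letter words; predict gene if the sum > 0."""
    total = 0.0
    for i in range(0, len(sequence) - word_length + 1, word_length):
        score = preference(sequence[i:i + word_length].upper())
        if score is not None:  # assumption: words without an entry are skipped
            total += score
    return total, ("gene" if total > 0 else "non-gene")

# The region shown on the "make a bet" slide.
region = "AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT"
score, label = predict_region(region)
print(f"sum of P(X) = {score:.2f} -> {label}")
```

Every listed word in that region is more frequent in non-gene regions, so each term is negative and the sketch predicts "non-gene".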
Information Encoded in Genomes
• You just learned your first bioinformatics method for
gene prediction – congratulations!
Information Encoded in Genomes
• Ok, we now have learned how to find genes encoded in a
genome
• How do we find out what they do (their biological functions,
e.g., sensors, transporters, regulators, enzymes)?
Information Encoded in Genomes
• People have observed that similar protein sequences tend to
have similar functions
• Over the years, many genes have been thoroughly studied in
different organisms,
e.g., human, mouse, fly, …., rice, …
– their biological functions have been identified and documented
• For a new protein, scientists can often predict its function
by finding well-studied proteins in other organisms that
have high sequence similarity to it
– This works for ~60% of genes in a newly sequenced genome
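The idea of transferring a function from the most similar well-studied protein can be sketched as follows. This is only an illustration: real tools compare sequences with alignment methods such as BLAST, whereas this sketch uses Python's `difflib.SequenceMatcher` as a crude stand-in. The sequences, annotations, and the 0.6 threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical mini-database: protein sequence -> documented function.
# Sequences and annotations are made up for illustration.
ANNOTATED = {
    "MKTAYIAKQRQISFVKSHFSRQ": "transporter",
    "MSDNELLKKAGVTPEQWAELHG": "regulator",
    "MALWMRLLPLLALLALWGPDPA": "enzyme",
}

def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def predict_function(new_protein, threshold=0.6):
    """Return (function, score) of the most similar annotated protein.

    The function is None when even the best match falls below `threshold`
    (an arbitrary cut-off chosen for this sketch).
    """
    best_seq = max(ANNOTATED, key=lambda seq: similarity(new_protein, seq))
    best_score = similarity(new_protein, best_seq)
    if best_score >= threshold:
        return ANNOTATED[best_seq], best_score
    return None, best_score

# A query that differs from the first database entry at two positions.
function, score = predict_function("MKTAYLAKQRQISFVRSHFSRQ")
print(function, round(score, 2))
```

A query with no close relative in the database gets no prediction, which mirrors the slide's point that the approach does not work for every gene.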
Information Encoded in Genomes
• Scientists have developed computational techniques for
– identifying regulatory signals that control gene transcription
– predicting protein-protein interactions
– elucidating biological networks for a particular function
– … and elucidating much other information
Information Encoded in Genomes
E. coli O157 and O111 are pathogenic to humans while E. coli K-12 is not.
Can we tell why? Which genes or pathways in E. coli O157 and O111
are responsible for the pathogenicity?
Information Encoded in Genomes
(figure panels: random sequence; human chromosome 1; P. furiosus; B. pseudomallei; E. coli O157; E. coli K-12)
Information Encoded in Genomes
Red: prokaryotes
Blue: eukaryotes
Green: plastids
Orange: plasmids
Black: mitochondria
x-axis: average of variations of the k-mer frequencies
y-axis: average barcode similarity among fragments of a genome
Information Encoded in Genomes
• Yes, biologists can derive a lot of information from
genomes now
• … but we are still far from fully understanding any genome,
even those of the simplest living organisms, bacteria
• We can clearly use new ideas from bright young minds –
interested in doing bioinformatics?
Linking Genome Information to
Biological Systems Behaviors
• To fully understand cellular behaviors, we need to
– elucidate information encoded in the genome, and
– understand how the working molecules encoded by the genome behave
according to the physical laws on Earth!
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaa
cagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaag…………………………
protein
gene
Key Drivers of Bioinformatics
• The Human Genome Project has fundamentally changed
biological science
• A key consequence of the genome project is that scientists
learned that they can produce biological data on a massive scale
– genome sequences
– microarray data for gene expression levels
– yeast two-hybrid systems for protein-protein interactions
– … and other “high-throughput” biological data
These data reflect the cellular states, molecular
structures and functions, in complex ways
Key Drivers of Bioinformatics
• … and let bioinformaticians (help to) decipher the
meaning of these data, as with genome sequences
• Together, high-throughput probing technologies and
bioinformatics are transforming biological science into a
new science more like physics
Key Drivers of Bioinformatics
• Like physics, where general rules and laws are taught at
the start, biology will surely be presented to future
generations of students as a set of basic systems .......
duplicated and adapted to a very wide range of cellular
and organismic functions, following basic evolutionary
principles constrained by Earth’s geological history.
– Temple Smith, Current Topics in Computational Molecular Biology
Biomarker Identification
• Our goal is to identify markers in blood that can tell if a
person has a particular form of cancer
… in a similar fashion to doing a pregnancy
test with a test kit, possibly at home
Biomarker Identification
• Microarray gene expression data allow comparative analyses of gene
expression patterns in cancer versus normal tissues
Finding genes showing maximum difference in their expression levels
between cancer and normal tissues
(figure panels: on cancer tissues; on normal tissues)
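One simple way to look for genes with the largest expression difference is to compare the mean expression level in cancer samples with the mean in normal samples and rank genes by the size of the gap. The sketch below assumes log-scale expression values. The gene names and numbers are hypothetical, and a real analysis would also apply a statistical test rather than rely on the difference of means alone.

```python
from statistics import mean

# Hypothetical log2 expression values: gene -> (cancer samples, normal samples).
EXPRESSION = {
    "geneA": ([9.1, 8.8, 9.4], [5.0, 5.3, 4.9]),
    "geneB": ([6.0, 6.2, 5.9], [6.1, 5.8, 6.0]),
    "geneC": ([3.2, 3.0, 3.5], [7.9, 8.1, 8.3]),
}

def rank_by_difference(expression):
    """Sort genes by the absolute difference of mean expression, largest first."""
    differences = {
        gene: mean(cancer) - mean(normal)
        for gene, (cancer, normal) in expression.items()
    }
    return sorted(differences.items(), key=lambda item: abs(item[1]), reverse=True)

ranking = rank_by_difference(EXPRESSION)
for gene, difference in ranking:
    direction = "higher" if difference > 0 else "lower"
    print(f"{gene}: {abs(difference):.2f} {direction} in cancer")
```

Genes near the top of the ranking with a positive difference are the candidates for being highly expressed in cancer.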
Biomarker Identification
proteins A, …, Z highly
expressed in cancer
Biomarker Identification
• Question: Can we predict which of these tissue marker proteins are
secreted into blood circulation, so that we can obtain markers in blood?
• Through literature search, we found over proteins being secreted into
blood circulation due to various physiological conditions
• We then trained a “classifier” to identify “features” that distinguish
between proteins that can be secreted into blood and proteins that
cannot
Biomarker Identification
• We have developed a classifier to distinguish blood-secretory
proteins from other proteins
• On a test set with 52 positive and 3,629 negative examples, our
classifier achieves
– 89.6% sensitivity, 98.5% specificity and 94% AUC
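For readers new to these measures: sensitivity is the fraction of true positives that the classifier recovers, and specificity is the fraction of true negatives that it correctly rejects. The sketch below computes both from labels and predictions. The toy labels are hypothetical and are not the test set described above; AUC is not computed here.

```python
def sensitivity_specificity(true_labels, predicted_labels):
    """Return (sensitivity, specificity) for binary labels (1 = positive)."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == 1 and predicted == 1:
            tp += 1  # positive correctly found
        elif truth == 1:
            fn += 1  # positive missed
        elif predicted == 0:
            tn += 1  # negative correctly rejected
        else:
            fp += 1  # negative wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical toy labels (not the 52 / 3,629 test set from the slide).
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(truth, predicted)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

Reporting both numbers matters when negatives far outnumber positives, as in the test set above, because a classifier could otherwise look accurate while missing most positives.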
Biomarker Identification
• The predicted marker proteins can be validated using
mass spectrometry experiments
Biomarker Identification
• If successful, it will be possible to test for cancer using a
test kit, much like a pregnancy test kit
Take-Home Message
• Biological science is undergoing rapid transformation because of high-throughput measurement technologies and bioinformatics
• As an emerging field, bioinformatics is about using computational
techniques to solve biological problems, and represents the future of
biology
THANK YOU!