What is bioinformatics?

Molecular Exercise Physiology
Bioinformatics
Presentation 5
Henning Wackerhage
Learning outcomes
At the end of this presentation, you should be able to:
• Find information online on any DNA sequence, gene, RNA or protein
in various species. This information includes the position
of genes in the genome, the function of the proteins and their
role in disease.
• Carry out BLAST searches to identify homologous sequences.
• Explain the cause of genetic variability.
• Explain how a microarray experiment is carried out.
This presentation will be supported by a computer
practical in bioinformatics. Please revise this presentation
carefully before the practical; otherwise you will
struggle.
Bioinformatics
Part 1
Why study bioinformatics?
Introduction
The human and many other genomes have now been sequenced
and this data has been deposited online. In addition, there is a
wealth of information on genes and their products on networked
computers. Numerous programmes that allow you to analyse this
data also exist. Most of this data is freely accessible online via
user-friendly computer programmes. It is easy to download the
DNA sequence for any gene that might respond to exercise or to
find reliable information on a protein that is involved in the
response to exercise.
In this presentation, you will learn how to find, use and analyse this
information.
You will mainly learn by doing, and you will sometimes need to be
stubborn and click numerous buttons by trial and error
to finally get the information you want. It is not rocket
science but it will require stamina and patience at times!
What is bioinformatics?
NIH bioinformatics definition: Research, development, or
application of computational tools and approaches for expanding the
use of biological, medical, behavioral or health data, including those to
acquire, store, organize, archive, analyse, or visualise such data.
Why study bioinformatics?
Why should a sports biomedicist study bioinformatics?
The DNA encodes all the information necessary for letting cells
develop into a functioning organism. The DNA thus also encodes all
the organs involved in exercise and their adaptive response to
exercise. In addition, the differences in the DNA between two
individuals encode the differences in the structure and function of the
two organisms; these include differences in muscle
size, adaptation to training or the motor regions of the nervous
system.
Therefore, bioinformatics will help us, among other things, to:
a) Identify differences in the DNA sequence (i.e. single nucleotide
polymorphisms) between individuals that correlate with athletic
talent or the extent of adaptation to exercise;
b) Discover the regulatory mechanisms that mediate the adaptation
to exercise;
c) Interpret the results of microarray experiments where the
expression of thousands of genes is measured in response to
exercise.
Bioinformatics
Part 2
Genome viewing
Genomes online
The genomes for many prokaryote, eukaryote, plant, invertebrate
and vertebrate model species have now been sequenced. The DNA
sequences of these genomes have been posted online.
However, these websites contain much more than just the “naked”
DNA sequence, which is of limited use. With the help of special
computer algorithms, genes (exons, introns) have been identified
by using available research information and by de novo
prediction.
Identified genes have been linked to various other sites including
those that list information on the same gene in other species, the
gene product (protein databases), PubMed, disease databases etc.
Genome browsers are therefore powerful tools not only for the
specialist but also for the essay-writing student.
The following website shows an incomplete tree of sequenced
genomes, and the slide thereafter shows the information available on
genome browsers.
Genomes online (incomplete)
http://www.ncbi.nlm.nih.gov/mapview
Genomes online
Resources linked from the genome browsers:
• Online Mendelian Inheritance in Man (OMIM)
• PubMed: reference search
• Full-text electronic journals
• Nucleotide sequences
• 3D structures
• Maps & genomes
• Protein sequences (e.g. MQKLQLCVY …)
• Taxonomy
Genomes online
By far the largest project was the Human Genome Project, the
sequencing of our own DNA. Some findings are surprising:
• Human genome size: about 3,200 Mb (megabases).
• Gene numbers: human: 31,000, yeast: 6000,
fly: 13,000, worm: 18,000, plant: 26,000.
• Only 1.1 to 1.4 % of the human sequence
encodes protein. The rest is non-coding.
• 28 % of the sequence is transcribed into RNA (5
% of this is translated into proteins).
• Only 94 of 1,278 protein families are
specific to vertebrates.
• Why do we differ? Humans differ from one another
by about one base pair per thousand: single
nucleotide polymorphisms (SNPs).
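To get a feeling for these numbers, the percentages above can be turned into base pairs. The following is only a rough sketch that uses the figures quoted on this slide; the variable names are made up for illustration.

```python
# Figures quoted on the slide.
GENOME_SIZE_BP = 3_200 * 1_000_000  # about 3,200 Mb

# 1.1 to 1.4 % of the sequence encodes protein
# (integer arithmetic avoids floating-point rounding noise).
coding_low_bp = GENOME_SIZE_BP * 11 // 1000
coding_high_bp = GENOME_SIZE_BP * 14 // 1000

# Two humans differ at about one base pair per thousand.
differing_bp = GENOME_SIZE_BP // 1000

print(f"Protein-coding sequence: {coding_low_bp / 1e6:.1f} to {coding_high_bp / 1e6:.1f} Mb")
print(f"Differences between two people: about {differing_bp / 1e6:.1f} million bp")
```

In other words, only a few tens of megabases encode protein, while two individuals differ at a few million positions.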
Human genome project
Landmarks
1953: Watson-Crick structure of DNA published
1975: F. Sanger, and independently A. Maxam and W. Gilbert, develop
methods for sequencing DNA
1981: Human mitochondrial DNA sequenced: 16,569 base pairs
1990: International Human Genome Project launched; target horizon
15 years
1991: J.C. Venter and colleagues identify active genes via expressed
sequence tags (ESTs)
2000: Joint announcement of complete draft sequence of human genome
2003: Completion of human genome
Major genome browsers
You can browse genome data using one of the following
browsers. We will mainly use Ensembl, the European and
user-friendly version:
www.ensembl.org
www.ncbi.nlm.nih.gov
http://genome.cse.ucsc.edu/
Task: Enter each of these websites and just click many
buttons and see what information you can obtain. We will
mainly use the Ensembl website.
Searching for gene information
OK, browsing the genome browsers and clicking on chromosomes is
pretty simple. However, you will most of the time search for a specific
gene where you do not know the genomic location. In these cases,
you will have to use a search engine and type the name of the gene or
protein in.
To do so, open the Ensembl website (www.ensembl.org) and click the
species, normally human. At the top of the page you will see “Search for
anything with”, followed by a box into which you type your
search term. Click “Lookup” and you will obtain results.
Worked example: Type in “malate dehydrogenase” and click
“lookup”. Many items will be listed starting with “9 matches in the
homo sapiens disease index”. However, you are interested in the gene.
Therefore scroll down until you see “170 matches in the Homo sapiens
Gene index”. The first entry under this heading is “Malate
dehydrogenase, cytoplasmic (EC 1.1.1.37)”. Other isoforms of this
enzyme are listed as well, so you may need to find out more about
which isoform you are interested in.
Searching for gene information
On the Ensembl human genome website, enter “troponin” into
the search box. Find the following gene among the search
results:
Troponin C, skeletal muscle
Click this and a website with numerous clickable links will
appear.
Task 1: Click “Export gene data in EMBL, GenBank or FASTA”.
Scroll down, select output format “text” and click “export”. The DNA
sequence of the gene will appear. You can now analyse the
sequence or design primers for the polymerase chain reaction
(PCR).
Task 2: Return to the “Troponin C, skeletal muscle” website.
Now click “MIM” (or OMIM). It stands for “(Online) Mendelian
Inheritance in Man”. Read the paragraph. It will inform you
about research on the gene. The text on troponin is very short
compared to other texts, e.g. those on major disease genes.
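Once you have exported a sequence in FASTA format (Task 1), you can analyse it with a few lines of code. The sketch below reads FASTA text and reports the length and GC content of each record, two values that are useful when designing PCR primers. The function names and the example record are made up for illustration; the record is not the real troponin C sequence.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


# Made-up record for illustration only; not the real troponin C sequence.
example = """>hypothetical_gene example record
ATGGCGTACC
GGTTAACC
"""

for name, seq in parse_fasta(example).items():
    print(name, len(seq), round(gc_content(seq), 2))
```

To use it on real data, paste the text exported from the genome browser in place of the example record.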
Searching for gene information
Task 3: Click “LocusLink”. The following bar will appear:
Click on each window and produce the following information:
a) Who has carried out a structural analysis of the human
troponin C gene?
b) There is an ion binding motif on the molecule. For what ion?
c) Name a gene that is a neighbour on the chromosome.
d) What is the percent homology (similarity of the DNA
sequence) between the human and rat troponin C genes?
Bioinformatics
Part 3
Genetic variability
Genetic variation
By now, you may have asked yourself the following question:
“How can they list one human genome sequence if we are all
different? Surely, our genomes will be different?”
Good question and yes, we are different. We differ because of
nature and nurture and the nature bit is due to differences in the
DNA between human beings.
Most of these differences in the DNA sequence do not occur at
random but at fixed positions, approximately every 1,300 base pairs
(bp). They are called single nucleotide polymorphisms (SNPs,
pronounced “snips”). There are roughly 2,500,000 SNPs in the
human genome.
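The two numbers above are consistent with each other, as a quick calculation shows. This is only a back-of-the-envelope check that assumes the genome size of about 3,200 Mb quoted earlier in this presentation.

```python
GENOME_SIZE_BP = 3_200_000_000  # about 3,200 Mb, as quoted earlier in the presentation
SNP_SPACING_BP = 1_300          # roughly one SNP every 1,300 bp

expected_snps = GENOME_SIZE_BP // SNP_SPACING_BP
print(f"Expected number of SNPs: about {expected_snps:,}")
```

The result is a little under 2.5 million, in line with the rounded figure given above.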
Variation in the human
species: mainly the result of
SNPs.
Genetic variation
Worked example: I have used Ensembl and have picked the
following SNP. During sequencing (each region is sequenced
several times), the investigators noted that one position
is sometimes read as an adenine (A) and sometimes as a thymine (T).
The ambiguity code W was used to indicate
this in the final sequence:
Alleles:
A|T (ambiguity code: W)
Sequence Region:
CACAACTGCTTGGAWAAAACAGGATAG
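Ambiguity codes such as W follow the standard IUPAC nucleotide nomenclature. The sketch below expands an ambiguity-coded sequence into the individual alleles it represents; the function name is made up for illustration.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes (standard nomenclature).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_ambiguity(sequence):
    """Return every unambiguous sequence represented by an IUPAC-coded sequence."""
    choices = [IUPAC[base] for base in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


# The sequence region from the worked example; W stands for A or T.
for allele in expand_ambiguity("CACAACTGCTTGGAWAAAACAGGATAG"):
    print(allele)
```

For the sequence region above this prints two sequences, one with A and one with T at the polymorphic position.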
SNPs are not the only source of genetic variation. Here is an
example of an insertion/deletion polymorphism, where some bases are
missing in the deletion allele:
Deletion:  TCAAGGTATTCTTCA------AAAAGGTCCCAACCC
Insertion: TCAAGGTATTCTTCAGATTCTAAAAGGTCCCAACCC
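The bases that distinguish the two alleles can also be found by computer. The following sketch assumes that the two sequences differ by a single contiguous insertion or deletion and locates it by comparing the shared start of both sequences; the function name is made up for illustration.

```python
def find_indel(short_seq, long_seq):
    """Locate the extra bases in long_seq, assuming a single contiguous indel.

    Returns (position, inserted_bases), where position is the 0-based index
    in long_seq at which the extra bases start.
    """
    if len(long_seq) < len(short_seq):
        raise ValueError("long_seq must be at least as long as short_seq")
    extra = len(long_seq) - len(short_seq)

    # Length of the prefix shared by both sequences.
    prefix = 0
    while prefix < len(short_seq) and short_seq[prefix] == long_seq[prefix]:
        prefix += 1

    # The rest of short_seq must match the end of long_seq.
    if short_seq[prefix:] != long_seq[prefix + extra:]:
        raise ValueError("sequences differ by more than one contiguous indel")
    return prefix, long_seq[prefix:prefix + extra]


deletion_allele = "TCAAGGTATTCTTCA" + "AAAAGGTCCCAACCC"
insertion_allele = "TCAAGGTATTCTTCAGATTCTAAAAGGTCCCAACCC"
print(find_indel(deletion_allele, insertion_allele))
```

For the example above, the extra bases are GATTCT, starting after the first 15 bases.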
Genetic variation
Do all SNPs lead to a change in phenotype? No! Remember that
only <2 % of human DNA encodes proteins and that a lot of DNA
is non-coding or intergenic DNA. A SNP or deletion in a DNA
sequence with “no” function will probably not have a noticeable
effect.
Which of the following SNPs (1-5) are likely to cause a change in
the expression or structure of the protein encoded by the gene?
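[Figure: schematic of a gene on the DNA showing, from left to right,
enhancer, promoter, start, exon, intron, exon and termination, with
SNPs 1 to 5 marked at different positions.]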
Genetic variation
Answer:
SNP1. This SNP could affect the binding of transcription factors to
the enhancer and thus the expression of the gene.
SNP2. This SNP lies in a non-functional region and will probably
have no effect. It could affect histone binding, though!
SNP3. This SNP could affect the binding of the transcriptional
machinery (esp. RNA polymerase II) to the promoter.
SNP4. This SNP is in an exon and lies within a triplet that codes for
an amino acid. However, it will only have an effect if the changed
triplet encodes a different amino acid (e.g. AGA and AGG both encode
arginine, so a change between them has no effect).
SNP5. This SNP lies in an intron, which is spliced out, and therefore
it will usually have no effect (unless it disrupts a splice site).
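The reasoning for SNP4 can be checked with the standard genetic code. The sketch below builds the codon table and reports whether a change from one triplet to another alters the amino acid; the function name is made up for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(reference_codon, variant_codon):
    """Say whether a codon change keeps or changes the encoded amino acid."""
    ref_aa = CODON_TABLE[reference_codon.upper()]
    var_aa = CODON_TABLE[variant_codon.upper()]
    if ref_aa == var_aa:
        return f"synonymous ({ref_aa} unchanged)"
    return f"non-synonymous ({ref_aa} -> {var_aa})"


print(classify_coding_snp("AGA", "AGG"))  # both triplets encode arginine (R)
print(classify_coding_snp("AGA", "AGC"))  # arginine (R) becomes serine (S)
```

Amino acids are given in the one-letter code, so R stands for arginine and S for serine.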
Find a SNP!
Worked example: Find SNPs that lie in the exons of the
myostatin gene, whose protein product is a potent muscle growth
inhibitor. First search for “myostatin”. There is another
abbreviation for myostatin, which is GDF-8. Click “view gene in
genomic location”.
Lower on the page you will find a features menu. Open it, tick the
SNPs box and close the menu again. The following window opens and
you see the coding, untranslated (UTR) and intronic SNPs. You can
additionally open “human proteins” or “EMBL mRNAs” to see where the
myostatin gene lies. There are two SNPs in the myostatin exons.
Find a SNP!
Worked example: If you click on a SNP, a new window appears.
You will find the SNP in the genomic sequence GTAARGGCC, where
R stands for an A|G polymorphism. You also find the following
figure:
[Figure: myostatin (GDF8) gene with its 3 exons shown in dark red and
the coding SNPs with R (A|G) ambiguity marked.]
Find a SNP!
Task: How many SNPs do you find in the exons and introns of
the human histidine decarboxylase (EC 4.1.1.22) gene?
By the way, what does the EC number stand for?
How to detect genetic variation?
So far, studies investigating the relation between genetic variation
and e.g. disease have focussed on dramatic mutations such as
frameshift and deletion/insertion mutations rather than the
more subtle SNPs. Larger mutations are easier to detect and the
effects are usually more dramatic.
How to detect genetic variation?
Method: DNA can be obtained from nucleated
blood cells. The DNA region of interest is then
amplified using the polymerase chain
reaction with so-called primers that will only
amplify a specific DNA sequence.
Here, a DNA fragment either with a deletion
(D) or insertion (I) mutation of the
Angiotensin-converting enzyme (ACE) gene
has been amplified and electrophoresed.
Angiotensin II is a known inducer of cardiac
hypertrophy.
Because we have two copies of each gene,
the combinations DD, ID or II are possible. In
this study, DD patients had a larger left
ventricle (heart) than ID and II patients.
(figure from Lechin et al. 1995)
Genetic variation and performance
Figure. Montgomery et al. (1998) determined the genotype of the
angiotensin-converting enzyme gene, where an insertion/deletion
polymorphism exists. The left figure shows the PCR results for the three
genotypes DD, ID and II (taken from Lechin et al. 1995). The right
figure shows the relation between the genotype and the increase in
repetitive elbow flexion in response to a specific 10-week training
programme among British army recruits. The data suggest that the
DD genotype is associated with low, the ID genotype with medium and
the II genotype with high trainability for this specific task.
Actinin genotype and performance
Actinin (ACTN) is an actin-binding protein, and the two isoforms ACTN2
and ACTN3 are found in skeletal muscle. Yang et al. (2003)
reported that the ACTN3-RR and ACTN3-RX genotypes are associated with
power athletes (these athletes have more ACTN3).
Bioinformatics
Part 4
Homology searches
Homologies
Worked example: You have sequenced the following human
DNA fragment and you want to know more about it:
AAAACATCTATCTTGCTGTGTTTGGACAGGCCAGCCCCTGAAACATCTTGGGCAATGGAGGGTTAACTT
CTCAAAGTTTAATAGGCAAGACCAGCAACCATGCAACAAGGTAAATTGTCCTCACGAGAACTCCAAAGA
CTATTTTTCTCTCTCTTTTTTTGAGGCAGGGTCTCGCTATGTTACCCAGGCTGCTCTCGAACTCTTGGG
CTCAAGCAATCCCCCCATCTTAACCTCCCCAGCAGCTGGGACTACAGCCACGCGCCACTGCACCCAGCT
GACTTTTCCTTCTAAGCATCTTTGGCTGGGCGTGGTGGCTCATGCCTGTAATCCCTGCACTTTGGGAGG
CCAAGGTGGGTAGATCACTGGAGGTCAGGAGTTCTAGACCAGCCTGGCCAACATGGTGAAACCTCATCT
CTACTAAAAATACAAAAAAATTAGCTGGGCATGGTGGCAGGTGCCTGTAATCCTAGCTACTCGGGAGGC
TGAAGCAGGAGAATTGCTTGAACCCAGGAGGTAGAGGTTGCAGTGACCCAAGATTGTGCCACTGCACTC
CAGCCTGGGTACACAGCGAGTCTGTCTAAAAAAGAAAAAAAAAAAAGGAAGAGAGAGCATCTTTATCTT
CATTTTCTAACCTTTAAGTGTTACTTTCTCCCAGTAACATTTTGCCCAGAAAGAGGTGATGAATATAGA
TTTAAGAATAAGATTTTCCCCATGTTGCTGCCTTTCCAGAACAAGTGAGTTCATTCTCATTTGTCTTTC
TTCAGAAATCTTTTATCTGTCTTTCTCCCATTAGCTGGAATGGGTGCTCCATGAGAATAAAGACTTGGG
TTCCATTCTTCCTATTGTCCCCAGAGCCTACATACTGGCTGGCATTGAGTAGCAATTGAACAGTTTTCT
GAATGAATGAATGAATGAATGCTCAAATAAGCACATGAATTAATTATCACTTTCCTTTGAATCTCTCCA
TTCTTCTTCCTCACCCAATGGGGCTCGATCCTTATACACAGAAGATACTCTATAAATGATGATTCAATG
AATGCCAAGCCCTGTTCTATGCACTGAAGACCAAAAGAAATAAAAGACATCATTCCTGCTCTGTAAGAA
Homologies
Worked example: To do so, you have to carry out a Blast search.
Enter:
http://www.ensembl.org/Homo_sapiens/blastview
Paste the sequence into the large box, select “homo sapiens” as the
database to search against and “blastn” for a nucleotide search.
“Blastx” searches a DNA sequence against protein (amino acid) sequences,
and “blastp” searches protein against protein.
Homologies
Worked example: After you have started the search, click
retrieve and the programme will display a “view” button. Click the
“view” button and the programme will display a list of matches
with a score and a % identity. There is one match with 100%
identity on chromosome 10 (red arrow on the chromosome).
Clicking “[A]” yields a graphical display of the homology:
AAACATCTATCTTGCTGTGTTTGGACAGGCCAGCCCCTGAAACATCTTGGGCAAT
|||||||||||||||||||||||||||||||||||||||||||||||||||||||
AAACATCTATCTTGCTGTGTTTGGACAGGCCAGCCCCTGAAACATCTTGGGCAAT
If there is not 100% homology, then the alignment looks as
follows:
CTCATGCCTGTAATCCCTGCACTTTGGGAGGCCAAGGTGGGTAGATCACTGGAGG
||||||||||||||||xx|x|||||||||||||||||||x|x|||||||xx||||
CTCATGCCTGTAATCCTAGTACTTTGGGAGGCCAAGGTGAGCAGATCACCTGAGG
Each “x” indicates a difference between the two sequences.
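The match line and the % identity of such an alignment are easy to reproduce. The sketch below assumes two sequences that are already aligned and of equal length (no gaps); the function names are made up for illustration and this is not how the BLAST programme itself works internally.

```python
def match_line(seq_a, seq_b):
    """Return '|' for identical positions and 'x' for differences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    return "".join("|" if a == b else "x" for a, b in zip(seq_a, seq_b))


def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions at which both sequences agree."""
    line = match_line(seq_a, seq_b)
    return 100.0 * line.count("|") / len(line)


# The two aligned sequences from the example above.
query = "CTCATGCCTGTAATCCCTGCACTTTGGGAGGCCAAGGTGGGTAGATCACTGGAGG"
hit = "CTCATGCCTGTAATCCTAGTACTTTGGGAGGCCAAGGTGAGCAGATCACCTGAGG"

print(query)
print(match_line(query, hit))
print(hit)
print(f"{percent_identity(query, hit):.1f} % identity")
```

The printed match line corresponds to the one in the example alignment, and the % identity is the share of positions marked “|”.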
Homologies
Task: I have selected a mouse DNA sequence and your task is to
see whether there is a homologous human sequence.
GTGTCTTGCACAGTAATAGACCGCAGAGTCCTCAGATGTCAGGCTGCTGAGCTGCATGTA
GGCTGTGCTGGAGGATGTGTCTACAGTCAATGTGGCCTTGCCCTTGAACTTTTGATTGTA
GTTAGTATAGCTATCAGAAGGATCAATCTCTCCGATCCACTCAAGGCCCTGTCCAGGCCT
CTGTTTTACCCACTGCATCCAGTAGCTGGTGAAGGTGTAGCCAGAAGCCTTGCAGGACAG
CTTCACTGAAGCCCCAGGCTTCACAAGCTCAGCCCCAGGCTGCTGCAGTTGGACCTGAGA
GTGGACACCTGTGGAGAGAAAGGCAGAGTGGATGTCATTGTCACTCAAGTGTATGGCCAG
ACATCGAGCCTGCTACTGTGAGCCCCTTACCTGTAGCTGTTGCTACCAAGAAGAGGATGA
TACAGCTCCATCCCATGGCGAGGTCCTGTGTGCTCAGTAACTGTAAAGAGAACAGTGATC
TCATGTTTTTCTGTGTGTGGTATAGACAACCCTATATTTACCATGTAGACTCACAGGATT
TGCATATTCATGAGCAGGATACATATTAGATGAGCACCTACTCCTGCAGGAGAAGAAGAG
ACACCTGGGTCAGGAATCAGGATGCTGAAACCCAAGTCATAGTCTTGTCTGAGGTAATTC
ATCCCATACCTCATCCCTGAACCTTGTGTTGAGGCTATGGATGTAACATTATAGCCTGTG
CACTAAAAAGATTTGCATCCTGAGACAGTGGCCCCACTTGTGACACAGTTGACAGATGGA
Bioinformatics
Part 5
Microarrays
Microarrays
Microarrays or biochips are a technique increasingly used by leading
research groups in exercise physiology. Microarrays are used to
compare the mRNA levels in two samples, e.g. control (no exercise)
versus exercise. Importantly, this comparison is done for nearly all
mRNAs that can be found in a tissue (e.g. all genes expressed in
skeletal muscle).
Microarrays
The method works by printing thousands of DNA dots that code for the
genes of the organism onto a slide. The experimenter then converts the
mRNA into DNA (by reverse transcription) that is labelled with a
fluorescent marker, usually green for the control sample and red for the
experimental sample. The labelled control and exercise samples are
allowed to hybridise with (stick to) the complementary DNA that is
printed onto the slide. If a dot appears
green, then there was more control mRNA in the sample (mRNA goes
down during exercise). If a dot is red, then the mRNA went up in
response to exercise. Yellow dots mean that the amount of mRNA was
roughly equal in the control and exercise sample; i.e. the gene’s
expression is not affected by exercise. No fluorescence indicates that
this gene is not expressed in muscle (e.g. a brain gene). The following
slide schematically shows what has just been said.
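The colour rules just described can be written down as a small decision procedure. The sketch below is only an illustration: the function name, the background intensity and the two-fold threshold are assumptions chosen for the example and are not values from a real microarray analysis package.

```python
import math


def classify_spot(green, red, background=100.0, fold_threshold=2.0):
    """Interpret one microarray dot.

    green: fluorescence of the control sample
    red:   fluorescence of the exercise sample
    background and fold_threshold are hypothetical cut-offs.
    """
    if green <= background and red <= background:
        return "no fluorescence: gene not expressed"
    # A floor of 1.0 avoids dividing by zero or taking the log of zero.
    log_ratio = math.log2(max(red, 1.0) / max(green, 1.0))
    limit = math.log2(fold_threshold)
    if log_ratio >= limit:
        return "red: mRNA went up with exercise"
    if log_ratio <= -limit:
        return "green: mRNA went down with exercise"
    return "yellow: mRNA roughly equal"


# Made-up intensities (control, exercise) for four dots.
for green, red in [(50, 60), (1000, 4000), (4000, 1000), (1000, 1200)]:
    print(green, red, "->", classify_spot(green, red))
```

Working with the log2 ratio treats a two-fold increase and a two-fold decrease symmetrically.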
Microarrays
Workflow shown on the slide:
• Normal mRNA and disease mRNA
• RT/PCR
• Label with fluorescent dye (labelled DNA from mRNA)
• Combine equal amounts
• Hybridise probe to microarray
• Scan
• Informatics: image processing, DBMS, WWW, bioinformatics, data
mining and visualization
Microarray example
Types of dots shown in the figure:
• No mRNA
• mRNA only expressed in control
• mRNA only expressed in response to disease/exercise
• Expression in control and disease/exercise
Microarray analysis
Microarray experiments usually show the differential expression of
hundreds or thousands of genes.
Task: Assume the following two genes are expressed at higher levels
in response to 1 h of cycling exercise.
HSPD13982_i_at: Cathepsin D (lysosomal aspartyl protease)
NM_006457_r_at: LIM protein (similar to rat protein kinase C-binding enigma)
a) What is the function of these genes?
b) Is there any link to exercise (e.g. changes in similar proteins in
response to exercise)?
The End
