The CMBI: Bioinformatics

Introduction to genomes
Content
• The human genome
• CNVs
• SNPs
• Alternative splicing
• Genome projects
Celia van Gelder
CMBI
UMC Radboud
June 2009
The human genome
• Genome: the entire sequence of DNA in a cell
• 3 billion base pairs (3 Gb)
• 22 chromosome pairs + X and Y chromosomes
• Chromosome length varies from ~50Mb to ~250Mb
• About 22000 protein-coding genes
• Human genome is 99.9% identical among individuals
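The figures above can be combined into some rough back-of-the-envelope numbers. The sketch below is only arithmetic on the approximate values quoted on this slide; the function names are made up for illustration.

```python
# Rough arithmetic on the approximate figures quoted on the slide.
GENOME_SIZE_BP = 3_000_000_000   # ~3 Gb
IDENTITY = 0.999                 # 99.9% identical among individuals
PROTEIN_CODING_GENES = 22_000    # about 22000 protein-coding genes


def expected_differences(genome_size_bp: int, identity: float) -> int:
    """Approximate number of positions that differ between two genomes."""
    return round(genome_size_bp * (1 - identity))


def mean_gene_spacing(genome_size_bp: int, n_genes: int) -> float:
    """Average number of base pairs per protein-coding gene."""
    return genome_size_bp / n_genes


if __name__ == "__main__":
    print(expected_differences(GENOME_SIZE_BP, IDENTITY))                # 3000000
    print(round(mean_gene_spacing(GENOME_SIZE_BP, PROTEIN_CODING_GENES)))  # 136364
```

So 0.1% difference still corresponds to roughly 3 million positions, and on average there is about one protein-coding gene per ~136 kb.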
Eukaryotic Genomes: more than collections of genes
• Protein coding genes
• RNA genes (rRNA, snRNA, snoRNA, miRNA, tRNA)
• Structural DNA (centromeres, telomeres)
• Regulation-related sequences (promoters, enhancers, silencers,
insulators)
• Parasite sequences (transposons)
• Pseudogenes (non-functional gene-like sequences)
• Simple sequence repeats
Annotating the genome
• Genome annotation is the process of attaching biological
information to sequences.
It consists of two main steps:
1. identifying elements on the genome, a process called Gene Finding, and
2. attaching biological information to these elements.
• Automatic annotation tools try to perform all this by computer
analysis, as opposed to manual annotation which involves human
expertise. Ideally, these approaches co-exist and complement each
other in the same annotation pipeline.
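To make the two steps concrete, here is a deliberately naive sketch: it scans the forward strand for ATG-to-stop open reading frames (a toy stand-in for gene finding) and then attaches a note to each hit (a toy stand-in for annotation). This is an assumption made for illustration only; real eukaryotic gene finders have to handle introns, both strands, and statistical evidence, and the function names here are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Step 1 (toy gene finding): return (start, end) of ATG...stop ORFs.

    Only the forward strand is scanned, in all three reading frames.
    Coordinates are 0-based, end-exclusive; min_codons counts the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)


def annotate(orfs, note: str = "putative ORF"):
    """Step 2 (toy annotation): attach a descriptive note to each element."""
    return [{"start": s, "end": e, "note": note} for s, e in orfs]


if __name__ == "__main__":
    print(annotate(find_orfs("CCATGAAATAGGG")))
```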
The human genome (continued)
• Only 1.2% codes for proteins, 3.5-5% is under selection
• Long introns, short exons
• Large spaces between genes
• More than half consists of repetitive DNA
From: Molecular Biology of the Cell
(4th edition) (Alberts et al., 2002)
Eukaryotic Genomes: High fraction non-coding DNA
From: Mattick, NRG, 2004
Blue: Prokaryotes
Black: Unicellular eukaryotes
Other colors: Multicellular eukaryotes (red = vertebrates)
Variation along genome sequence
• Nucleotide usage varies along
chromosomes
– Protein coding regions tend to have
high GC levels
• Genes are not equally distributed
across the chromosomes
– Housekeeping genes are generally in gene-dense areas
– Gene-poor areas tend to have many
tissue specific genes
From: Ensembl
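The variation in nucleotide usage along a chromosome can be explored with a sliding-window GC calculation. The sketch below is a minimal version under simple assumptions (plain string input, fixed window and step); the function names are illustrative and it is not how Ensembl computes its tracks.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int, step: int):
    """Return (window_start, gc_fraction) for each full window along seq."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: an AT-rich stretch followed by a GC-rich stretch.
    for start, gc in gc_windows("AAAATTTAGGCCGCGC", window=8, step=4):
        print(start, gc)
```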
Chromosome organisation (1)
From: Lodish (4th edition)
Chromosome organisation (2)
• DNA packed in chromatin
• Non-active genes (genes that are OFF) often in densely packed chromatin (30-nm fiber)
• Active genes (genes that are ON) in less dense chromatin (beads-on-a-string)
• Gene regulation by changing chromatin density, methylation/acetylation of the histones
From: Lodish (4th edition)
Today’s focus
1. Copy number variations (CNV)
2. Single Nucleotide Polymorphisms (SNPs)
3. Alternative transcripts
Copy Number Variation
• People do not only vary at the nucleotide level (SNPs)
• Copy Number Variations (CNVs):
duplications and deletions of pieces of chromosome
• When there are genes in the CNV areas, this can lead to variations
in the number of gene copies between individuals
• CNVs may either be inherited or caused by de novo mutation
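As a small illustration of what "variation in the number of gene copies" means, the sketch below labels per-individual copy counts relative to an assumed reference of two copies (one per chromosome of an autosomal pair). The individuals and counts are hypothetical toy values, not data.

```python
REFERENCE_COPIES = 2  # assumption: two copies of an autosomal gene per diploid genome


def classify_copy_number(copies: int, reference: int = REFERENCE_COPIES) -> str:
    """Label a gene copy count relative to the reference copy number."""
    if copies < reference:
        return "deletion"
    if copies > reference:
        return "duplication"
    return "reference"


if __name__ == "__main__":
    # Hypothetical copy counts of one gene in three individuals.
    toy_counts = {"individual_A": 2, "individual_B": 1, "individual_C": 4}
    for name, copies in toy_counts.items():
        print(name, copies, classify_copy_number(copies))
```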
Why study CNVs?
• CNVs are common in cancer and other diseases.
• CNVs are also common in normal individuals and contribute to our
uniqueness. These changes can also influence the susceptibility to
disease.
• Since CNVs often encompass genes, they can have important roles
both in characterizing human disease and discovering drug
response targets.
• Understanding the mechanisms of CNV formation may also help us
better understand human genome evolution.
CNV & disease, examples
CNVs have been implicated in
• Cancer: higher EGFR copy number in non-small cell lung cancer
• Low copy number of FCGR3B can increase susceptibility to SLE &
other autoimmune disorders
• Autism
• Schizophrenia (dept. human genetics)
• Mental retardation (dept. human genetics)
Single Nucleotide Polymorphisms (SNPs)
[Figure: Single Nucleotide Polymorphism (SNP)]
• SNPs are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered.
• Similar to mutations, but the variants coexist in the population and generally have little effect
• Used as genetic markers (e.g. a genetic disease can be associated with a SNP)
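A single-nucleotide difference is easy to spot once two sequences are aligned. The sketch below assumes two already-aligned sequences of equal length with no gaps; the function name is illustrative. Note that a difference between two sequences is only a candidate: whether it counts as a SNP depends on how common it is in the population.

```python
def single_base_differences(ref: str, alt: str):
    """Return (position, ref_base, alt_base) for each mismatching position.

    Assumes both sequences are aligned, gap-free and of equal length.
    Positions are 0-based.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must have the same length")
    return [
        (i, r, a)
        for i, (r, a) in enumerate(zip(ref.upper(), alt.upper()))
        if r != a
    ]


if __name__ == "__main__":
    print(single_base_differences("ATGCCTA", "ATGTCTA"))  # [(3, 'C', 'T')]
```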
SNP fact sheet
• For a variation to be considered a SNP, it must occur in at least 1%
of the population.
• SNPs, which make up about 90% of all human genetic variation,
occur every 100 to 300 bases along the 3-billion-base human
genome.
• Two of every three SNPs involve the replacement of cytosine (C)
with thymine (T).
• SNPs can occur in coding (gene) and non coding regions of the
genome.
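Two of these facts translate directly into small calculations: the spacing of 100 to 300 bases implies a rough range for the total number of SNPs, and the 1% rule is a simple frequency threshold. The sketch below only restates those numbers; the function names are illustrative.

```python
GENOME_SIZE_BP = 3_000_000_000  # 3-billion-base human genome


def snp_count_range(genome_size_bp: int, min_spacing: int = 100, max_spacing: int = 300):
    """Rough (low, high) SNP count implied by one SNP every 100 to 300 bases."""
    return genome_size_bp // max_spacing, genome_size_bp // min_spacing


def is_snp(population_frequency: float, threshold: float = 0.01) -> bool:
    """A variation counts as a SNP if it occurs in at least 1% of the population."""
    return population_frequency >= threshold


if __name__ == "__main__":
    print(snp_count_range(GENOME_SIZE_BP))  # (10000000, 30000000)
    print(is_snp(0.05), is_snp(0.002))      # True False
```

In other words, the spacing quoted above corresponds to roughly 10 to 30 million SNP positions.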
SNPs & medicine
• Although more than 99% of human DNA sequences are the same,
variations in DNA sequence can have a major impact on how
humans respond to:
– disease;
– environmental factors such as bacteria, viruses, toxins, and chemicals;
– and drugs and other therapies.
• This makes SNPs valuable for biomedical research and for
developing pharmaceutical products or medical diagnostics.
• SNPs are also evolutionarily stable—not changing much from
generation to generation—making them easier to follow in
population studies.
SNP & disease, example
Alzheimer's disease & apolipoprotein E
• ApoE contains two SNPs that result in three possible alleles for this
gene: E2, E3, and E4.
• Each allele differs by one DNA base, and the protein product of
each allele differs by one amino acid.
• Each individual inherits one maternal copy of ApoE and one
paternal copy of ApoE.
• Research has shown that a person who inherits at least one E4
allele will have a greater chance of developing Alzheimer's disease.
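Because each person carries one maternal and one paternal copy, three alleles give six possible genotypes. The sketch below simply enumerates them and marks which contain at least one E4 allele; it does not model disease risk, and the function names are illustrative.

```python
from itertools import combinations_with_replacement

APOE_ALLELES = ("E2", "E3", "E4")


def possible_genotypes(alleles=APOE_ALLELES):
    """All unordered pairs of alleles (one maternal copy, one paternal copy)."""
    return list(combinations_with_replacement(alleles, 2))


def carries_e4(genotype) -> bool:
    """True if the genotype contains at least one E4 allele."""
    return "E4" in genotype


if __name__ == "__main__":
    for genotype in possible_genotypes():
        print("/".join(genotype), "E4 carrier" if carries_e4(genotype) else "")
```

Of the six genotypes, three (E2/E4, E3/E4, E4/E4) include at least one E4 allele.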
HapMap
• The HapMap Project is a multi-country effort to identify and
catalog genetic similarities and differences in human beings.
• Using HapMap, researchers will be able to find genes that affect
health, disease, and individual responses to medications and
environmental factors.
• HapMap is a collaboration among scientists and funding agencies
from Japan, the United Kingdom, Canada, China, Nigeria, and the
United States
• All of the information generated will be
released into the public domain.
• www.hapmap.org
Alternative splicing
Alternative splicing (2)
~15% of the mutations that cause genetic diseases affect pre-mRNA splicing
Genome projects, a bit of history
http://www.genomesonline.org/
Sequenced genomes
• 1995 Haemophilus influenzae
• 1996 Yeast
• 1998 C. elegans
• 1999 Fruit fly
• 2000 Arabidopsis
• 2001 Human (draft)
• 2002 Mouse
• 2002 Rice
• 2004 Human (“finished”)
• 2006 Sea urchin
• 2007 Grapevine
• 2008 Platypus (draft)
• 2009 Cow
Genome sizes shown: 1.8 Mb, 12 Mb, 100 Mb, 125 Mb, 115 Mb, 2.6 Gb, 3 Gb
Some genome sizes
Organism: genome size (base pairs)
• Virus, Phage Φ-X174: 5387 (first sequenced genome)
• Virus, Phage λ: 5×10^4
• Bacterium, Escherichia coli: 4×10^6
• Plant, Fritillaria assyriaca: 13×10^10 (largest known genome)
• Fungus, Saccharomyces cerevisiae: 2×10^7
• Nematode, Caenorhabditis elegans: 8×10^7
• Insect, Drosophila melanogaster: 2×10^8
• Mammal, Homo sapiens: 3×10^9
Genome browsers can be used to examine:
– Genomic sequence conservation
– Duplications and deletions of pieces of chromosome (Copy Number Variations, CNVs)
– Single Nucleotide Polymorphisms (SNPs)
– Alternative splicing
– And much more…
LET’S GO BROWSE GENOMES!
Alternative Transcripts
Source: Wikipedia
(http://www.wikipedia.org/)