The CMBI: Bioinformatics

Introduction to genomes & genome browsers
Content
• Introduction
• The human genome
• Human genetic variation
– SNPs
– CNVs
– Alternative splicing
• Browsing the human genome
Celia van Gelder
CMBI
UMC Radboud
December 2011
Exponential Growth in Genomic Sequence Data
[Figure: number of completed genomes over time. Milestones: first 2 bacterial genomes complete; first eukaryote complete (yeast); first metazoan complete (roundworm, C. elegans); currently 1000+ completed genomes. © Pevzner 2011]
The cow genome
• Houston Chronicle: Houston scientists milk cow genome for its secrets
• Weekly Times Now: Bovine genome to revolutionise food production
• National Geographic: Cow Genome Decoded -- Cheaper Beef for Everybody?
• BBC News: Cow genome 'to transform farming'
The human genome
• Genome: the entire sequence of DNA in a cell
• 3 billion basepairs (3Gb)
• 22 chromosome pairs + X and Y chromosomes
• Chromosome length varies from ~50Mb to ~250Mb
• About 22000 protein-coding genes
• Human genome is 99.9% identical among individuals
Eukaryotic Genomes: more than collections of genes
• Genes & regulatory sequences make up 5% of the genome
– Protein coding genes
– RNA genes (rRNA, snRNA, snoRNA, miRNA, tRNA)
– Structural DNA (centromeres, telomeres)
– Regulation-related sequences (promoters, enhancers, silencers, insulators)
– Parasite sequences (transposons)
– Pseudogenes (non-functional gene-like sequences)
– Simple sequence repeats
The human genome (continued)
• Only 1.2% codes for proteins
• Long introns, short exons
• Large spaces between genes
• More than half consists of repetitive DNA
– Alu repeat: ~300 bp, > 1 million copies
From: Molecular Biology of the Cell (4th edition) (Alberts et al., 2002)
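As a rough worked example of the numbers on this slide, the sketch below estimates how much sequence the Alu repeats and the protein-coding regions occupy. It assumes the ~3 Gb genome size quoted earlier and takes "more than a million copies" as exactly one million, so the Alu figure is a lower bound. The function and constant names are illustrative.

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3 Gb, as quoted for the human genome
ALU_LENGTH_BP = 300             # approximate length of one Alu repeat
ALU_COPIES = 1_000_000          # assumption: lower bound for "> 1 million copies"
CODING_FRACTION = 0.012         # 1.2% of the genome codes for proteins


def alu_fraction(genome_size_bp: int, alu_length_bp: int, copies: int) -> float:
    """Fraction of the genome covered by Alu repeats."""
    return (alu_length_bp * copies) / genome_size_bp


def coding_bases(genome_size_bp: int, coding_fraction: float) -> int:
    """Approximate number of protein-coding base pairs."""
    return round(genome_size_bp * coding_fraction)


if __name__ == "__main__":
    print(f"Alu repeats: at least {alu_fraction(GENOME_SIZE_BP, ALU_LENGTH_BP, ALU_COPIES):.0%} of the genome")
    print(f"Protein-coding: about {coding_bases(GENOME_SIZE_BP, CODING_FRACTION):,} bp")
```

Under these assumptions a single repeat family covers roughly ten times as much sequence as all protein-coding regions together.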
Variation along genome sequence
• Nucleotide usage varies along chromosomes
– Protein coding regions tend to have high GC levels
• Genes are not equally distributed across the chromosomes
– Housekeeping genes generally in gene-dense areas
– Gene-poor areas tend to have many tissue-specific genes
From: Ensembl
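Variation in nucleotide usage along a chromosome is commonly summarised as GC content in windows. The following is a minimal sketch of that idea, not the method Ensembl uses; the window and step sizes and the function names are assumptions chosen for illustration.

```python
from typing import List, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """GC content per window, returned as (0-based window start, GC fraction)."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: a GC-rich stretch followed by an AT-rich stretch.
    for start, gc in gc_windows("GGCCGCGCATATTATA", window=8, step=8):
        print(start, f"{gc:.2f}")
```

On real chromosomes the windows would be far larger (kilobases or more), and GC-rich windows would be expected to overlap gene-dense regions more often.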
Chromosome organisation (1)
[Figure labels: genes that are ON; genes that are OFF]
Chromosome organisation (2)
• DNA packed in chromatin
• Non-active genes (genes that are OFF) often in densely packed chromatin (30-nm fiber)
• Active genes (genes that are ON) in less dense chromatin (beads-on-a-string)
• Gene regulation by changing chromatin density, methylation/acetylation of the histones
From: Lodish (4th edition)
Human Genetic Variation
• Every human has essentially the same set of genes, but there are different forms of each gene, known as alleles
• Genetic variation explains some of the differences among people, such as:
– Blood group
– Eye color
– Skin color
– Hair color
– Higher or lower risk for getting particular diseases:
  • Cystic fibrosis, sickle cell disease
  • Diabetes, cancer, arthritis, asthma
  • Stroke, heart disease
  • Alzheimer's disease, Parkinson's disease
  • Depression, alcoholism
Variations in the Genome
Common sequence variations:
• Polymorphism
• Deletions
• Insertions
• Chromosome translocations
Today’s focus
1. Single Nucleotide Polymorphisms (SNPs)
2. Copy number variations (CNV)
3. Alternative transcripts
Single Nucleotide Polymorphisms (SNPs)
• SNPs are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered.
• For a variation to be considered a SNP, it must occur in at least 1% of the population.
• SNPs, which make up about 90% of all human genetic variation, occur every 100 to 300 bases along the 3-billion-base human genome.
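The figures on this slide can be turned into a quick estimate. The sketch below assumes SNPs are evenly spaced at one per 100 to 300 bases, which is a simplification, and applies the 1% frequency cut-off to a hypothetical allele frequency. All names are illustrative.

```python
GENOME_SIZE_BP = 3_000_000_000  # the 3-billion-base human genome
SNP_FREQUENCY_THRESHOLD = 0.01  # a SNP must occur in at least 1% of the population


def expected_snp_count(genome_size_bp: int, spacing_bp: int) -> int:
    """Number of SNPs if one occurs every `spacing_bp` bases (even spacing assumed)."""
    return genome_size_bp // spacing_bp


def is_snp(allele_frequency: float, threshold: float = SNP_FREQUENCY_THRESHOLD) -> bool:
    """True if the variant is frequent enough to count as a SNP under the 1% rule."""
    return allele_frequency >= threshold


if __name__ == "__main__":
    low = expected_snp_count(GENOME_SIZE_BP, 300)
    high = expected_snp_count(GENOME_SIZE_BP, 100)
    print(f"Expected SNPs: {low:,} to {high:,}")
    print(is_snp(0.05), is_snp(0.002))
```

With these assumptions the genome would carry on the order of 10 to 30 million SNP positions.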
SNPs & medicine
• Although more than 99% of human DNA sequences are the same,
variations in DNA sequence can have a major impact on how
humans respond to:
– disease;
– environmental factors such as bacteria, viruses, toxins, and chemicals;
– and drugs (& side-effects).
• This makes SNPs valuable for biomedical research and for developing pharmaceutical products or medical diagnostics.
SNP & disease, Alzheimer
Alzheimer's disease (AD) & apolipoprotein E
• The APOE gene encodes the protein apolipoprotein E, a cholesterol carrier that is found in the brain and other organs.
• Its exact role in the development of AD is unclear.
• Several studies have indicated a role in amyloid beta aggregation and clearance, influencing the onset of amyloid beta deposition.
SNP & disease, Alzheimer (2)
Two SNPs - three APOE variants
• APOE contains 2 SNPs that result in 3 possible alleles: E2, E3, E4.

Variant   rs429358   rs7412
E2        T          T
E3        T          C
E4        C          C

• A person who inherits at least one E4 allele will have a greater chance of developing AD.
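The two-SNP table can be read as a lookup from a pair of nucleotides to an APOE variant. The sketch below assumes phased data, meaning that for each chromosome copy the nucleotides at rs429358 and rs7412 are known together. The function names are illustrative, and any combination not listed in the table is reported as "unknown".

```python
from typing import Dict, Tuple

# (rs429358, rs7412) -> APOE variant, as listed in the table above.
APOE_VARIANTS: Dict[Tuple[str, str], str] = {
    ("T", "T"): "E2",
    ("T", "C"): "E3",
    ("C", "C"): "E4",
}


def apoe_variant(rs429358: str, rs7412: str) -> str:
    """APOE variant for one chromosome copy, or 'unknown' if not in the table."""
    return APOE_VARIANTS.get((rs429358.upper(), rs7412.upper()), "unknown")


def carries_e4(copy_a: Tuple[str, str], copy_b: Tuple[str, str]) -> bool:
    """True if at least one of the two chromosome copies carries the E4 variant."""
    return "E4" in (apoe_variant(*copy_a), apoe_variant(*copy_b))


if __name__ == "__main__":
    print(apoe_variant("T", "C"))
    print(carries_e4(("T", "C"), ("C", "C")))
```

This only reports which variant is present; it says nothing about the size of the associated risk.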
Copy Number Variation
• People do not only vary at the nucleotide level (SNPs)
• Copy Number Variations (CNVs): gains and losses of large chunks of DNA sequence consisting of between ten thousand and five million letters
• When there are genes in the CNV areas, this can lead to variations in the number of gene copies between individuals
• CNVs contribute to our uniqueness. CNVs can also influence the susceptibility to disease.
• CNVs may either be inherited or caused by de novo mutation
Copy Number Variation
[Figure: copy number (CN) states. Normal cell: CN=2; deletion: CN=0 or CN=1; amplification: CN=3 or CN=4]
[Figure: CNVs and their possible effects on gene expression. Cabianca D S, Gabellini D. J Cell Biol 2010;191:1049-1060. © 2010 Cabianca and Gabellini]
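The copy number states in the figure can be expressed as a simple comparison against the normal diploid count of 2. This is a minimal sketch; the function name and the `normal` parameter are illustrative, and real CNV calling from array or sequencing data is considerably more involved.

```python
def classify_copy_number(copy_number: int, normal: int = 2) -> str:
    """Label a region as 'deletion', 'normal' or 'amplification' by its copy number."""
    if copy_number < 0:
        raise ValueError("copy number cannot be negative")
    if copy_number < normal:
        return "deletion"
    if copy_number > normal:
        return "amplification"
    return "normal"


if __name__ == "__main__":
    for cn in range(5):
        print(f"CN={cn}: {classify_copy_number(cn)}")
```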
CNVs & disease
• Many inherited genetic diseases result from CNVs:
– Gene copy number can be elevated in cancer cells
– Autism
– Schizophrenia (dept. human genetics)
– Mental retardation (dept. human genetics)
– Parkinson's disease
• There are CNVs that protect against HIV infection and malaria.
• The contribution of CNV to the common, complex diseases, such as diabetes and heart disease, is currently less well understood
Alternative splicing
• Defects of the machinery of alternative splicing have been implicated in many diseases, including:
– neuropathological conditions such as Alzheimer disease
– cystic fibrosis and conditions involving growth and developmental defects
– many human cancers, e.g. BRCA1 in breast cancer
– Beta-globin in Beta-thalassemia
Annotating the genome
• A genome sequence is of limited use without functional annotation.
• Genome annotation is the process of attaching biological information to sequences. It consists of two main steps:
– identifying elements on the genome
– attaching biological information to these elements.
• Annotating the genome – Bioinformatics!
• The genome browser is a tool for visualizing genome annotation.
Why present the whole genome?
• Browsers provide context to understand genomic regions of interest
• See features in and around a specific gene
• Explore larger chromosome regions
• Search & retrieve information on a gene- and genome-scale
• Investigate genome organization
• Compare genomes
Basic Genome Annotation
• Genomic location
• Gene features
– Exons
– Introns
– UTRs
• Transcript(s)
– Pseudogenes
– Non-coding RNA
• Protein(s)
• Links to other sources of information
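The basic annotation items listed above map naturally onto a small data structure. The sketch below is one possible representation, not the schema of any particular browser. The class and field names are hypothetical, and coordinates are assumed to be 1-based and inclusive. Introns are derived as the gaps between consecutive exons.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Transcript:
    transcript_id: str
    exons: List[Tuple[int, int]]  # (start, end), 1-based inclusive (assumption)

    def introns(self) -> List[Tuple[int, int]]:
        """Gaps between consecutive exons, in genomic order."""
        ordered = sorted(self.exons)
        return [
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
            if next_start - prev_end > 1
        ]


@dataclass
class Gene:
    gene_id: str
    chromosome: str  # genomic location
    start: int
    end: int
    transcripts: List[Transcript] = field(default_factory=list)
    links: List[str] = field(default_factory=list)  # links to other sources


if __name__ == "__main__":
    # Hypothetical gene "X" with a single three-exon transcript.
    transcript = Transcript("X-001", [(100, 200), (300, 400), (500, 650)])
    gene = Gene("X", "1", 100, 650, transcripts=[transcript])
    print(gene.transcripts[0].introns())
```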
Advanced Genome Annotation
• Cytogenetic bands
• Polymorphic markers
• Genetic variation
• Repetitive sequences
• Expressed Sequence Tags (ESTs)
• cDNAs or mRNAs from related species
• Regions of sequence homology
• Genomic sequence variation
Possible research questions
P. Schattner,
Genomics 93 (2009):187-195
[Human] Genome Browsers (not limited to only human data)
• EBI: Ensembl
• NCBI: Map Viewer
• UCSC Genome Browser
Other Ensembl Installations
http://www.ensemblgenomes.org/
Organized Data Based on Chromosome Location
[Figure: genome browser view of Gene X with tracks: genes & predictions; variations & repeats; cross-species comparative data; & many more types of data from expression & regulation to mRNA and ESTs. Linked gene information: description, transcript data, structure, Gene Ontology, pathway data, homologous genes, expression data, etc.]
Ensembl Genes – biological basis
• All Ensembl transcripts are based on proteins and mRNAs in:
– UniProt/Swiss-Prot (manually curated)
– UniProt/TrEMBL
– NCBI RefSeq (manually curated)
Ensembl Homepage
HGNC
• HGNC – a unique name and symbol for every human gene
http://www.genenames.org/
Names in Ensembl
• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID
• For species other than human, a species code is inserted after ENS:
– MUS (Mus musculus) for mouse: ENSMUSG###
– DAR (Danio rerio) for zebrafish: ENSDARG###, etc.
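The naming scheme above is regular enough to parse with a pattern. The sketch below covers only the four feature types and the species codes named on this slide. It assumes the ### part is a run of digits and does not handle version suffixes. The example IDs are made up for illustration and do not refer to real genes.

```python
import re
from typing import NamedTuple, Optional

FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}
# Species codes mentioned on the slide; an absent code means human.
SPECIES_CODES = {"": "human", "MUS": "mouse", "DAR": "zebrafish"}

# ENS, optional three-letter species code, feature letter, digits.
_ID_PATTERN = re.compile(r"^ENS([A-Z]{3})?([GTPE])(\d+)$")


class EnsemblId(NamedTuple):
    species: str
    feature: str
    number: str


def parse_ensembl_id(identifier: str) -> Optional[EnsemblId]:
    """Split an Ensembl-style ID into species, feature type and number."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        return None
    code = match.group(1) or ""
    species = SPECIES_CODES.get(code, f"unknown species code {code}")
    return EnsemblId(species, FEATURE_TYPES[match.group(2)], match.group(3))


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    for example in ("ENSG00000000001", "ENSMUST00000000002", "ENSDARG00000000003"):
        print(example, parse_ensembl_id(example))
```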
Tabs in Ensembl
• Location Tab
• Transcript Tab
• Gene summary Tab
Ensembl: An Example
[Screenshot of an Ensembl view with annotation tracks; callout: "Click for more details"]
Gene Structure in Ensembl
Synopsis: What can I do with Ensembl?
• View, examine & explore annotated information for any chromosomal region:
– Genes
– ESTs, mRNAs, alternative transcripts
– Proteins
– SNPs, and SNPs across strains (rat, mouse), populations (human), or even breeds (dog)
– homologues and phylogenetic trees across more than 40 species
– whole genome alignments
– conserved regions across species
– gene expression profiles
• Upload your own data and use BLAST/BLAT against any Ensembl genome
• Export sequence, or create a table of gene information
• Extra slides follow now
©CMBI 2009
Chromosome organisation (1)
From: Lodish (4th edition)
Alternative Transcripts
Ensembl: Many Additional Tools
• BLAST/BLAT: best scoring match
• BioMart: data retrieval and download
Copyright OpenHelix. No
use or reproduction without