
Introduction to Bioinformatics and Genomics
University of Brawijaya 2013
4th December 2013
Austen Ganley, INMS
Understanding the Human Genome:
Lessons from the ENCODE project
Glossary
• Genome
• Genes
• DNA/RNA
• Protein
• Cell
• Transcription
• Chromatin
• Histones
• Nucleosomes
• Non-coding RNA
• Sequencing
• Microarray
• Transcription start site
• Active/open
• Inactive/repression
[Figure: gene structure, labelled with promoter, transcriptional start site, exon, intron and transcriptional terminator]
Introduction
• Individual scientists worked together
• Aim was to understand 1% of the human
genome (2007), and 100% (2012)
• Looked at:
• Transcription
• Chromatin/transcription factors
• Replication
• Evolution
Genes
• Now estimated to be about 21,000
protein-coding genes (taking about
3% of the whole genome)
• In addition, there are about 9,000
microRNAs, and about 10,000 long
non-coding RNAs
Transcription
• Transcription was measured by two
different methods:
• Whole genome microarrays
• RNA-sequencing
Detecting transcription
using tiled microarrays
Transcription
• They found at least 62% of the
whole genome is transcribed
(remember, genes only account for
about 3% of the whole genome)
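To make the comparison between 62% and 3% concrete, the short sketch below converts both fractions into base pairs. The genome size of roughly 3.1 billion base pairs is an assumption added here (it is not stated on the slides), and the function name is purely illustrative.

```python
# Assumption: a haploid human genome of roughly 3.1 billion base pairs
# (this figure is not given on the slides).
GENOME_BP = 3_100_000_000


def bases_covered(fraction, genome_bp=GENOME_BP):
    """Return the number of base pairs covered by a fraction of the genome."""
    return round(fraction * genome_bp)


genic_bp = bases_covered(0.03)        # protein-coding genes, about 3%
transcribed_bp = bases_covered(0.62)  # transcribed, at least 62%
fold = transcribed_bp / genic_bp

print(f"protein-coding genes: ~{genic_bp / 1e6:.0f} Mb")
print(f"transcribed:          ~{transcribed_bp / 1e6:.0f} Mb")
print(f"ratio:                ~{fold:.1f}x")
```

Under these assumptions the transcribed portion is about twenty times larger than the portion taken up by protein-coding genes.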
Transcriptional start sites
• Goal is to identify the transcription start
sites
• Not easy to do!
• Use a technique called CAGE (Cap
Analysis Gene Expression)
CAGE
• Makes use of the 5′ cap on mRNA
• First, mRNA is reverse-transcribed to
form cDNA (as an RNA-DNA hybrid)
• Then, biotin is attached to the 5′ cap,
and the cDNA is fragmented
• The biotin-labelled fragments are isolated
(representing the 5′ end of the mRNA), and
these fragments are sequenced
• About 60,000 transcription start sites found
• Only half of these match known genes
• What do the other ones do? May explain
high level of transcription
• The transcription start sites are often far
upstream of the gene start, and can overlap genes
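Once CAGE fragments have been sequenced and mapped, candidate transcription start sites can be found by grouping the 5′ positions of the tags. The sketch below is a minimal illustration only, not the ENCODE pipeline: it assumes tags from a single chromosome and strand, and the function name, `max_gap` and `min_tags` are hypothetical parameters chosen for the example.

```python
from collections import Counter


def call_tss_clusters(tag_starts, max_gap=20, min_tags=2):
    """Group 5' tag start positions into candidate TSS clusters.

    Assumes all tags come from one chromosome and one strand.
    Returns a list of (cluster_start, cluster_end, tag_count, peak_position).
    """
    counts = Counter(tag_starts)

    # Walk the distinct positions in order, starting a new cluster
    # whenever the gap to the previous position exceeds max_gap.
    clusters = []
    current = []
    for pos in sorted(counts):
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)

    results = []
    for positions in clusters:
        total = sum(counts[p] for p in positions)
        if total < min_tags:
            continue  # too few tags to support a start site
        peak = max(positions, key=lambda p: counts[p])
        results.append((positions[0], positions[-1], total, peak))
    return results


# Toy data: positions of the 5' ends of mapped CAGE tags.
tags = [100, 100, 101, 105, 100, 500, 900, 902, 902]
clusters = call_tss_clusters(tags)
print(clusters)
```

In the toy data the single tag at position 500 is discarded because it falls below `min_tags`; the two remaining clusters are reported with the most frequently used position as the peak.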
Overlapping Genes
Transcriptional start sites from the DONSON gene
• An overlapping gene, starting far upstream
• The DONSON gene is a known gene
• However, some transcripts start in the ATP5O
gene, and include some ATP5O exons
• The two genes in between are skipped
Chromatin: histones and nucleosomes
• Nucleosomes are formed
from DNA that is packaged
around histones
• Histones are a set of
proteins that usually
associate as an octamer
www.mun.ca/biochem/courses/3107/Topics/supercoiling.html
www.palaeos.com/Eukarya/Eukarya.Origins.5.html
DNase I hypersensitive sites (DHS)
Hebbes Lab, University of Portsmouth, UK
Gilbert, Developmental Biology, Sinauer
• DNase I preferentially
digests nucleosome-depleted
regions (DNase I hypersensitive sites)
• These are associated
with gene transcription
• Chromatin is digested
with DNase I: only digests
nucleosome-free regions
• The remaining DNA is
isolated, and put on a
microarray or sequenced
• Find the open, active
regions of the genome
DNase I hypersensitive sites
• In total, about 3 million DNase I
hypersensitive sites in the genome,
covering about 15% (versus about 40,000
genes covering about 4%)
• Transcriptional start sites are regions of
DNase I hypersensitivity, as expected
• However, most DNase I hypersensitive sites are not
associated with a transcriptional start site
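The comparison between DNase I hypersensitive sites and transcriptional start sites comes down to an interval question: what fraction of hypersensitive sites have a start site nearby? The sketch below illustrates that idea on toy coordinates. It assumes a single chromosome, and the function name and the 1,000 bp `window` are hypothetical choices, not values taken from ENCODE.

```python
from bisect import bisect_left


def fraction_near_tss(dhs_intervals, tss_positions, window=1000):
    """Fraction of DHS intervals with at least one TSS within `window` bp.

    dhs_intervals: list of (start, end) pairs on one chromosome.
    tss_positions: list of TSS coordinates on the same chromosome.
    """
    tss_sorted = sorted(tss_positions)
    near = 0
    for start, end in dhs_intervals:
        # Find the first TSS at or after (start - window); the interval
        # counts as "near" if that TSS is not beyond (end + window).
        i = bisect_left(tss_sorted, start - window)
        if i < len(tss_sorted) and tss_sorted[i] <= end + window:
            near += 1
    return near / len(dhs_intervals) if dhs_intervals else 0.0


# Toy data, for illustration only.
dhs = [(900, 1100), (5000, 5200), (20000, 20300), (40000, 40150)]
tss = [1000, 21000, 60000]
frac = fraction_near_tss(dhs, tss)
print(f"{frac:.0%} of DHS are within 1 kb of a TSS")
```

Sorting the start sites once and using a binary search keeps the check fast even for millions of sites.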
[Figure: the genome, showing transcription start sites, genes, transcribed regions and DNase I hypersensitive regions]
Histone Modification Effects
• Modifications occur on the histone tails
• They alter the strength of DNA-histone
binding, and influence the binding of other
proteins to the DNA
• Thus they can activate or silence
gene expression
The “Histone Code”
• The combination of histone modifications determines a
gene’s transcriptional status (the “histone code”)
• Some modifications are associated with active gene
expression
– H3K4me2
– H3K4me3
– H3ac
– H4ac
• Some are associated with repression
– H3K27me3
– H3K4me1
www.nature.com/nrm/index.html
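The two lists of marks above can be written down as a simple lookup. The sketch below takes the active and repressive marks exactly as listed on the slide and summarises a region by counting how many of each are present. The counting rule and the function name are assumptions made for illustration; the real relationship between marks and expression is more complex than a tally.

```python
# Marks as listed on the slide.
ACTIVE_MARKS = {"H3K4me2", "H3K4me3", "H3ac", "H4ac"}
REPRESSIVE_MARKS = {"H3K27me3", "H3K4me1"}


def summarise_marks(marks):
    """Summarise the histone marks observed at one region.

    Hypothetical rule: call the region by whichever class of mark
    is more numerous; ties and unlisted marks give "ambiguous".
    """
    observed = set(marks)
    active = sorted(observed & ACTIVE_MARKS)
    repressive = sorted(observed & REPRESSIVE_MARKS)
    unknown = sorted(observed - ACTIVE_MARKS - REPRESSIVE_MARKS)

    if len(active) > len(repressive):
        call = "likely active"
    elif len(repressive) > len(active):
        call = "likely repressed"
    else:
        call = "ambiguous"
    return {"active": active, "repressive": repressive,
            "unknown": unknown, "call": call}


print(summarise_marks(["H3K4me3", "H3ac", "H3K27me3"]))
print(summarise_marks(["H3K27me3"]))
```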
ChIP (Chromatin
immunoprecipitation)
• A method to find where in the genome your
protein of interest binds
• You cross-link the sample, and fragment
the DNA into pieces
• Immunoprecipitate using an antibody to
your protein of interest
• Reverse the cross-links, and isolate the
DNA
• To find where in the genome the protein
was bound:
• Hybridise the DNA to a microarray (ChIP-chip) OR sequence it (ChIP-seq)
www.rndsystems.com/product_detail_objectname_exactachip_assayprinciple.aspx
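For the ChIP-seq route, the last step (finding where the protein was bound) can be illustrated by counting reads in fixed-width bins and looking for bins with more reads than expected. The sketch below is a simplified illustration, not a real peak caller. The comparison against a control library, the bin size, the pseudocount and the two-fold threshold are all assumptions added for this example and do not appear on the slide.

```python
def bin_counts(read_positions, region_length, bin_size):
    """Count reads per fixed-width bin along a region of the given length."""
    n_bins = (region_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_positions:
        if 0 <= pos < region_length:
            counts[pos // bin_size] += 1
    return counts


def fold_enrichment(chip_reads, control_reads, region_length,
                    bin_size=100, pseudocount=1.0):
    """Per-bin ratio of ChIP reads to control reads.

    ChIP counts are scaled to the control's total read count, and a
    pseudocount avoids division by zero in empty bins.
    """
    chip = bin_counts(chip_reads, region_length, bin_size)
    control = bin_counts(control_reads, region_length, bin_size)
    total_chip, total_control = sum(chip), sum(control)
    scale = total_control / total_chip if total_chip and total_control else 1.0
    return [(c * scale + pseudocount) / (k + pseudocount)
            for c, k in zip(chip, control)]


# Toy read start positions on a 500 bp region, for illustration only.
chip_reads = [120, 130, 150, 160, 170, 410]
control_reads = [50, 150, 250, 350, 450, 460]
fold = fold_enrichment(chip_reads, control_reads, region_length=500)
enriched = [i for i, f in enumerate(fold) if f >= 2.0]
print(fold)
print("enriched bins:", enriched)
```

In the toy data only the bin covering positions 100 to 199 passes the threshold, which is where most of the ChIP reads pile up.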
Histone modification profiles
• They found that histone modifications
associated with active transcription
were found around transcription start sites
• They found that histone modifications
associated with gene repression were
depleted around transcription start sites
• This is as expected
• Around DNase I hypersensitive sites not
near transcription start sites, they found
almost the opposite pattern
[Figure: enrichment of active histone marks and depletion of inactive histone marks at a transcription start site]
[Figure: enrichment of inactive histone marks but little enrichment of active histone marks at a DNase I hypersensitive site]
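Profiles like these are typically built by averaging a signal (for example, ChIP signal for one histone mark) across many sites that have been lined up on a common anchor point. The sketch below shows that averaging step on toy data. The function name, the per-base signal list and the `flank` size are hypothetical, and real profiles would also take strand into account.

```python
def aggregate_profile(signal, anchors, flank):
    """Average per-base signal in a window of +/- flank around each anchor.

    signal: list of per-base values along one chromosome.
    anchors: positions to align on (e.g. transcription start sites).
    Anchors whose window runs past either end of the signal are skipped.
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for a in anchors:
        lo, hi = a - flank, a + flank
        if lo < 0 or hi >= len(signal):
            continue
        for i in range(width):
            totals[i] += signal[lo + i]
        used += 1
    return [t / used for t in totals] if used else totals


# Toy signal for an "active" mark, peaking exactly at two of the anchors.
active_mark = [0] * 30
active_mark[10] = 4
active_mark[20] = 2
profile = aggregate_profile(active_mark, anchors=[1, 10, 20], flank=2)
print(profile)
```

Running the same function with a different set of anchors (for example, hypersensitive sites far from any start site) gives the second kind of profile for comparison.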
Histone modification profiles
• They also found other patterns
• Combining all the results (plus results for
transcription factor binding), they say that
the human genome is divided into seven
different types of chromatin states
• The state of a region depends on the
combination of histone modifications and
transcription factor binding found there
The seven chromatin states
[Figure legend: Promoter (red), Enhancer (yellow), Gene body (green), Inactive region (grey)]
Grand Summary: ENCODE
Transcription:
• a lot of non-coding transcription (~60% of the genome
transcribed): much more than needed just to transcribe
all the genes
Transcription start sites:
• twice as many transcription start sites as traditional “genes”
• transcripts span large regions, even between genes
DNase I hypersensitive sites:
• more than just at transcription start sites
• two types: those found at TSS, and those found at
other regions
• these have different chromatin profiles
Histone modifications:
• active marks correlate with TSS/DHS
• distal DHS have a different histone modification profile
Chromatin states:
• the genome can be divided into seven different types
• these are determined by the combination of histone
modifications and transcription factor binding that occur
Overview:
• genome can be generalised into seven different states
• the function of some of these states is known, e.g. promoter
• the function of others is not known, but may explain the
high level of transcription and open chromatin structure