Classification and Analysis of CpG islands on Gene
Sequence analysis of CpG islands reveals a possible
functional correlation between genes and their CpG
island sequences
Henry Hyun-il Paik
Bioinformatics, School of Informatics
Indiana University
Outline
• What CpG islands are
• The Known Relations between CpG islands and
Genes
• Motivation and Goal
• Data set
• Procedures
• Results
• Discussion
What are CpG islands?
• CpG dinucleotides are rare in mammalian DNA
• In mammals, DNA methylation occurs almost exclusively at CpG sites
• Methylated cytosines may be converted to thymine by
deamination over evolutionary time
– CpG → TpG
• CpG islands are short stretches of DNA with a higher
frequency of the CG sequence
• They are usually unmethylated
What are CpG islands?
• Definition from Gardiner-Garden & Frommer
– At least 200 bases long
– G+C content: > 50%
– observed CpG/expected CpG ratio: >= 0.6
• Definition from Takai & Jones
– Longer than 500 bp
– G+C content: > 55%
– observed CpG/expected CpG ratio: >= 0.65
– With this definition, CpGi’s are more likely to be associated
with the 5’ regions of genes, and most Alu’s are excluded
• There are about 29,000 such regions in the human
genome
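The two definitions above can be checked mechanically. The following is a minimal sketch, not the code used in this project: the function names are hypothetical, the expected CpG count is taken as (number of C × number of G) / length, and both length limits are treated as inclusive, which is an assumption.

```python
def cpg_stats(seq):
    """Return (length, G+C fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so this counts every CpG
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: (C * G) / N
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_fraction, obs_exp


def is_island(seq, min_len, min_gc, min_ratio):
    """Apply the three criteria; thresholds are passed in by the caller."""
    n, gc_fraction, obs_exp = cpg_stats(seq)
    return n >= min_len and gc_fraction > min_gc and obs_exp >= min_ratio


def gardiner_garden_frommer(seq):
    # Thresholds as listed on the slide: 200 bases, G+C > 50%, obs/exp >= 0.6
    return is_island(seq, 200, 0.50, 0.60)


def takai_jones(seq):
    # Thresholds as listed on the slide: 500 bp, G+C > 55%, obs/exp >= 0.65
    # (assumption: a sequence of exactly 500 bp is accepted)
    return is_island(seq, 500, 0.55, 0.65)
```

A real island finder would slide a window along the sequence and merge adjacent hits; this sketch only tests a candidate region that has already been cut out.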
CpG islands & Genes
• CpG islands located in the promoter regions of genes
can play important roles in gene silencing
• Housekeeping genes
– Almost all housekeeping genes are associated with at least one
CpG island
– These islands start 5’ of the transcription start site and
cover one or more exons and introns
• Tissue-specific genes
– About 40% of tissue-specific genes are associated with islands
– The position of these islands is not as strongly biased toward the
transcription start site as in the housekeeping genes
CpG islands & Genes
• Not all CpG islands are associated with genes
– Ioshikhes & Zhang identified features that discriminate
promoter-associated from non-associated CpG islands
• There are methylation-prone and methylation-resistant
CpG islands
– Feltus et al. found patterns that discriminate methylation-prone
from methylation-resistant CpG islands
CpG islands & Genes
[Figure: positions of CpG islands (CpGi) relative to a gene: promoter CpG islands at the 5’ end, CpG islands in the gene body, and CpG islands at the 3’ end]
Motivation and Objective
• Our project was inspired by these ideas
• The mechanical definition is used as it is
– At least 200 bases long
– G+C content: > 50%
– observed CpG/expected CpG ratio: >= 0.6
• We tried to find the “semantic meaning” of CpG islands: the correlation between CpG islands and gene functions
• Are there any significant CpGi patterns related to
gene functions?
Motivation and Objective
[Figure: CpGi 1 shown with Gene 1, and CpGi 2 shown with Gene 2]
We assume that Gene 1 and Gene 2 have similar functions
1) Then the Gene 1 and Gene 2 sequences are probably similar.
2) Our goal is to find CpGi patterns that occur when genes have similar functions.
Data Set
• Reference: Larsen F., Gundersen G., Lopez L., Prydz H. CpG islands as gene markers in the human genome. Genomics 13:1095-1107 (1992)
• Total number of entries: 1711
• Entries with no islands: 1212
• Entries with islands: 499
• Total number of islands: 928
• The length of CpG islands
– Average size of islands: 465 bp
– Shortest detectable island: 200 bp
– Largest island: 3340 bp
Expression of gene | Number | Number associated with islands
Widespread | 217 | 216 (99%)
Limited | 719 | 261 (36%)
[Figure: a snapshot of the data set]
Procedures
• Clustering
– FASTA all-to-all comparison
– Clustering by BAG
• Motif (pattern) discovery & search for each cluster
– MEME
– MAST
• Database search with CpG island patterns
– BLAST
Clustering
• We use BAG, a sequence clustering program by Sun Kim
• Each CpG island is compared against all CpG islands with
FASTA, and the results are the input to BAG
• BAG builds clusters based on sequence similarity
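To illustrate the idea of grouping islands from all-to-all similarity scores, here is a simplified stand-in. It is not BAG's algorithm: it only links any two islands whose pairwise score reaches a cutoff and reports the connected groups. The function name, the island names, and the cutoff value are all hypothetical.

```python
def cluster_by_similarity(names, hits, min_score):
    """Group sequences into clusters by linking pairs with score >= min_score.

    names: list of sequence identifiers
    hits:  iterable of (name_a, name_b, score) from an all-to-all comparison
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path along the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in hits:
        if a != b and score >= min_score:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    # Largest clusters first, members in alphabetical order
    return sorted((sorted(m) for m in clusters.values()),
                  key=lambda m: (-len(m), m))


# Invented example scores, for illustration only
example = cluster_by_similarity(
    ["i1", "i2", "i3", "i4"],
    [("i1", "i2", 300), ("i2", "i3", 250), ("i3", "i4", 50)],
    min_score=100,
)
```

Single-linkage grouping like this can chain loosely related sequences into one cluster, which is one reason a dedicated tool is used instead.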
Motif Discovery & Search
• MEME discovers patterns for each cluster
• To assess the significance of a pattern, MAST searches all
CpG islands with the pattern
• The E-value shows how significant the pattern is and how often
it occurs
• Profiles are made to represent each cluster
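The general idea of turning a discovered pattern into a profile and scanning sequences with it can be sketched as follows. This is a simplified illustration, not what MEME or MAST compute: it builds a log-odds matrix from already aligned sites and reports the best raw score, with no E-value. The function names, the pseudocount, and the uniform background of 0.25 are assumptions.

```python
import math


def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Build a position-specific log-odds matrix from equal-length aligned sites."""
    width = len(sites[0])
    matrix = []
    for pos in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # log2 of (observed frequency / background frequency) per base
        matrix.append({b: math.log2((counts[b] / total) / background)
                       for b in "ACGT"})
    return matrix


def best_match(matrix, seq):
    """Return (score, start) of the best-scoring window, or None if seq is too short."""
    width = len(matrix)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Bases outside ACGT (for example N) contribute a neutral score of 0
        score = sum(matrix[i].get(base, 0.0) for i, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

Because CpG islands are G+C rich, a uniform background overstates how surprising C and G are; a background estimated from the islands themselves would be a more careful choice.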
Motif Discovery & Search
BLAST
• The entire GenBank was searched with the CpG island
profile, not with the gene sequence
• We see how efficiently the profile finds genes that
have similar functions
• This serves to verify the validity of the profile
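One way to review such a search is to filter the reported hits by E-value. The sketch below assumes the hits were saved in BLAST's 12-column tab-separated tabular format (query, subject, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); the slides do not say how results were stored. The function name, the cutoff, and the sample lines are hypothetical.

```python
def parse_blast_tabular(lines, max_evalue=1e-5):
    """Keep hits with E-value <= max_evalue from 12-column tabular BLAST output."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        evalue = float(fields[10])
        if evalue <= max_evalue:
            hits.append({
                "query": fields[0],
                "subject": fields[1],
                "identity": float(fields[2]),
                "evalue": evalue,
                "bitscore": float(fields[11]),
            })
    hits.sort(key=lambda h: h["evalue"])  # most significant first
    return hits


# Invented sample rows, for illustration only
sample = [
    "# comment line",
    "profile18\tsubjA\t98.0\t50\t1\t0\t1\t50\t100\t149\t1e-20\t95.0",
    "profile18\tsubjB\t80.0\t30\t6\t0\t1\t30\t10\t39\t0.5\t30.0",
]
kept = parse_blast_tabular(sample)
```

The descriptions of the retained subjects can then be compared with the function shared by the cluster members.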
Results
• Of 115 clusters in total, 26 have members with similar
gene function
• These 26 clusters fall into two categories
depending on CpGi location
– 18 clusters have CpGi’s in the coding region
– 8 clusters have CpGi’s in the promoter region
Results
• An example of a CpGi in the gene body
• Cluster #18: Human heat-shock protein HSP70B' gene
– MEME
– MAST
– Profile sequence
ATCATCGCCAACGACCAGGGCAACCGCACCACCCCCAGCTACGTGGCCTT
– BLAST
Results
• An example of a promoter CpGi
• Cluster #25: Human gene for creatine kinase B
– MEME
– MAST
– Profile sequence
GAGGAGTCCTACGAAGTGTTCAAGGATCTCTTCGACCCCATCATTGAGGA
– BLAST
Gene & CpG islands in promoter region
Cluster | Description | Acc No.
7 | Human MAGE-4a antigen (MAGE4a) gene | U10687.1, U10687.5
14 | Aldose Reductase gene | M59856.1, L14440.1
25 | Human creatine kinase | M60806.1, X15334.1
72_73 | Human metallothionein gene | M10942.1(arti), J03910.1
79_80 | Human gene for neurofilament subunit | X05608.1(arti), X15306.1, Y00067.2
85 | Phenylethanolamine N-methyltransferase gene | J03280.1
92 | Human U1 small nuclear RNA pseudogene | M14387.1
96 | Human trichohyalin (TRHY) gene | L09190.1, L09190.3
Further Acc No. entries from this table (cluster not recoverable): U10687.3, U10687.4, M13003.1, K01383.1, X52730.1, M28010.1, U10687.2, M28011.1
Gene & CpG islands in CDS
Cluster | Description | Acc No.
9 | alpha 2 adrenergic receptor gene | D13538.1, M23533.2, M34041.1, M67439.1(arti), M83181.1, M28269.1, X13556.1
10 | actin gene | M19283.2
13 | alkaline phosphatase gene | J03252.1
18 | Human heat shock protein | M19645.1 ARTI
32 | Neurophysin gene | X62890.1, M11166.1
41 | Human v-erbA related ear-2 gene | X12794.1, X12795.1
52 | histone H1 (H1F4) gene | X57130.1, M60748.1, X57129.1
53 | histone H3 gene | X57128.1, M60746.1, M26150.1
54 | Human histone H4 (H4) gene | X60482.1, X60483.1, X60484.1, X00091.1, X00038.1, M16707.1, M60749.1, X60487.1, X67081.1, X60486.1
56 | serotonin receptor gene | K02405.1, K02773.1 ARTI
58 | Human histone H2b gene | M60751.1, X57985.1
59 | Human histone H2a gene | M60752.1, X00089.1
64 | Human heat shock protein | X03901.1, L39370.1
69 | proto oncogene (JUN) | J04111.1, M29039.1
87 | Human beta-tubulin pseudogene | X00734.5, J00315.1
90 | H.sapiens gene for 28S rRNA V8 region | X69341.1, X69358.1
91 | Human POU domain factor (Brn-3a) gene | U10063.1, U10061.1
Further Acc No. entries from this table (cluster not recoverable): M20543.2, J03930.1, M31008.2, M59830.1, M11717.1, X51757.1, M11186.1, K01499.1, X02228.1, X00088.1, X69357.1, M11167.1, M77285.1
Discussion
• The BLAST results suggest that CpG islands in both the
promoter region and the CDS are good markers for gene
sequences
• Although promoter CpG islands were few, they
represented their clusters significantly
• Since many CpG islands tend to cover exons, they can
be used to identify transcripts
• More data are needed to support this result and to derive
generic patterns
Acknowledgement
• Dr. Sun Kim
• Dr. Paul Ma
• Arvind
• Bioperl community
Comments & Questions