EXPLORING DEAD GENES

Adrienne Manuel
I400
What are they?




Dead genes are also called pseudogenes
Pseudogenes are nonfunctioning copies of genes in DNA
They result from reverse transcription of an mRNA transcript
Or from gene duplication and subsequent disablement
Expression of Pseudogenes

Some pseudogenes are evidently transcribed

Expression of pseudogenes varies

The snail (Lymnaea stagnalis) is an example of an organism that still has functioning pseudogenes
Pseudogenes, Good and Bad!

- Raised expression in tumor cells

+ Useful in studying molecular evolution

+ Helpful in determining rates of genomic DNA loss for an organism
Size and Distribution of Pseudogenes
DEFINING POPULATIONS AND SUBPOPULATIONS
G is the total population of confirmed and predicted protein-encoding genes

ΨG is the estimated population of pseudogenes that correspond to G



The set of genes with at least one verifying EST match is denoted GE
A set of genes deemed to be highly expressed was derived from microarray expression data and is denoted GM
The corresponding predicted pool of pseudogenes is denoted ΨGM
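
The populations and subpopulations defined above can be pictured as plain sets. The sketch below is only an illustration: the identifiers are made up, and it assumes that each pseudogene is linked to the single gene it most closely matches, which the slides do not spell out.

```python
# Hypothetical identifiers; none of these come from the original data.
G = {"gene_a", "gene_b", "gene_c", "gene_d"}   # all confirmed and predicted genes
G_E = {"gene_a", "gene_b", "gene_c"}           # genes with at least one EST match
G_M = {"gene_a", "gene_b"}                     # highly expressed genes (microarray)

# Assumption: each pseudogene maps to the gene it most closely matches.
closest_gene = {"psi_1": "gene_a", "psi_2": "gene_c", "psi_3": "gene_d"}


def pseudogene_pool(genes, closest_gene):
    """Return the pseudogenes whose closest matching gene is in `genes`."""
    return {psi for psi, gene in closest_gene.items() if gene in genes}


psi_G = pseudogene_pool(G, closest_gene)      # pool corresponding to G
psi_G_M = pseudogene_pool(G_M, closest_gene)  # pool corresponding to GM
```

With this layout, a pool such as ΨGM is just a filtered view of ΨG, so subpopulation sizes can be compared directly with `len()`.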
Data Files



Sanger Sequencing Centre FTP site (ftp://ftp.sanger.ac.uk)
This site holds the complete sequences of the six worm chromosomes
GFF data files with annotations for genes and other genomic features that correspond to wormpep18
The pseudogene population was assembled in the form of a pipeline
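
As a rough idea of how the GFF annotation files might be read, here is a minimal sketch. It relies only on the general GFF layout (tab-separated columns: seqname, source, feature, start, end, score, strand, frame, attributes). The example lines and the "Pseudogene" source label are invented for illustration and may not match the actual Sanger Centre files.

```python
import io


def iter_gff(handle):
    """Yield one dict per feature line of a GFF file (tab-separated columns)."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a complete feature line
        yield {
            "seqname": cols[0],
            "source": cols[1],
            "feature": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "attributes": cols[8] if len(cols) > 8 else "",
        }


# Made-up example lines; the source and feature labels are assumptions.
example = io.StringIO(
    'CHROMOSOME_I\tcurated\texon\t100\t250\t.\t+\t.\tSequence "X1"\n'
    'CHROMOSOME_I\tPseudogene\texon\t900\t1200\t.\t-\t.\tSequence "X2"\n'
)
features = list(iter_gff(example))
pseudo = [f for f in features if "pseudogene" in f["source"].lower()]
```

For a real file, the `io.StringIO` object would be replaced by `open(path)`.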
Pipelines
Step 1: Sanger Centre pseudogene annotations


Start with a list of 332 pseudogenes
The pseudogene population was derived by looking for gene disablement
Step 2: FASTA matching to find potential pseudogenes
PIPELINES (continued)



Worm genes are masked for low-complexity regions with the program SEG
TFASTX and TFASTY are next used to compare the complete wormpep18 against the worm genome
After the comparison, pseudogene matches are refined in the next step
Pipeline (continued)
Step 3: Reduction for overlaps on the genomic DNA
Significant matches of protein sequences to the DNA were reduced for redundancy where homologs match the same segment of DNA
Matches are then sorted
Step 4: Prevention of overcounting for adjacent matches
Initial matches may correspond to the same pseudogene
To avoid overcounting, matches were realigned
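
Steps 3 and 4 can be sketched roughly as follows. This is not the original implementation. It assumes a greedy rule for Step 3 (keep the lowest e-value match on each overlapping segment) and, for Step 4, simply joins nearby matches to the same protein as a crude stand-in for the realignment mentioned above. The `Match` type, the `max_gap` value, and the example hits are all hypothetical.

```python
from collections import namedtuple

Match = namedtuple("Match", "protein start end evalue")


def overlaps(a, b):
    """True if two matches share at least one genomic position."""
    return a.start <= b.end and b.start <= a.end


def reduce_overlaps(matches):
    """Step 3 sketch: keep the lowest e-value match per overlapping segment."""
    kept = []
    for m in sorted(matches, key=lambda m: m.evalue):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


def merge_adjacent(matches, max_gap=100):
    """Step 4 sketch: join neighbouring matches to the same protein."""
    merged = []
    for m in sorted(matches, key=lambda m: m.start):
        prev = merged[-1] if merged else None
        if prev and prev.protein == m.protein and m.start - prev.end <= max_gap:
            merged[-1] = Match(prev.protein, prev.start, max(prev.end, m.end),
                               min(prev.evalue, m.evalue))
        else:
            merged.append(m)
    return merged


hits = [
    Match("P1", 100, 300, 1e-20),
    Match("P2", 150, 320, 1e-5),   # homolog hitting the same segment
    Match("P1", 350, 500, 1e-10),  # adjacent piece of the same pseudogene
    Match("P3", 5000, 5200, 1e-8),
]
result = merge_adjacent(reduce_overlaps(hits))
```

In this toy input the P2 hit is dropped as redundant and the two P1 hits are counted once, leaving two candidate pseudogenes.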
Pipeline
Step 5: Masking against Sanger Centre annotation and transposon library
Potential pseudogenes are filtered for overlap with any other annotations in the Sanger Centre GFF files, e.g. exons of genes, tandem or inverted repeats
Step 6: Reduction for possible additional repeat elements
At this point there is a set of 3814 pseudogenic fragments
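
The masking idea in Step 5 amounts to an interval-overlap filter. A minimal sketch follows, assuming candidates and annotations are given as (chromosome, start, end) tuples. The coordinates are invented, and the real pipeline may treat partial overlaps differently.

```python
def intervals_overlap(start, end, other_start, other_end):
    """True if two closed intervals share at least one position."""
    return start <= other_end and other_start <= end


def mask_candidates(candidates, annotations):
    """Keep candidates that overlap no annotated interval on the same chromosome."""
    kept = []
    for chrom, start, end in candidates:
        hit = any(
            chrom == a_chrom and intervals_overlap(start, end, a_start, a_end)
            for a_chrom, a_start, a_end in annotations
        )
        if not hit:
            kept.append((chrom, start, end))
    return kept


# Hypothetical annotated features (e.g. exons, repeats) and candidates.
annotations = [("I", 100, 250), ("I", 4000, 4100)]
candidates = [("I", 200, 400), ("I", 900, 1200), ("II", 120, 240)]
unmasked = mask_candidates(candidates, annotations)
```

This version compares every candidate with every annotation. For whole-genome data a sorted or indexed lookup would be the more practical choice.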
Pipeline (final step)
Step 7: Reducing the e-value threshold

The e-value match threshold is reduced from 0.01 to 0.001
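
To show what moving the e-value cutoff does, here is a small sketch with made-up fragments. A lower cutoff keeps fewer matches. Treating the cutoff as inclusive is an assumption.

```python
def filter_by_evalue(matches, threshold):
    """Keep (name, evalue) pairs whose e-value is at or below the threshold."""
    return [m for m in matches if m[1] <= threshold]


# Hypothetical fragments with their e-values.
matches = [("frag_1", 0.005), ("frag_2", 0.0004), ("frag_3", 1e-12)]
at_old = filter_by_evalue(matches, 0.01)    # cutoff before Step 7
at_new = filter_by_evalue(matches, 0.001)   # cutoff after Step 7
```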
Check the web!


http://bioinfo.mbb.yale.edu/genome/womr/pseudogene
To explore the pseudogene population, the data can be viewed either by searching for a protein name or by viewing a specific range in a chromosome
Size of Pseudogene Population


Composed of 2168 sequences, about 12% of the total gene complement
Factors that affect the size:
1. Dead copies of transposable elements
2. The size of the pseudogene population is underestimated because pseudogenes with less obvious disablements aren't included
3. Annotated genes might be pseudogenes because the disablement is undetectable
4. Pseudogenes that are still part of a functioning gene
5. Some pseudogenes arise due to sequencing errors
6. Possible genomic repeats
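
As a quick sanity check on the numbers on this slide, 2168 sequences at about 12% implies a gene complement of roughly 18,000. Since "about 12%" is rounded, the result is only approximate.

```python
pseudogenes = 2168   # from the slide
fraction = 0.12      # "about 12%", a rounded figure

implied_genes = pseudogenes / fraction
print(round(implied_genes))
```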
SUBPOPULATIONS

Highly expressed genes have fewer dead gene copies

The most reliable subset of the pseudogene population is about half the total for ΨG

39% of pseudogenes are intronic; these kinds of pseudogenes aren't ailing families of proteins
Chromosomal Distributions

Pseudogenes are more abundant near the ends of the chromosomes (the “arms”)

A proportion of dead genes was calculated for each chromosome
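
One way such a per-chromosome proportion could be computed is sketched below. The slide does not say how the proportion is defined, so this assumes pseudogenes divided by genes plus pseudogenes on each chromosome. The counts are invented.

```python
def dead_gene_proportions(counts):
    """Map chromosome -> pseudogenes / (genes + pseudogenes).

    `counts` maps chromosome name -> (genes, pseudogenes).
    The definition of the proportion is an assumption, not from the slides.
    """
    return {
        chrom: pseudo / (genes + pseudo)
        for chrom, (genes, pseudo) in counts.items()
        if genes + pseudo > 0
    }


# Hypothetical (genes, pseudogenes) counts, not the real worm data.
counts = {"I": (300, 100), "X": (450, 50)}
proportions = dead_gene_proportions(counts)
```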


The data plot above
indicates genome to
genome over all age.
The percentage composition for each of the 20 amino acids is graphed in decreasing order of the implied amino acid composition in the pseudogene set.
In the bottom part of the figure, the G difference for each amino acid composition is indicated by a bar.
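
The composition comparison described in this caption can be sketched as follows. The sequences are toy examples. Reading the "G difference" as pseudogene-set composition minus gene-set composition is an assumption.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences):
    """Percentage composition of the 20 standard amino acids.

    Assumes the sequences contain at least one standard residue.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Toy sequences standing in for the gene set and the implied pseudogene set.
genes = ["MKLLA", "MAAGK"]
pseudo = ["MKLLL", "LLAGK"]

comp_g = composition(genes)
comp_psi = composition(pseudo)

# Order amino acids by decreasing composition in the pseudogene set.
order = sorted(AMINO_ACIDS, key=lambda aa: comp_psi[aa], reverse=True)
# Assumed meaning of the difference bars: pseudogene set minus gene set.
diff = {aa: comp_psi[aa] - comp_g[aa] for aa in order}
```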



Listed are the largest sequence families in the worm, ranked by genes and by pseudogenes
They are named for their particular representative
Four of the top 10 paralog gene families, when ranked by number, are functionally uncharacterized
Three of the top 10 pseudogene families are among the biggest families when ranked by number of genes
Pseudofolds



These charts are ranked in terms of implied structural pseudofolds
Proteins encoded by the worm genome have been assigned to globular domain folds from the SCOP database
Why was this studied again?




To provide an initial estimate of the size, distribution, and characteristics of the pseudogene population in C. elegans, in an attempt to estimate the total number in humans
Found few pseudogenes in the worm genome that are apparently due to processing
Found a large uncharacterized gene family that makes up 2/3 of dead genes
The arms of the chromosomes are an unreliable place for encoding genes but are more likely to spawn new proteins