Analysis of the Positively Selected and
Non-Positively Selected Non-Protein
Coding Sequences of Chromosome 16
Kyle Tretina
with a team led by Dr. Pattle P. Pun
in collaboration with Mr. Ross Leung of CUHK
Introduction: Story of Evolutionary History
•Story: increasing
organismal complexity as
evolution proceeds
Bacteria < Fish < Primate < Human
WHY?
“But little Mouse,
you are not alone,
In proving foresight
may be in vain: The
best laid schemes of
mice and men Go
often askew, And
leave us nothing but
grief and pain, For
promised joy!” –
Robert Burns
(1785)
Genetics
Central Dogma: DNA → RNA → Protein
Complexity ~ Number of Genes?
Humans ~30,000
Flies ~ 14,000
G-Value Paradox
Complexity (K) ~ Gene Number (N)?
Relationship?
proportional: K ~ N
polynomial: K ~ N^a
exponential: K ~ a^N
factorial: K ~ N!
Jean-Michel Claverie: ON/OFF states
2^30,000 / 2^14,000 ≈ 3 × 10^4816
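The ON/OFF ratio above is too large to compute directly in floating point, so it can be checked with logarithms. This is a minimal sketch, assuming each gene is either ON or OFF so that N genes give 2^N states; the function name `state_ratio` is illustrative and not part of the original work.

```python
import math

def state_ratio(n_large: int, n_small: int) -> tuple[float, int]:
    """Return (mantissa, exponent) with 2**n_large / 2**n_small ≈ mantissa * 10**exponent."""
    # 2**a / 2**b = 2**(a - b); convert the power of two to a power of ten.
    log10_ratio = (n_large - n_small) * math.log10(2)
    exponent = math.floor(log10_ratio)
    mantissa = 10 ** (log10_ratio - exponent)
    return mantissa, exponent

# Gene counts taken from the slide: humans ~30,000, flies ~14,000.
mantissa, exponent = state_ratio(30_000, 14_000)
print(f"{mantissa:.1f} x 10^{exponent}")
```

Working in log space avoids building the 9,000-digit integers involved, and it makes the point of the slide visible: a roughly twofold difference in gene number becomes a difference of thousands of orders of magnitude in possible states.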
Goal
Determine the role of non-coding DNA in gene regulation
by looking at the functions of non-coding SNPs that are
positively selected or non-positively selected on chromosome
16
Definitions
SNP: single nucleotide polymorphism
Variable between populations
Importance likely due to stability of variation
Selection: description of phenomena that only organisms best
adapted to their environment tend to survive and create
progeny
Gene-selection algorithm and neutral selection theory (wrench)
Methods Overview
HapMap Database → Selection Data List of Chr16 SNPs
UCSC Genome Database Mirror → SNP flanking sequence
TRANSFAC → related transcription factor data for each
SNP flanking sequence
PReMod → confirm results
HapMap Phase I Data
HapMap Project: an international effort to identify and
catalog genetic similarities and differences in human beings
(Haplotype Maps), also includes:
Selection Data List of Chr16 SNPs
~25,000 non-positively selected
~5,000 positively selected
UCSC Genome Browser
Genome.UCSC.edu: a website containing several reference
sequences and tools for visual and computational analysis
Methods:
Enter each RSID (SNP identifier) from the list
Note intersecting sequences
Copy/Paste Sequences
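The manual steps above amount to cutting a window of sequence around each SNP. This is a rough sketch of that idea on a toy string, assuming a 0-based SNP index and a fixed flank length; the reference sequence, the rsID labels, and the function name are hypothetical placeholders, not data or tooling from the project.

```python
def flanking_sequence(reference: str, snp_index: int, flank: int) -> tuple[str, str, str]:
    """Split a reference into (upstream, snp_base, downstream) around a 0-based SNP index."""
    if not 0 <= snp_index < len(reference):
        raise IndexError("SNP index lies outside the reference sequence")
    # Clamp the window so SNPs near either end still return what is available.
    start = max(0, snp_index - flank)
    end = min(len(reference), snp_index + flank + 1)
    return reference[start:snp_index], reference[snp_index], reference[snp_index + 1:end]

# Toy reference and made-up SNP records (label -> 0-based index).
reference = "ACGTACGTTAGCCGATAGCT"
snps = {"rs_example_1": 8, "rs_example_2": 1}

for rsid, index in snps.items():
    upstream, base, downstream = flanking_sequence(reference, index, flank=5)
    print(rsid, f"{upstream}[{base}]{downstream}")
```

Looping over a list of identifiers like this is what replaces the enter, note, copy/paste cycle once the sequence data is available locally.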
UCSC Genome Browser Mirror
Efficiency
~70 seq/hr for 1.5 yrs = ~1/3 of sequences gathered
2hrs
Online Instructions, but Complicated Data Structure
Henry Ford: 1.1 million lines source code
Many thanks to Dr. Hayward (Wheaton College CS Faculty)
Sequences Collected
Graph 1. The distributions of the positively selected SNPs
used in the study across human chromosome 16
Graph 2. The distributions of the non-positively selected
SNPs used in the study across human chromosome 16
TRANSFAC
TRANSFAC: a relational database, available via the web as
six flat files including various data concerning transcription
factors, DNA-binding sites, and target genes
Automation at CUHK
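To illustrate why a single-base change in a flanking sequence can matter for transcription factor binding, here is a simplified sketch that scans a sequence for a consensus motif written in IUPAC ambiguity codes. It is not the TRANSFAC search itself, which was run by the collaborators at CUHK; the consensus `TATAWAW` and the two allele sequences are illustrative assumptions.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_sites(sequence: str, consensus: str) -> list[int]:
    """Return 0-based start positions where the consensus matches, overlaps included."""
    body = "".join(IUPAC[code] for code in consensus.upper())
    # A lookahead lets the scan report overlapping matches.
    pattern = re.compile(f"(?=({body}))")
    return [match.start() for match in pattern.finditer(sequence.upper())]

# Same flanking sequence with two alleles at one position (made-up example).
allele_a = "GGCTATAAAAGGC"
allele_c = "GGCTATACAAGGC"
consensus = "TATAWAW"

print(find_sites(allele_a, consensus))
print(find_sites(allele_c, consensus))
```

In this toy case one allele keeps the motif and the other loses it, which is the kind of difference the flanking-sequence lookups are meant to surface.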
PReMod
PReMod: a new database of genome-wide cis-regulatory
module (CRM) predictions for both the human and the
mouse genomes.
Enter ranges for SNP sequences
Look for same pattern as TRANSFAC
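Entering a range and checking it against predicted modules is an interval-overlap test. The sketch below assumes inclusive coordinates on one chromosome; the module IDs and coordinates are invented for illustration and are not PReMod records.

```python
def overlapping_modules(snp_start: int, snp_end: int,
                        modules: list[tuple[str, int, int]]) -> list[str]:
    """Return IDs of modules whose inclusive [start, end] range overlaps the SNP range."""
    # Two inclusive ranges overlap when each one starts before the other ends.
    return [
        module_id
        for module_id, start, end in modules
        if start <= snp_end and snp_start <= end
    ]

# Hypothetical module predictions: (id, start, end).
modules = [("mod_A", 100, 250), ("mod_B", 400, 520), ("mod_C", 510, 600)]

print(overlapping_modules(505, 515, modules))
print(overlapping_modules(300, 350, modules))
```

A linear scan is enough for a sketch; for tens of thousands of SNP ranges, sorting the modules by start position and using binary search would be the usual next step.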
Analysis
MySQL Tables
Programmed Scripts:
Word Patterns: e.g. keywords, recurring identifiers
Unique Entries
Progress Statistics
Overlap between non-positively selected and positively selected SNPs
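The script categories listed above (unique entries, progress, overlap) can be sketched with plain Python sets and counters. This stands in for the MySQL tables and scripts used in the project, which are not shown in the slides; the rsID labels and the factor assignments are placeholder data, and only the progress figures come from Table 1.

```python
from collections import Counter

def summarize(hits: list[tuple[str, str]]) -> dict:
    """Summarize (rsid, factor) hits: total rows, unique factors, per-factor counts."""
    factors = [factor for _rsid, factor in hits]
    return {
        "total": len(factors),
        "unique": set(factors),
        "counts": Counter(factors),
    }

# Placeholder hit lists: one row per (SNP, transcription factor) match.
positive_hits = [("rs1", "SP1"), ("rs1", "AP-1"), ("rs2", "SP1")]
nonpositive_hits = [("rs3", "SP1"), ("rs4", "GATA-1")]

pos = summarize(positive_hits)
non = summarize(nonpositive_hits)

# Overlap: factors that appear in both selection classes.
shared = pos["unique"] & non["unique"]
print("shared factors:", sorted(shared))

# Progress: sequences gathered out of RS numbers listed (non-positive row of Table 1).
print(f"progress: {6173 / 25622:.0%}")
```

In SQL the same summaries would typically be `COUNT(DISTINCT ...)` queries and a join between the two tables; the set operations here are the in-memory equivalent.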
Results
SNP Selection   RS Numbers   Sequences Gathered
Non-Positive    25,622       6,173 (24%)
Positive        4,750        4,750 (100%)
Table 1. A summary of the manual SNP flanking sequence gathering from the UCSC Genome Browser
Results
SNP Selection  Total TRANSFAC Entries  No Sites    Unique        Matches in Other Dataset  To Be Looked Up
Non-Positive   25,594                  1,611 (6%)  3,218 (13%)   82 (<1%)                  20,765 (81%)
Positive       33,770                  2,437 (7%)  30,972 (92%)  10,641 (32%)              361 (1.0%)
Conclusions
Data not all in yet
Possible implications:
Central Dogma Biology: information flow
Quantification of Genetic Natural Selection
Views of Complexity of Humans
Lesson Learned: value of bioinformatics
High volume data requires computational analysis, not manual
Acknowledgements
Many thanks to Dr. Pun, for letting me get involved in this
project, for his vision and mentorship.
Special thanks to Dr. Hayward, for putting in extra hours unpaid
so that a student can follow his dreams of graduate school.
Thanks to our collaborators at the Chinese University of Hong
Kong – Dr. Tsui and Mr. Leung – for accessing the TRANSFAC
database for us, and for being flexible to the demands of our
project.
The most thanks to God, for blessing me with the opportunity to
work hard and learn. I pray that I might always be able to do these
two things earnestly and voraciously.