Analysis of the Positively Selected and
Non-Positively Selected Non-Protein
Coding Sequences of Chromosome 16
Kyle Tretina
with a team led by Dr. Pattle P. Pun
in collaboration with Mr. Ross Leung of CUHK
Introduction: Story of Evolutionary History
• Story: increasing organismal complexity as evolution proceeds
Bacteria < Fish < Primate < Human
WHY?
“But little Mouse, you are not alone,
In proving foresight may be in vain:
The best laid schemes of mice and men
Go often askew,
And leave us nothing but grief and pain,
For promised joy!”
– Robert Burns (1785)
Genetics
• Central Dogma: DNA → RNA → Protein
• Complexity ~ Number of Genes?
  • Humans ~30,000
  • Flies ~14,000
G-Value Paradox
Complexity (K) ~ Gene Number (N)?
• Relationship?
  • proportional: K ~ N
  • polynomial: K ~ N^a
  • exponential: K ~ a^N
  • factorial: K ~ N!
• Jean-Michel Claverie: ON/OFF states
  • 2^30,000 / 2^14,000 ≈ 3 × 10^4816
Goal
• Determine the role of non-coding DNA in gene regulation by looking at the functions of non-coding SNPs that are positively selected or non-positively selected on chromosome 16
Definitions
• SNP: single nucleotide polymorphism
  • Variable between populations
  • Importance likely due to stability of variation
• Selection: the phenomenon in which only the organisms best adapted to their environment tend to survive and produce progeny
  • Gene-selection algorithm and neutral selection theory (wrench)
Methods Overview
• HapMap Database Selection Data → List of Chr16 SNPs
• UCSC Genome Database Mirror → SNP flanking sequence
• TRANSFAC → related transcription factor data for each SNP flanking sequence
• PReMod → confirm results
HapMap Phase I Data
• HapMap Project: an international effort to identify and catalog genetic similarities and differences in human beings (Haplotype Maps); also includes:
• Selection Data → List of Chr16 SNPs
  • ~25,000 non-positively selected
  • ~5,000 positively selected
UCSC Genome Browser
• Genome.UCSC.edu: a website containing several reference sequences and tools for visual and computational analysis
• Methods:
  • Enter each rsID (SNP identifier) from the list
  • Note intersecting sequences
  • Copy/paste sequences
UCSC Genome Browser Mirror
• Efficiency
  • ~70 seq/hr for 1.5 yrs = ~1/3 of sequences gathered
  • 2 hrs
• Online instructions, but complicated data structure
  • Henry Ford: 1.1 million lines of source code
• Many thanks to Dr. Hayward (Wheaton College CS Faculty)
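Once the sequence data are held locally, pulling a SNP's flanking sequence reduces to slicing a window around its position. The sketch below is an illustration only: the function names, the flank length, and the toy sequence are assumptions, and the project's actual scripts against the mirror are not shown in these slides. It takes a 1-based SNP position and returns a 0-based, half-open window, the coordinate style used by UCSC tables.

```python
def flanking_window(snp_pos: int, flank: int, chrom_len: int) -> tuple[int, int]:
    """Return a 0-based half-open [start, end) window around a 1-based SNP position.

    The window is clipped at the chromosome ends.
    """
    start = max(0, snp_pos - 1 - flank)
    end = min(chrom_len, snp_pos + flank)
    return start, end


def flanking_sequence(chrom_seq: str, snp_pos: int, flank: int) -> str:
    """Return the SNP base plus up to `flank` bases on each side."""
    start, end = flanking_window(snp_pos, flank, len(chrom_seq))
    return chrom_seq[start:end]


# Toy example (hypothetical sequence, not chromosome 16 data).
toy = "ACGTACGTAC"
print(flanking_sequence(toy, snp_pos=5, flank=2))  # GTACG
```

Looping a function like this over a list of SNP positions is what replaces the enter, note, and copy/paste cycle of the manual approach.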
Sequences Collected
• Graph 1. The distribution of the positively selected SNPs used in the study across human chromosome 16
• Graph 2. The distribution of the non-positively selected SNPs used in the study across human chromosome 16
TRANSFAC
• TRANSFAC: a relational database, available via the web as six flat files, including various data concerning transcription factors, DNA-binding sites, and target genes
• Automation at CUHK
PReMod
• PReMod: a new database of genome-wide cis-regulatory module (CRM) predictions for both the human and the mouse genomes
• Enter ranges for SNP sequences
• Look for same pattern as TRANSFAC
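Checking SNP ranges against predicted modules amounts to an interval lookup. The following is a minimal sketch of that idea, assuming the CRM predictions have been exported as non-overlapping (start, end) pairs on one chromosome. The function name and all coordinates are made up for illustration and do not come from PReMod.

```python
from bisect import bisect_right


def snps_in_modules(snp_positions, modules):
    """Map each SNP position to the (start, end) module that contains it.

    `modules` is assumed to hold non-overlapping closed intervals.
    SNPs that fall outside every module are left out of the result.
    """
    modules = sorted(modules)
    starts = [start for start, _ in modules]
    hits = {}
    for pos in snp_positions:
        # Index of the last module whose start is <= pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and modules[i][0] <= pos <= modules[i][1]:
            hits[pos] = modules[i]
    return hits


# Hypothetical coordinates for illustration only.
crms = [(500, 650), (100, 200)]
print(snps_in_modules([150, 300, 650], crms))
# {150: (100, 200), 650: (500, 650)}
```

Sorting once and using binary search keeps the lookup fast even for tens of thousands of SNPs.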
Analysis
• MySQL tables
• Programmed scripts:
  • Word patterns: e.g. keywords, recurring identifiers
  • Unique entries
  • Progress statistics
  • Overlap between non-positively selected and positively selected SNPs
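The kinds of scripts listed above can be sketched with plain Python collections. This is an illustration under stated assumptions, not the project's code: the real analysis ran against MySQL tables, and the function names, identifiers, and description strings here are hypothetical.

```python
from collections import Counter


def summarize(positive, non_positive):
    """Count unique entries in each set and list the identifiers they share."""
    pos_set, non_set = set(positive), set(non_positive)
    return {
        "unique_positive": len(pos_set),
        "unique_non_positive": len(non_set),
        "shared": sorted(pos_set & non_set),
    }


def keyword_counts(descriptions, keywords):
    """Count how many descriptions mention each keyword (case-insensitive)."""
    counts = Counter()
    for text in descriptions:
        lowered = text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                counts[keyword] += 1
    return counts


# Made-up identifiers and annotation text for illustration only.
print(summarize(["rs1", "rs2", "rs2"], ["rs2", "rs3"]))
print(keyword_counts(["Zinc finger factor", "zinc binding", "homeobox"],
                     ["zinc", "homeobox"]))
```

Set operations cover the unique-entry and overlap questions directly, and the counter gives a quick view of recurring words in the transcription factor annotations.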
Results
SNP Selection    RS Numbers    Sequences Gathered
Non-Positive     25,622        6,173 (24%)
Positive         4,750         4,750 (100%)
Table 1. A summary of the manual SNP flanking sequence gathering from the UCSC Genome Browser
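The percentages in Table 1 follow from simple division. As a quick worked check, using only the counts reported in the table (the variable and function names are illustrative):

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100 * part / whole


# (sequences gathered, RS numbers) per selection class, from Table 1.
progress = {
    "Non-Positive": (6173, 25622),
    "Positive": (4750, 4750),
}

for label, (gathered, total) in progress.items():
    print(f"{label}: {percent(gathered, total):.0f}% gathered")

# Combined progress across both classes.
all_gathered = sum(g for g, _ in progress.values())
all_total = sum(t for _, t in progress.values())
overall = percent(all_gathered, all_total)
print(f"Overall: {overall:.0f}% gathered")
```

The combined figure comes to about 36%, in line with the roughly one third of sequences gathered by hand.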
Results
SNP Selection    Total     No Sites      Unique          TRANSFAC Entries to Be Looked Up    Matches in Other Dataset
Non-Positive     25,594    1,611 (6%)    3,218 (13%)     20,765 (81%)                        82 (<1%)
Positive         33,770    2,437 (7%)    30,972 (92%)    10,641 (32%)                        361 (1.0%)
Conclusions
• Data not all in yet
• Possible implications:
  • Central Dogma biology: information flow
  • Quantification of genetic natural selection
  • Views of the complexity of humans
• Lesson learned: value of bioinformatics
  • High-volume data requires computational analysis, not manual
Acknowledgements
• Many thanks to Dr. Pun, for letting me get involved in this project, and for his vision and mentorship.
• Special thanks to Dr. Hayward, for putting in extra unpaid hours so that a student can follow his dreams of graduate school.
• Thanks to our collaborators at the Chinese University of Hong Kong, Dr. Tsui and Mr. Leung, for accessing the TRANSFAC database for us, and for being flexible to the demands of our project.
• The most thanks to God, for blessing me with the opportunity to work hard and learn. I pray that I might always be able to do these two things earnestly and voraciously.