Detection of Transcription Factor Binding Sites
MICHAEL MORRA
CSE 4939W
Background
DNA is composed of four chemical bases:
Adenine – A
Thymine – T
Guanine – G
Cytosine – C
Image from: http://www.genetest.org/page5.html
Background (Continued)
Each individual organism has a unique DNA sequence
The DNA sequence contains information that a cell can use to construct proteins
Each set of instructions within this sequence is called a gene
Image from: http://www.buzzle.com/articles/point-mutations.html
Transcription Factors
Cells regulate the expression of genes with proteins known as transcription factors
Each transcription factor binds to the DNA sequence, turning a gene on or off
Image from: http://www.cs.uiuc.edu/homes/sinhas/work.html
Binding Sites
The portions of the DNA where transcription factors are able to bind are known as binding sites
The binding sites of a single transcription factor may vary in sequence
Introduction
Detecting binding sites is important for understanding the regulatory network of an organism
Because binding sites can vary considerably, searching for them within a DNA sequence by hand is tedious
Project
Implement a method that accurately and precisely locates transcription factor binding sites within a DNA sequence.
Data
4 species (human, mouse, fruit fly and yeast)
Human: 26 transcription factors, 300 binding sites
Mouse: 12 transcription factors, 98 binding sites
Fruit fly: 6 transcription factors, 51 binding sites
Yeast: 8 transcription factors, 75 binding sites
Multiple Sequence Alignment
To analyze the data effectively, each transcription factor’s binding sites need to be aligned
http://www.ebi.ac.uk/Tools/clustalw2/index.html
>s1
GACTTTTCGCT
>s2
CGATTTTCTCG
>s3
GCATTTTCCCA
>s4
AGAGAAAACCC
>s5
GAATAACCCAAGAGAAA
>s6
ACAGAAAAATC
>s7
CGAGAAAATCG
>s8
TGGTTTTCCCG
>s9
GGGTTTCTCCC
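The sequences above are in FASTA format: a header line starting with `>` followed by the sequence. As a minimal sketch of how such a file could be read in C++ before or after alignment, the block below uses a hypothetical `FastaRecord` struct and `readFasta` function; neither name comes from the original project.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One FASTA record: the header text after '>' and its sequence.
// (Hypothetical type for illustration.)
struct FastaRecord {
    std::string name;
    std::string sequence;
};

// Read FASTA records from a stream. Sequence text may span several lines.
std::vector<FastaRecord> readFasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        // Drop a trailing carriage return left by Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;
        if (line[0] == '>') {
            // Start a new record named by the header line.
            records.push_back({line.substr(1), ""});
        } else if (!records.empty()) {
            // Append sequence text to the most recent record.
            records.back().sequence += line;
        }
    }
    return records;
}
```

Passing a `std::ifstream` opened on the alignment file, or a `std::istringstream` holding the text above, should yield one record per `>` header.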
Scoring
Berg and von Hippel method
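Score = sum over j = 1..l of log10( (nj(tj) + 0.5) / (nj(0) + 0.5) )
l = length of the sequence to be scored
j = position in the sequence
nj(b) = number of times base b occurs at position j in the alignment
tj = base at position j in the sequence to be scored
nj(0) = number of times the most common base occurs at position j in the alignment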
Scoring Example
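Sequence to be scored: ACTCA
Counts of the most common base at each position:
n1(0) = 3, n2(0) = 2, n3(0) = 2, n4(0) = 2, n5(0) = 2
Counts of the scored sequence's base at each position:
n1(A) = 3, n2(C) = 1, n3(T) = 2, n4(C) = 1, n5(A) = 2
Each term is log10 of (count + 0.5) / (most common count + 0.5), so matching the most common base contributes log(1) = 0
Score = log(1) + log(1.5/2.5) + log(1) + log(1.5/2.5) + log(1) = -0.443697499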
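The worked example can be reproduced with a short C++ sketch. The alignment behind the example is not included in this transcript, so the five-row alignment mentioned in the usage note is made up purely so that its column counts match the numbers above. The names `baseIndex`, `countBases` and `score` are illustrative and not taken from the original project. Base-10 logarithms and the 0.5 added to each count are inferred from the example's arithmetic.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Per-position counts of A, C, G, T (in that order).
using Counts = std::vector<std::array<int, 4>>;

// Map a base to a column index; returns -1 for gaps or unknown characters.
int baseIndex(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Count how often each base occurs at each position of an alignment.
// Assumes every aligned row has the same length as the first row.
Counts countBases(const std::vector<std::string>& alignment) {
    std::size_t width = alignment.empty() ? 0 : alignment[0].size();
    Counts counts(width, std::array<int, 4>{});
    for (const std::string& row : alignment) {
        for (std::size_t j = 0; j < width && j < row.size(); ++j) {
            int b = baseIndex(row[j]);
            if (b >= 0) ++counts[j][b];
        }
    }
    return counts;
}

// Sum over positions of log10((n_j(t_j) + 0.5) / (n_j(0) + 0.5)).
// Assumes seq has one base per alignment position.
double score(const Counts& counts, const std::string& seq) {
    double total = 0.0;
    for (std::size_t j = 0; j < counts.size() && j < seq.size(); ++j) {
        int b = baseIndex(seq[j]);
        int observed = (b >= 0) ? counts[j][b] : 0;
        int mostCommon = *std::max_element(counts[j].begin(), counts[j].end());
        total += std::log10((observed + 0.5) / (mostCommon + 0.5));
    }
    return total;
}
```

With the made-up alignment AGTGA, AGTGA, ACACC, CTAAC, GACTT, calling `score(countBases(alignment), "ACTCA")` gives two non-zero terms of log10(1.5/2.5) each, matching the score in the example.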
Leave One Out Cross Validation
To determine the effectiveness of the algorithm, a cross validation technique is used
One binding site is left out when the multiple sequence alignment is performed, and the left-out sequence is then scored against that alignment
If the algorithm is effective, the left-out sequence should score higher than most other binding sites within that species (more than 80-90% of them)
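One round of this check might look like the sketch below. It makes a simplifying assumption that the slides do not: the sites are already aligned and of equal length, so the realignment step is replaced by simply skipping the left-out row when counting. The function names are hypothetical, and the score uses the `(count + 0.5)` ratio with base-10 logs from the scoring example.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

using Counts = std::vector<std::array<int, 4>>;

// Map a base to a column index; -1 for gaps or unknown characters.
int baseIndex(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Per-position base counts over all rows except the one at index `skip`.
Counts countBasesExcept(const std::vector<std::string>& rows, std::size_t skip) {
    std::size_t width = rows.empty() ? 0 : rows[0].size();
    Counts counts(width, std::array<int, 4>{});
    for (std::size_t r = 0; r < rows.size(); ++r) {
        if (r == skip) continue;  // leave this site out of the counts
        for (std::size_t j = 0; j < width && j < rows[r].size(); ++j) {
            int b = baseIndex(rows[r][j]);
            if (b >= 0) ++counts[j][b];
        }
    }
    return counts;
}

// Sum over positions of log10((n_j(t_j) + 0.5) / (n_j(0) + 0.5)).
double score(const Counts& counts, const std::string& seq) {
    double total = 0.0;
    for (std::size_t j = 0; j < counts.size() && j < seq.size(); ++j) {
        int b = baseIndex(seq[j]);
        int observed = (b >= 0) ? counts[j][b] : 0;
        int mostCommon = *std::max_element(counts[j].begin(), counts[j].end());
        total += std::log10((observed + 0.5) / (mostCommon + 0.5));
    }
    return total;
}

// Fraction of the other sites that score strictly lower than the left-out
// site, using counts built without that site.
double fractionOutscored(const std::vector<std::string>& tfSites,
                         std::size_t leftOut,
                         const std::vector<std::string>& otherSites) {
    if (otherSites.empty()) return 0.0;
    Counts counts = countBasesExcept(tfSites, leftOut);
    double heldOut = score(counts, tfSites[leftOut]);
    std::size_t beaten = 0;
    for (const std::string& site : otherSites) {
        if (score(counts, site) < heldOut) ++beaten;
    }
    return static_cast<double>(beaten) / otherSites.size();
}
```

Repeating this for every site of a transcription factor and checking whether the returned fraction exceeds the 0.8-0.9 target would give the cross-validation result; ties are counted as not outscored here, which is one possible convention.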
Implementation
Language: C++
Input: a multiple sequence alignment of a transcription factor’s binding sites, and all binding sites of a species
Output: scores, and the results of leave-one-out cross validation
Desired Functionality
Handle cases where the sequence to be scored is longer or shorter than the multiple sequence alignment
Slide the sequence over the alignment and take the highest-scoring portion
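The sliding step could be sketched as follows. This is one possible reading of the idea rather than the project's implementation: a longer sequence is scanned with a window as wide as the alignment, and a shorter sequence is placed at each offset within the alignment. All names are hypothetical, and the per-position term is the `(count + 0.5)` ratio with base-10 logs from the scoring example.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

using Counts = std::vector<std::array<int, 4>>;

// Map a base to a column index; -1 for gaps or unknown characters.
int baseIndex(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Per-position base counts for an alignment with equal-length rows.
Counts countBases(const std::vector<std::string>& alignment) {
    std::size_t width = alignment.empty() ? 0 : alignment[0].size();
    Counts counts(width, std::array<int, 4>{});
    for (const std::string& row : alignment) {
        for (std::size_t j = 0; j < width && j < row.size(); ++j) {
            int b = baseIndex(row[j]);
            if (b >= 0) ++counts[j][b];
        }
    }
    return counts;
}

// Score `len` bases of seq starting at `pos` against alignment positions
// starting at `col`.
double scoreWindow(const Counts& counts, std::size_t col,
                   const std::string& seq, std::size_t pos, std::size_t len) {
    double total = 0.0;
    for (std::size_t k = 0; k < len; ++k) {
        const std::array<int, 4>& column = counts[col + k];
        int b = baseIndex(seq[pos + k]);
        int observed = (b >= 0) ? column[b] : 0;
        int mostCommon = *std::max_element(column.begin(), column.end());
        total += std::log10((observed + 0.5) / (mostCommon + 0.5));
    }
    return total;
}

// Best score over every full overlap of seq and the alignment.
double bestSlidingScore(const Counts& counts, const std::string& seq) {
    std::size_t width = counts.size();
    if (width == 0 || seq.empty()) return 0.0;
    double best = -std::numeric_limits<double>::infinity();
    if (seq.size() >= width) {
        // Longer (or equal) sequence: slide an alignment-wide window along it.
        for (std::size_t pos = 0; pos + width <= seq.size(); ++pos)
            best = std::max(best, scoreWindow(counts, 0, seq, pos, width));
    } else {
        // Shorter sequence: slide it across the alignment positions.
        for (std::size_t col = 0; col + seq.size() <= width; ++col)
            best = std::max(best, scoreWindow(counts, col, seq, 0, seq.size()));
    }
    return best;
}
```

A shorter sequence sums fewer terms than a full-length one, so its best score is not directly comparable to full-length scores; how to normalize for that is left open here.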
Timeline
Oct 4th – Oct 18th: Create multiple sequence alignments for all transcription factors
Oct 18th – Nov 15th: Implement scoring algorithm in C++
Nov 15th – Nov 29th: Implement leave-one-out methods
Nov 29th – Dec 6th: Tweaks and improvements
Questions?
Image from: http://www.ideacenter.org/contentmgr/showdetails.php/id/954