Detection of Transcription Factor Binding Sites

MICHAEL MORRA
CSE 4939W
Background
• DNA is composed of a combination of 4 chemical bases:
  Adenine – A
  Thymine – T
  Guanine – G
  Cytosine – C
Image from: http://www.genetest.org/page5.html
Background (Continued)
• Each individual organism has a unique DNA sequence
• The DNA sequence contains information which can be used by a cell to construct proteins
• Each set of instructions within this sequence is called a gene
Image from: http://www.buzzle.com/articles/point-mutations.html
Transcription Factors
• Cells use proteins known as transcription factors to regulate the expression of genes
• Each transcription factor binds to the DNA sequence, turning a gene on or off
Image from: http://www.cs.uiuc.edu/homes/sinhas/work.html
Binding Sites
• The portions of the DNA where transcription factors are able to bind are known as binding sites
• The binding sites of a single transcription factor may vary in sequence
Introduction
• The detection of binding sites is important to understanding the regulatory network of an organism
• As binding sites can vary considerably, searching for them within a DNA sequence is tedious
Project
• Implement a method that accurately and precisely locates transcription factor binding sites within a DNA sequence
Data
• 4 species (Human, Mouse, Fruit Fly & Yeast)
  • Human: 26 transcription factors, 300 binding sites
  • Mouse: 12 transcription factors, 98 binding sites
  • Fruit Fly: 6 transcription factors, 51 binding sites
  • Yeast: 8 transcription factors, 75 binding sites
Multiple Sequence Alignment
• To analyze the data effectively, each transcription factor’s binding sites need to be aligned
• Alignment tool: ClustalW2, http://www.ebi.ac.uk/Tools/clustalw2/index.html
>s1
GACTTTTCGCT
>s2
CGATTTTCTCG
>s3
GCATTTTCCCA
>s4
AGAGAAAACCC
>s5
GAATAACCCAAGAGAAA
>s6
ACAGAAAAATC
>s7
CGAGAAAATCG
>s8
TGGTTTTCCCG
>s9
GGGTTTCTCCC
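The sequences above are in FASTA format: a name line starting with ">" followed by the sequence. As a minimal sketch of how such input could be read in C++, assuming plain uppercase sequences, the following reader collects one record per name line. FastaRecord and readFasta are hypothetical names chosen for illustration; they are not taken from the project.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One FASTA record: the text after '>' and the sequence letters.
struct FastaRecord {
    std::string name;
    std::string sequence;
};

// Reads FASTA records from a stream. Sequence lines that follow a name
// line are concatenated, so sequences wrapped over several lines work too.
std::vector<FastaRecord> readFasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        // Drop a trailing carriage return left by Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;
        if (line[0] == '>') {
            records.push_back(FastaRecord{line.substr(1), ""});
        } else if (!records.empty()) {
            records.back().sequence += line;
        }
    }
    return records;
}

// Convenience wrapper for text that is already in memory.
std::vector<FastaRecord> readFastaText(const std::string& text) {
    std::istringstream in(text);
    return readFasta(in);
}
```

Lines that appear before the first name line are ignored, which is one possible choice; a stricter reader could reject such input instead.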
Scoring
• Berg and von Hippel method
  • l = length of the sequence to be scored
  • j = position in the sequence
  • nj(b) = number of times base b occurs at position j in the alignment
  • tj = base at position j in the sequence to be scored
  • nj(0) = number of times the most common base occurs at position j
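The formula itself did not survive on this slide. Judging from the definitions here and the arithmetic on the Scoring Example slide, it appears to be a sum of log ratios with a 0.5 pseudocount and base-10 logarithms. The following is a reconstruction under that assumption, not a quotation of the original slide:

```latex
\text{Score} = \sum_{j=1}^{l} \log_{10}\!\left( \frac{n_j(t_j) + 0.5}{n_j(0) + 0.5} \right)
```

Here n_j(t_j) is the count of the scored sequence's base at position j, and n_j(0) is the count of the most common base at that position. A sequence that matches the most common base everywhere scores 0, and every mismatch makes the score more negative.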
Scoring Example
• Sequence to be scored: ACTCA
n1(0) = 3   n1(A) = 3
n2(0) = 2   n2(C) = 1
n3(0) = 2   n3(T) = 2
n4(0) = 2   n4(C) = 1
n5(0) = 2   n5(A) = 2
Score = log(1) + log(1.5/2.5) + log(1) + log(1.5/2.5) + log(1) = -0.443697499 (base-10 logarithms)
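The arithmetic above can be sketched in C++ as follows. This is a minimal illustration, assuming a gap-free alignment whose rows all have the same length, uppercase bases, a 0.5 pseudocount, and base-10 logarithms (all inferred from the example). ColumnCounts, baseIndex, countBases and score are hypothetical names, not the project's own.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Counts of A, C, G, T at one alignment column.
using ColumnCounts = std::array<int, 4>;

// Maps an uppercase base to an index 0..3, or -1 for anything else.
int baseIndex(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Builds nj(b): how often base b occurs at column j of the alignment.
std::vector<ColumnCounts> countBases(const std::vector<std::string>& alignment) {
    assert(!alignment.empty());
    std::vector<ColumnCounts> counts(alignment[0].size(), ColumnCounts{});
    for (const std::string& row : alignment) {
        assert(row.size() == counts.size());  // rows must have equal length
        for (std::size_t j = 0; j < row.size(); ++j) {
            int b = baseIndex(row[j]);
            if (b >= 0) ++counts[j][b];
        }
    }
    return counts;
}

// Sums log10((nj(tj) + 0.5) / (nj(0) + 0.5)) over all positions j.
double score(const std::vector<ColumnCounts>& counts, const std::string& seq) {
    assert(seq.size() == counts.size());
    double total = 0.0;
    for (std::size_t j = 0; j < seq.size(); ++j) {
        int best = *std::max_element(counts[j].begin(), counts[j].end());
        int b = baseIndex(seq[j]);
        int observed = (b >= 0) ? counts[j][b] : 0;
        total += std::log10((observed + 0.5) / (best + 0.5));
    }
    return total;
}
```

The alignment used on the slide was not preserved. A made-up five-row alignment (AGTGA, AGTGA, ACACC, CTAAC, GTCAG) produces the same column counts as listed above, and scoring ACTCA against it gives about -0.4437.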
Leave One Out Cross Validation
• To determine the effectiveness of the algorithm, a cross validation technique is used
• This technique involves leaving one binding site out when the multiple sequence alignment is performed, and then scoring that left-out sequence
• If the algorithm is effective, the left-out sequence should score higher than most (>80-90%) of the other binding sites within that species
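One way to express this check in C++ is sketched below. It simplifies the procedure: the sites are assumed to be gap-free and of equal length, so "re-aligning without the left-out site" reduces to skipping one row, whereas the actual procedure would rerun the multiple sequence alignment. The scoring assumes a 0.5 pseudocount and base-10 logarithms, and all names (scoreAgainst, leaveOneOutRank) are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

using ColumnCounts = std::array<int, 4>;  // counts of A, C, G, T

int baseIndex(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Scores seq against the alignment rows while ignoring row `skip`.
double scoreAgainst(const std::vector<std::string>& rows, std::size_t skip,
                    const std::string& seq) {
    double total = 0.0;
    for (std::size_t j = 0; j < seq.size(); ++j) {
        ColumnCounts n{};  // zero-initialised counts for column j
        for (std::size_t r = 0; r < rows.size(); ++r) {
            if (r == skip) continue;
            assert(rows[r].size() == seq.size());
            int b = baseIndex(rows[r][j]);
            if (b >= 0) ++n[b];
        }
        int best = *std::max_element(n.begin(), n.end());
        int b = baseIndex(seq[j]);
        int observed = (b >= 0) ? n[b] : 0;
        total += std::log10((observed + 0.5) / (best + 0.5));
    }
    return total;
}

// Holds out tfSites[heldOut], scores it against the remaining sites, and
// returns the fraction of speciesSites that score strictly lower.
double leaveOneOutRank(const std::vector<std::string>& tfSites, std::size_t heldOut,
                       const std::vector<std::string>& speciesSites) {
    assert(tfSites.size() >= 2 && heldOut < tfSites.size());
    assert(!speciesSites.empty());
    double heldOutScore = scoreAgainst(tfSites, heldOut, tfSites[heldOut]);
    std::size_t beaten = 0;
    for (const std::string& other : speciesSites) {
        if (scoreAgainst(tfSites, heldOut, other) < heldOutScore) ++beaten;
    }
    return static_cast<double>(beaten) / speciesSites.size();
}
```

A returned value above roughly 0.8 to 0.9 would correspond to the success criterion stated above. Ties are counted as not beaten, which is a conservative choice made for this sketch.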
Implementation
• C++
• Input
  • Multiple Sequence Alignment of a transcription factor’s binding sites
  • All binding sites of a species
• Output
  • Scores
  • Results of Leave One Out Cross Validation
Desired Functionality
• Deal with cases where the sequence to be scored is longer or shorter than the multiple sequence alignment
  • Slide the sequence over the alignment and take the highest scoring portion
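The sliding idea could look roughly like this in C++. It is a sketch under stated assumptions: per-column base counts are already available, scoring uses a 0.5 pseudocount with base-10 logarithms, and only full overlaps are considered (no overhanging ends). scoreWindow and bestSlidingScore are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

using ColumnCounts = std::array<int, 4>;  // counts of A, C, G, T per column

int baseIndex(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Scores `len` letters of seq starting at seqStart against `len`
// alignment columns starting at colStart.
double scoreWindow(const std::vector<ColumnCounts>& counts, std::size_t colStart,
                   const std::string& seq, std::size_t seqStart, std::size_t len) {
    double total = 0.0;
    for (std::size_t k = 0; k < len; ++k) {
        const ColumnCounts& n = counts[colStart + k];
        int best = *std::max_element(n.begin(), n.end());
        int b = baseIndex(seq[seqStart + k]);
        int observed = (b >= 0) ? n[b] : 0;
        total += std::log10((observed + 0.5) / (best + 0.5));
    }
    return total;
}

// Tries every full-overlap placement of the shorter item along the longer
// one and returns the highest score found.
double bestSlidingScore(const std::vector<ColumnCounts>& counts, const std::string& seq) {
    assert(!counts.empty() && !seq.empty());
    std::size_t len = std::min(counts.size(), seq.size());
    std::size_t placements = std::max(counts.size(), seq.size()) - len + 1;
    bool seqIsLonger = seq.size() > counts.size();
    double best = -std::numeric_limits<double>::infinity();
    for (std::size_t off = 0; off < placements; ++off) {
        std::size_t colStart = seqIsLonger ? 0 : off;
        std::size_t seqStart = seqIsLonger ? off : 0;
        best = std::max(best, scoreWindow(counts, colStart, seq, seqStart, len));
    }
    return best;
}
```

One caveat of this sketch: a sequence shorter than the alignment is scored over fewer columns, so its score is not directly comparable with scores of full-length sequences unless some normalisation is added.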
Timeline
• Oct 4th – Oct 18th
  • Create multiple sequence alignments for all transcription factors
• Oct 18th – Nov 15th
  • Implement scoring algorithm in C++
• Nov 15th – Nov 29th
  • Implement leave one out methods
• Nov 29th – Dec 6th
  • Tweaks and Improvements
Questions?
Image from: http://www.ideacenter.org/contentmgr/showdetails.php/id/954