GeneGUI Project Presentation

Motif Detection in Yeast
Vishakh
Joe Bertolami
Nick Urrea
Jeff Weiss
Overview
1. Problem Statement
2. Motivation
3. History
4. Our Approach
5. Evaluation
6. Results
7. Discussion
8. References

1. The Problem
Find regulatory sequences in the upstream
region of yeast DNA.
 Regulatory sequences are segments of
DNA where proteins can bind to enhance
transcription of a gene.

The Problem

We are given:

Upstream genome, which consists of:
 Gene families, each of which consists of:
  Individual genes, each of which consists of:
   Strings like ATGC

We had to find substrings that are unusually
frequent in a gene family, given their
distribution in the whole upstream
genome.

The Problem
We emulated techniques devised by van
Helden.
 We worked on a similar data set and tried to
reproduce, and even improve on, his findings.

2. Motivation
Organisms like yeast share many genes
with humans.
 As a result, they share diseases too.
 Finding regulatory sequences in yeast
might lead to medical advances.
 Might lead to therapies for diseases such
as cystic fibrosis.

3. History
Previous century saw rapid advances in
genetics.
 Scientific community trying to get a better
understanding of various genomes.
 This particular technique was developed
by Jacques van Helden.

4. Our Approach
Extract all substrings of lengths 6-8 in the
upstream genome.
 Calculate frequency of occurrence of each
substring.
 Put this data in a table.
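
The counting step above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `count_substrings` and the toy sequences are assumptions made for the example.

```python
from collections import Counter


def count_substrings(sequences, min_len=6, max_len=8):
    """Count every substring of length min_len..max_len in the given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for k in range(min_len, max_len + 1):
            # Slide a window of width k across the sequence.
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts


# Hypothetical toy "upstream genome" made of two short sequences.
upstream = ["ATGCACGTGA", "TTCACGTGCC"]
table = count_substrings(upstream)
```

Here `table` plays the role of the frequency table: it maps each substring to the number of times it occurs.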

Our Approach
Consider a gene family.
 Find all substrings in it, count their
frequencies, and build a table.
 For each entry, add the probability of
occurrence.
 Use the above data to calculate three scores.

Our Approach
Score 1: Expected Occurrence / Actual
Occurrence
 Use probability of occurrence and size of
gene family to calculate expected
occurrence.
 Divide by actual occurrence.
 Low score -> Unusually frequent
substring.
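
A minimal sketch of Score 1 follows. It assumes that the probability of occurrence is the substring's count in the upstream genome divided by the number of positions in that genome, and that the "size of gene family" is the number of positions where the substring could start. Both readings, the function names, and the numbers are assumptions made for illustration.

```python
def background_probability(genome_count, genome_positions):
    """Probability of seeing the substring at one position of the upstream genome."""
    return genome_count / genome_positions


def expected_occurrences(probability, family_positions):
    """Expected count in the gene family if it behaved like the whole genome."""
    return probability * family_positions


def expected_over_actual(probability, family_positions, actual):
    """Score 1: expected occurrences divided by actual occurrences."""
    if actual <= 0:
        raise ValueError("actual occurrences must be positive")
    return expected_occurrences(probability, family_positions) / actual


# Hypothetical numbers: 50 hits in 1,000,000 genome positions,
# 8 hits in a gene family with 8,000 positions.
p = background_probability(50, 1_000_000)
score = expected_over_actual(p, 8_000, 8)  # expected 0.4, actual 8
```

With these numbers the score is about 0.05, well below 1, which is the "unusually frequent" case.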

Our Approach
Score 2: Poisson Distribution
 Use expected and actual number of
occurrences.
 If substring occurs ‘n’ times, calculate
probability of ‘n’ occurrences using Poisson
Distribution.
 Lower probability -> Unusually frequent
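
The Poisson score can be sketched as below, taking the expected number of occurrences as the Poisson mean. The function names are assumptions. The first function gives the probability of exactly n occurrences, as described above; the second is a common alternative (the probability of n or more) and is included only for comparison.

```python
import math


def poisson_pmf(n, expected):
    """Probability of exactly n occurrences when `expected` are expected."""
    return math.exp(-expected) * expected ** n / math.factorial(n)


def poisson_tail(n, expected):
    """Probability of n or more occurrences (alternative, not from the slides)."""
    return 1.0 - sum(poisson_pmf(k, expected) for k in range(n))


# Hypothetical numbers: 0.4 occurrences expected, 8 observed.
score = poisson_pmf(8, 0.4)
```

A very small value means the observed count is hard to explain by chance under this model.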

Our Approach
Score 3: Binomial Distribution
 Use probability of occurrence, sizes of
genome and gene family and actual
occurrences.
 If substring occurs ‘n’ times, calculate
probability of ‘n’ occurrences using
Binomial Distribution.
 Lower probability -> Unusually frequent
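
A sketch of the binomial score, under the assumption that each position in the gene family is treated as an independent trial with success probability equal to the substring's probability of occurrence in the genome. The function name and the numbers are illustrative. The calculation is done in log space so that large family sizes do not overflow.

```python
import math


def binomial_pmf(n, trials, p):
    """Probability of exactly n successes in `trials` trials, for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0 <= n <= trials:
        return 0.0
    # log of the binomial coefficient C(trials, n) via log-gamma
    log_coeff = (math.lgamma(trials + 1)
                 - math.lgamma(n + 1)
                 - math.lgamma(trials - n + 1))
    log_prob = log_coeff + n * math.log(p) + (trials - n) * math.log1p(-p)
    return math.exp(log_prob)


# Hypothetical numbers: p = 50 / 1,000,000 from the genome,
# 8,000 positions in the gene family, 8 observed occurrences.
score = binomial_pmf(8, 8_000, 50 / 1_000_000)
```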

Our Approach
Sort substrings by a score.
 Take the top sequences and create a
probability matrix.
 Iterate on the probability matrix to get a
probabilistic model of the regulatory
sequence.
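
One way to picture the first two steps is sketched below: sort by score, keep the best few substrings, and record how often each base appears at each position. The scores and sequences are made up, the sequences are assumed to be already aligned and of equal length, and the iterative refinement step is not shown because the slides do not describe it.

```python
def probability_matrix(sequences, alphabet="ACGT"):
    """Per-position base frequencies for equal-length, aligned sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for pos in range(length):
        column = [s[pos] for s in sequences]
        matrix.append({base: column.count(base) / len(sequences)
                       for base in alphabet})
    return matrix


# Hypothetical scores (lower = more unusual), not taken from the results.
scores = {"CACGTG": 1e-8, "CACGTT": 3e-6, "CACATG": 2e-5, "ATATAT": 0.4}
top = sorted(scores, key=scores.get)[:3]   # three best-scoring substrings
matrix = probability_matrix(top)
```

Each entry of `matrix` is a dictionary giving the fraction of the top sequences that carry each base at that position.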

5. Evaluation Metrics
Van Helden’s results from his ’98 paper and his
website.
 The ’98 paper used old data, so it is not very
reliable for evaluation.
 The website is very useful since it works on
current data and calculates results
dynamically.
 We compared our output to his.

Evaluation Metrics

Also compared the three score types to find
the best method.

6. Results
Comparison of Results for MET FAMILY

Gene     Van Helden’s site   Binomial Dist   Poisson Dist   Expected / Actual   Old Paper
CACGTG   1                   1               3              4                   1
ACGTGA   2                   2               1              2                   3
TCACGT   3                   3               2              1                   2
ATATAT   4                   4               N/A            N/A                 5
TATATA   5                   5               N/A            N/A                 10
AACTGT   6                   7               4              28                  4
ACAGTT   7                   6               N/A            29                  N/A
ACACAC   8                   9               7              N/A                 N/A
GTGTGT   9                   8               6              N/A                 N/A

Results

Probability matrices generated successfully!
7. Discussion
Paper results clearly outdated.
 Close correlation with van Helden’s site.
 Binomial distribution best, followed by
Poisson and Expected/Actual

Discussion

Why don’t Binomial results perfectly
match van Helden’s site?

 Van Helden paper only outlines general method.
 He uses many filters and adjustments.
 Limited info about them on site.
 We used similar, but not same, filters.
 Example: Purge sequences that appear twice in a row.

Discussion

Future work

 Find more filters.
 Try other similar organisms’ genomes.
 Biologically verify results!

Discussion

What we learnt

Biology!
 First-hand look at genetic data
 Became more familiar with genes
 Clearly understood what the fuss about genetics is all about
Computer Science
 Teamwork
 Interfacing CS with other scientific disciplines

References

van Helden, J., André, B. & Collado-Vides,
J. (1998). Extracting regulatory sites from
the upstream region of yeast genes by
computational analysis of oligonucleotide
frequencies. J Mol Biol 281(5), 827-42.

van Helden, J., Rios, A. F. & Collado-Vides,
J. (2000). Discovering regulatory elements
in non-coding sequences by analysis of
spaced dyads. Nucleic Acids Res 28(8),
1808-18.