projects - Computer Science and Engineering
Download
Report
Transcript projects - Computer Science and Engineering
CSE280a: Projects
Vineet Bafna
CSE280
Vineet Bafna
Project Logisitics
•
•
•
Research project (70%)
Work individually, or in groups of 2
Two presentations:
–
Introductory presentation: Feb 1st week (20 minutes) (20% grade)
•
•
•
•
•
–
One on one meeting with instructor (end February) (10% grade)
•
–
Discuss preliminary results
Final presentation (last 2-3 classes): (30% grade)
•
•
CSE280
Describe the goals of the project
Describe your (computational) formulation
Summarize/critique reading assignment
Present an algorithm
Constructive criticism of other projects
Submit a final report
Final presentation
Vineet Bafna
Project 1: disease gene mapping
•
•
Recall, Linkage Disequilibrium
In the absence of recombination,
–
–
•
Correlation between columns
The joint probability Pr[A=a,B=b] is different from
P(a)P(b)
With extensive recombination
–
CSE280
Pr(a,b)=P(a)P(b)
Vineet Bafna
Measures of LD
•
•
Consider two bi-allelic sites with alleles
marked with 0 and 1
Define
–
–
•
•
P00 = Pr[Allele 0 in locus 1, and 0 in locus 2]
P0* = Pr[Allele 0 in locus 1]
Linkage equilibrium if P00 = P0* P*0
D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1) = …
CSE280
Vineet Bafna
LD can be used to map disease genes
LD
0
1
1
0
0
1
D
N
N
D
D
N
•
•
LD decays with distance from the disease allele.
By plotting LD, one can short list the region
containing the disease gene.
CSE280
Vineet Bafna
Multiple loci
LD
0
0
1
0
0
1
D
N
N
D
D
N
•
0
1
1
0
0
0
In complex diseases, multiple loci interact to confer
disease susceptibility
CSE280
Vineet Bafna
Testing for multiple loci
•
Assume SNP matrix with n individuals, m loci. Testing
for all sets of 5 SNPs implies a huge number of
computations?
m
n
5
•
Can you come out with computational strategies that
can speed it up?
CSE280
Vineet Bafna
Speeding up multiple locus computations
•
•
•
•
A filtering strategy?
Input: a SNP matrix with one or more pairs
that interactively associate
Output: a set of SNP pairs that includes the
interacting pair(s).
Method should be fast, and should NOT
consider all pairs.
CSE280
Vineet Bafna
Speeding up the computations
k=2
•
•
•
0
0
1
0
0
1
1
1
0
0
1
1
1
0
1
0
1
1
Correlated SNPs should also have low hamming distance.
Random SNPs should have high hamming distance.
Strategy: select k individuals at random.
– Hash each individual restricted to k individuals
– Correlated SNPs should fall in the same bin with high
probability
CSE280
Vineet Bafna
Project 2: mtDNA phylogeny
•
•
In the absence of recombination, the history
of mitochondrial DNA can be expressed by a
tree.
The goal of this project is to build a robust
phylogeny using a heuristic modification of
the perfect phylogeny.
CSE280
Vineet Bafna
The Genographic project
https://www3.nationalgeographic.com/genographic/atlas.html
•
The genographic project aims to trace geographic origins of the
human race using mitochondrial DNA.
CSE280
Vineet Bafna
Without recurrent mutations
•
Unique tree can
explain the
evolutionary
history
A
B
C
D
E
1
1
0
1
0
1
2
1
0
1
0
0
3
0
1
0
1
0
4
0
0
1
0
0
r
1
E
3
2
B
5
4
D
C
CSE280
Vineet Bafna
A
5
0
0
0
1
0
With recurrent mutations
•
•
•
•
Adding another
individual F
destroys perfect
phylogeny
Why?
It is not so easy to
place F
Can you suggest a
strategy?
r
A
B
C
D
E
F
1
E
1
1
0
1
0
1
0
3
0
1
0
1
0
0
4
0
0
1
0
0
0
3
2
B
2
4
2
1
0
1
0
0
1
5
1
D
C
CSE280
Vineet Bafna
A
F
5
0
0
0
1
0
0
Tests of Selection
•
In class, we have discussed alleles that can
be selectively neutral, or under active
selection
–
•
•
Active selection may be positive or negative
How do we identify regions under positive, or
negative selection?
Balancing selection: sometimes it is helpful
for a population to
CSE280
Vineet Bafna
Adaptive Selection
•
•
Selection leads to loss of heterozygosity (will be
explained in detail in the next lecture).
Can you come up with a test for selection?
CSE280
Vineet Bafna
Balancing selection
•
•
•
Sometimes both alleles are useful in a population,
and it helps to have both around
A simple example is when diversity is important (the
two variants help maintain diversity)
Bipolar disorder genes could be under balancing
selection
–
–
•
•
High creativity which might confer some
selective/reproductive advantage.
Depression offers a disadvantage
If so, the tests for this disorder might be tricky.
How can we identify regions under balancing
selection?
CSE280
Vineet Bafna
Testing for Balancing Selection
•
•
•
Adaptive selection leads to loss of heterozygosity (will be explained in
detail in the next lecture).
Balancing selection leads to two dominant haplotypes
Can you come up with a test for balancing selection?
CSE280
Vineet Bafna
Project: Primer design for cancer
genomics
CSE280
Vineet Bafna
The Science behind Gleevec
Fusions
–
observed in leukemia, lymphoma,
and sarcomas
•
“Philadelphia Translocation”
–
CSE280
Drugs target this fusion protein
Vineet Bafna
Fluoroscent in situ hybridization
•
Cancer genomes show extensive structural variation
CSE280
Vineet Bafna
Assaying for tumor variants
•
•
•
Most tumors start off with a
single cell, which then
proliferate.
Drugs like Gleevec are used
well after cancer has taken
hold.
Can we detect the cancer
early by detecting the
genomic abnormality?
–
•
If a very few cells in the
person are cancerous, can
we still detect it?
Can we track a patient
through his treatment?
CSE280
Vineet Bafna
Cancer genomics
deletion
•
•
In cancers, large genetic changes can occur,
including deletions, inversions, and rearrangements
of genomes
In the early stages, only a few cells will show this
CSE280
Vineet Bafna
Polymerase Chain Reaction
•
•
PCR is a technique for amplifying and detecting a specific portion
of the genome
Amplification takes place if the primers are ‘appropriate’ distance
apart (<2kb)
CSE280
Vineet Bafna
Assaying for Rare Variants
•
PCR can be used to assay for a given
genomic abnormality, even in a heterogenous
population of tumor and normal cells
PCR
Extract
Genomic
DNA
Tumor cell
CSE280
Distance too large
for amplification
Vineet Bafna
Detection
Variant Variants
•
•
What if the variant is the minority in the cell
population?
What if deletion boundaries are uncertain?
Patient A
Deletion
Patient B
Deletion
Patient C
Deletion
CSE280
Vineet Bafna
Observed variation in deletion size
Sizes of homozygous deletions in cell lines from different human cancers.
(scale is in megabases).
CSE280
Vineet Bafna
Primer Approximation Multiplex PCR (PAMP)*
•
Multiple primers are optimally spaced, flanking a
breakpoint of interest
–
–
•
Upstream of breakpoint, forward primers
Downstream of breakpoint, reverse primers
The primers are run in a multiplex PCR reaction
–
Any pair can form a viable product
Patient B
Patient C
Deletion
CSE280
Deletion
Vineet Bafna
Experimental Design (500Kb region)
•
10 sets of 25 primers:
upstream and
downstream
–
–
250 upstream
250 downstream
•
Primer-pairs closest to
breakpoint amplified
•
Assay by oligo array
Goal: Computational
selection of an ‘optimal’
primer set
CSE280
Vineet Bafna
Goal
•
•
•
Input, a collection of primers
Identify a subset of primers that do not cross-hybridize, are
unique, yet cover the region completely
Use combinatorial optimization, simulated annealing, integer
linear programming…..
CSE280
Vineet Bafna
Spectral Networks Algorithms for
De Novo Interpretation of Tandem Mass
Spectra
Nuno Bandeira, Ph.D.
Department of Computer Science and Engineering,
University of California, San Diego
ProtIG seminar series
September 21, 2007
CSE280
Vineet Bafna
Proteins and their modifications
Proteins are fundamental players in the regulation
of biological processes.
encodes for
DNA
Proteins
regulate
Knowing proteins involves knowing
many things. This dissertation focuses on:
- Identification
- Sequencing
- Post-translational modifications ( )
CSE280
Vineet Bafna
Protein sequences and modifications
From a computational perspective, a protein can be
represented as a string over a weighted alphabet:
Protein sequence:
…AFSRLEMILGF…
Amino acid
Mass
A
71
F
147
(obtained via enzymatic
digestion)
S
87
Modifications change amino acid masses:
R
156
L
113
E
129
M
131
I
113
G
57
AFSRL
SRLEMILGF
EMILG
SRLEMILGF
Subsequences are
called peptides
Mass(M)=131
Mass(SRLEMILGF)=1047
SRLEM ILGF
Mass( )=16
Mass(M )=147
CSE280
Mass(SRLEM ILGF)=1063
Vineet Bafna
Nobel prize in chemistry, 2002
CSE280
Vineet Bafna
What is mass spectrometry?
Amino acid
Mass
A
71
F
147
S
87
…
CSE280
http://nobelprize.org/chemistry/laureates/2002/chemadv02.pdf
Vineet Bafna
Tandem Mass Spectrometry (MS/MS)
Protein Sequence: …THISISAVERYLARGESAMPLEPRTEINSEQENCE…
Modified peptide
LARG*E
Prefix masses: b
L
R
A
L
A
L
A
L
R
A
G
Prefix masses : b
L
R
A
E
G
E
L
A
R
G
E
L
A
R
G
E
L
G*
R
A
E
R
G*
E
R
G*
E
G*E
LA
E
L
Intensity
LA
GE
LAR
RGE
LARG
ARGE
E
L
Intensity
Suffix masses : y
Mass (m/z)
CSE280
MS/MS spectrum
E
G*
y: Suffix masses
PM
Modification: any event
that changes the mass
at a specific site.
RG*E
LARG*
ARG*E
Peptide LARGE
Mass (m/z)
Vineet Bafna
Mass shifts
Example of a real MS/MS spectrum
b10
Symmetric
CSE280
y12
Vineet Bafna
Tandem Mass Spectrometry (MS/MS)
Peptides
Large set of
MS/MS spectra
Tandem
Mass Spectrometry
Enzymatic digestion
Proteins
…
…
Database search
De novo sequencing
Database of
known peptides
u
MDERHILNM, KLQWVCSDL,
PTYWASDL, ENQIKRSACVM,
TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA,
GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT,
ALKIIMNVRT,SEQUENCE,
SEQUENCE,
HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN..
s
e
s
f
e
q
q
n
c
e
q
Peptide SEQUENCE
CSE280
Vineet Bafna
u
e
u
n
e
e
n
c
c
e
es
Mixture spectra
Peptide A: NLAFFQLR
Peptide B: ALDDILNLK
Sometimes, the instrument generates a single spectrum from two or more peptides:
?
CSE280
Vineet Bafna
Mixture spectrum
How to identify mixture spectra?
CSE280
Vineet Bafna
Proposed approach
•
•
When identifying a mixture spectrum of peptides A,B,
assume you have non-mixture spectra for the same
peptides.
Compare the non-mixture spectra of known peptides
to putative mixture spectra to determine peptide
identifications
CSE280
Vineet Bafna
Project description
•
Implement an algorithm to identify mixture spectra
from pairs of peptides by combining previously
identified spectra from isolated peptides.
•
Test the above implementation by simulating mixture
spectra using an existing database of spectra from
isolated peptides.
•
Propose a scoring procedure to separate correct
from false identifications.
Nuno Bandeira
[email protected]
CSE280
Vineet Bafna