
BioinfoGRID Symposium 2007
Mathematical methods for the analysis
of the Barcode Of Life
P. Bertolazzi, G. Felici
Istituto di Analisi dei Sistemi ed Informatica del
Consiglio Nazionale delle Ricerche
Optimization Laboratory for Data Mining
DNA Barcoding 1/4
• DNA barcoding is a new technique that uses a short DNA
sequence from a standardized and agreed-upon position in the
genome as a molecular diagnostic for species-level identification.
• The chosen barcode sequence is a small portion of
mitochondrial DNA (mt-DNA) that differs by several percent,
even in closely related species, and carries enough information
to identify the species of an individual.
• It is easy to isolate and analyze.
• Moreover, it summarizes many properties of the entire mt-DNA
sequence.
DNA Barcoding 2/4
A typical animal cell. Within the cytoplasm, the major organelles and cellular structures
include: (1) nucleolus (2) nucleus (3) ribosome (4) vesicle (5) rough endoplasmic
reticulum (6) Golgi apparatus (7) cytoskeleton (8) smooth endoplasmic reticulum (9)
mitochondria (10) vacuole (11) cytosol (12) lysosome (13) centriole.
DNA Barcoding 3/4
• The first studies on the barcode (2003) are due to Hebert (see [1]
for the latest results and a complete bibliography)
• Two mt-DNA subsequences (genes) have been proposed as barcodes:
– Cytochrome c Oxidase I (COI)
– Cytochrome b
• Since 2003, COI has been used by Hebert to study fishes, birds,
and other species
• Hebert employs the Neighbor Joining (NJ) method [2], originally
proposed to obtain phylogenetic trees, and identifies each
species as represented by a distinct, non-overlapping cluster of
sequences in the NJ tree.
DNA Barcoding 4/4
• Recent studies [1] show that even fragments of the COI
sequence have the same expressive power as the entire
sequence
• The Consortium for the Barcode of Life (CBOL) is an
international initiative devoted to developing DNA barcoding
as a global standard for the identification of biological species
http://www.barcoding.si.edu/
• Data Analysis Working Group (CBOL and DIMACS,
Rutgers)
http://dimacs.rutgers.edu/Workshops/BarcodeResearchChallenges2007/
http://dimacs.rutgers.edu/Workshops/DNAInitiative/
Research challenges 1/2
• Specimen identification versus species discovery: our knowledge of
species is not always complete
• Is one species similar to another or not?
• Optimizing sample size: barcodes are not easy to measure, and large
samples are very expensive
• Using character-based barcodes: an alternative to comparing
specimens in terms of overall percent sequence similarity
• Shrinking the barcode
Research challenges 2/2
• Shrinking the barcode: we want to identify the portions of the barcode
that are relevant for each species; this would make it easier to identify
new data on species and to study them
A barcode fragment for 2 species
Species 1: CTGGCATAGTAGGTACTGCCCTTAGCCTCCCCCCAGTCCTCTCCC
Species 2: CTGGCATAGTCGGAACCGCTCTCAGCCTACCCCCAGCCCTTTCTC
• 45 sites
• The two sequences differ on only 8 sites
• In this case a single site is sufficient to distinguish the 2 species (see the sketch below)
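A minimal sketch (ours, not from the slides) of this comparison in Python, listing the 1-based sites where the two fragments differ:

    # The two COI fragments from the example above.
    s1 = "CTGGCATAGTAGGTACTGCCCTTAGCCTCCCCCCAGTCCTCTCCC"
    s2 = "CTGGCATAGTCGGAACCGCTCTCAGCCTACCCCCAGCCCTTTCTC"

    # 1-based positions where the two sequences carry different bases.
    diff_sites = [i + 1 for i, (a, b) in enumerate(zip(s1, s2)) if a != b]
    print(len(s1), "sites; differing sites:", diff_sites)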
Main goal
• Given samples from different species, the
objective is to identify those combinations of
mutated nucleotides that have determined the
differences among species along the evolutionary
path
Our work 1/2
– We have addressed these challenges using supervised learning and
special methods that we have developed for the classification of
large-scale logic data
– Such methods have already proved successful in other bio-computational
problems (Tag SNP selection, microarray data analysis, genetic
diagnostics) [3]:
• Feature selection methods based on Integer Programming: aim
at finding a subset of bases that allows one to distinguish among species
• Logic classification methods based on logic programming: aim at
finding rules or formulas that can distinguish between individuals
of different species
Our Work 2/2
• Feature Selection is used to select a limited number of positions where
the values of the bases differ consistently from species to species;
• Logic classification methods use the selected positions to find formulas
with high semantic value that can provide a deep understanding of the
analyzed data.
Logic classification is computationally expensive:
Many features → Feature selection → Few features → Logic classification → Compact logic formulas
Supervised Learning: the standard setting
• Given:
• A finite number of objects described by a finite number of measures
X = (x1, x2, …, xn)
• An interesting characteristic of these objects Y (TARGET)
• Determine the relation between TARGET and measures: Y = F(X)
• It is only required that, for each set of measures X, F produces an output Y
Classification: Y ∈ {0, 1, 2, …}
Estimation: Y ∈ ℝ
• Very important
– The form of the relation is decided up front;
– The parameters of the relation are determined by the data;
– The results of the estimation need to be validated.
Supervised Learning: training and testing
Data → Training Set → choose model → estimate model → Testing Set → performance (ok?)
• If the data contains good information and the choice of the model is
correct, the model fits the training data well
• A good fit on the training data may provide information and knowledge
• The model should also adapt well to test data
• Only a good fit on the test data strengthens the knowledge extracted
and validates the forecasting or classification function of the model
The Special case of Logic Data 1/2
• Frequently, objects are (or can be) described only by logic attributes
(True/False)
• The classification function is expressed in terms of:
– Logic variables p, q, r, … with True/False values
– conjunction (p ∧ q), disjunction (p ∨ q), negation (¬p)

Logic variables (properties of the object), for five objects in classes A and I:

      A    A    A    I    I
S1    T    T    T    T    F
S2    T    F    F    T    ?
S3    F    F    F    T    F
S4    ?    T    F    ?    T

The same data encoded with 1 = True, -1 = False, 0 = missing:

      A    A    A    I    I
S1    1    1    1    1   -1
S2    1   -1   -1    1    0
S3   -1   -1   -1    1   -1
S4    0    1   -1    0    1

Separating formula: S1 ∧ ¬S3, encoded as the vector (1, 0, -1, 0).
“If the object shows both the presence of property 1 and the absence of
property 3, then it is class A, else it is class I.”
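The encoded formula can be checked mechanically. Below is a minimal sketch (our illustration, not the Lsquare implementation) that evaluates the clause vector (1, 0, -1, 0) on the encoded objects above:

    # Clause vector: 1 = variable must be True, -1 = must be False, 0 = unused.
    clause = (1, 0, -1, 0)  # S1 AND NOT S3

    def satisfies(obj, clause):
        # Every used literal must agree with the object's encoded value;
        # a missing value (0) never satisfies a used literal.
        return all(c == 0 or c == v for c, v in zip(clause, obj))

    objects = {  # the columns of the table above, encoded over S1..S4
        "obj1 (A)": (1, 1, -1, 0),
        "obj2 (A)": (1, -1, -1, 1),
        "obj3 (A)": (1, -1, -1, -1),
        "obj4 (I)": (1, 1, 1, 0),
        "obj5 (I)": (-1, 0, -1, 1),
    }
    for name, obj in objects.items():
        print(name, "-> A" if satisfies(obj, clause) else "-> I")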
The Special case of Logic Data 2/2
[Table: a 0/1 data matrix over six logic variables X1…X6, one row per object;
the rows split into two classes, shown in red and blue.]

IF [ ¬X1 ∧ ¬X2 ] OR [ ¬X2 ∧ ¬X3 ] OR [ ¬X1 ∧ ¬X3 ] → true for red, false for blue
IF [ X2 ∧ X6 ] OR [ X1 ∧ X3 ] OR [ X2 ∧ X3 ] OR [ X1 ∧ X2 ] → true for blue, false for red
Mining Logic Data
The Lsquare Logic Miner [4,5,6]:
• Builds logic separations in Disjunctive Normal Form (DNF)
• Iteratively identifies the clauses of the DNF that separate the largest
part of the objects in one class from all the objects of the other class
• Clause identification is based on the solution of a Minimum Cost
Satisfiability Problem (MINSAT), which is computationally hard

Each variable i is represented by two logic variables p_i, q_i (the variable
appears in the clause positively, negated, or not at all):

s_i =  1   if p_i = True,  q_i = False
      -1   if p_i = False, q_i = True
       0   if p_i = q_i = False

Worked example on the data above (one row per object, over S1…S4):

      S1   S2   S3   S4
A     T    T    F    ?
A     T    F    F    T
A     T    F    F    F
I     T    T    T    ?
I     F    ?    F    T

Clauses for the class-A objects (d_a becomes True when object a is left uncovered):
object 1: ¬q1 ∨ d1, ¬q2 ∨ d1, ¬p3 ∨ d1, ¬p4 ∨ d1, ¬q4 ∨ d1
object 2: ¬q1 ∨ d2, ¬p2 ∨ d2, ¬p3 ∨ d2, ¬q4 ∨ d2
object 3: ¬q1 ∨ d3, ¬p2 ∨ d3, ¬p3 ∨ d3, ¬p4 ∨ d3
Clauses excluding the class-I objects:
object 4: q1 ∨ q2 ∨ q3 ∨ q4 ∨ p4
object 5: p1 ∨ p2 ∨ q2 ∨ p3 ∨ q4
Satisfying solution:
p1 = True, q1 = False; p2 = False, q2 = False; p3 = False, q3 = True; p4 = False, q4 = False
i.e. the clause s1 ∧ ¬s3.
Mining Logic Data
Two MINSAT problems are solved at each iteration.

1. Select the largest separable subset: minimize the number of class-A objects
left uncovered, with cost c(True) = 1, c(False) = 0 on the d variables:

min Σ_{a ∈ A} c(d_a)
s.t.  ¬p_i ∨ ¬q_i                                       i = 1, 2, …, n
      ( ∨_{i: b_i = 1} q_i ) ∨ ( ∨_{i: b_i = 0} p_i )   ∀ b ∈ B
      ¬q_i ∨ d_a                                        ∀ a ∈ A, i : a_i = 1
      ¬p_i ∨ d_a                                        ∀ a ∈ A, i : a_i = 0

2. Identify the clause with the desired support on the separable subset A′:
minimize the cost of the chosen literals, with c′(True) = 1, c′(False) = 0 or
c′(True) = 0, c′(False) = 1 according to whether minimum or maximum support is sought:

min Σ_{i = 1, …, n} ( c′(p_i) + c′(q_i) )
s.t.  ¬p_i ∨ ¬q_i                                       i = 1, 2, …, n
      ( ∨_{i: b_i = 1} q_i ) ∨ ( ∨_{i: b_i = 0} p_i )   ∀ b ∈ B
      ¬q_i ∨ d_a                                        ∀ a ∈ A′, i : a_i = 1
      ¬p_i ∨ d_a                                        ∀ a ∈ A′, i : a_i = 0

Lsquare finds 4 DNF formulas:
1: A from B, min support
2: A from B, max support
3: B from A, min support
4: B from A, max support
ALL: majority vote, use undecided
The behavior of the formulas strongly interacts with the quality of the data…
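For intuition only, the toy instance above is small enough to solve by enumeration. The sketch below (our assumption; Lsquare solves this via MINSAT, not brute force) scans all 3^n candidate clauses and keeps one that excludes every class-I object while covering the most class-A objects:

    from itertools import product

    # Objects encoded over S1..S4: 1 = True, -1 = False, 0 = missing.
    A = [(1, 1, -1, 0), (1, -1, -1, 1), (1, -1, -1, -1)]
    I = [(1, 1, 1, 0), (-1, 0, -1, 1)]

    def covers(clause, obj):
        # A used literal with a missing value counts as not satisfied.
        return all(c == 0 or c == v for c, v in zip(clause, obj))

    # Each variable appears positively (1), negated (-1), or not at all (0).
    candidates = (cl for cl in product((1, 0, -1), repeat=4)
                  if not any(covers(cl, o) for o in I))
    best = max(candidates, key=lambda cl: sum(covers(cl, o) for o in A))
    print(best)  # (1, 0, -1, 0), i.e. S1 AND NOT S3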
Feature Selection
• Most methods for FS are based on a greedy construction of the feature set,
and do not take proper account of the interactions among the selected
features (correlation, collinearity)
• An interesting method adopts Integer Programming (Set Covering) to select
the smallest set of features that makes it possible to differentiate logic
records belonging to different classes

The Set Covering Approach for FS (Rutgers - LAD approach)
A binary variable x_j is associated with each feature, and a linear constraint
is associated with each pair of objects belonging to different classes. The
set covering model requires that each pair of objects belonging to different
classes be differentiated by at least one of the selected features (a sketch
follows the model):

min Σ_j x_j
s.t. Σ_j a_{(i,k) j} x_j ≥ 1      ∀ (i,k) : i ∈ A, k ∈ I
     x_j ∈ {0, 1}

where a_{(i,k) j} = 1 iff objects i and k belong to different classes and
differ on feature j, and x_j = 1 iff feature j is selected.
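A hedged sketch of this idea (a greedy approximation of the set covering program, not the exact integer program): a cross-class pair (i, k) is covered by feature j when the two records differ on j, and features are picked until all pairs are covered.

    from itertools import product

    def greedy_feature_cover(A, B):
        """A, B: lists of equal-length binary records from two classes."""
        n = len(A[0])
        pairs = {(i, k): {j for j in range(n) if A[i][j] != B[k][j]}
                 for i, k in product(range(len(A)), range(len(B)))}
        pairs = {p: s for p, s in pairs.items() if s}  # identical pairs are unseparable
        selected = set()
        while pairs:
            # Pick the feature covering the most still-uncovered pairs.
            j = max(range(n), key=lambda f: sum(f in s for s in pairs.values()))
            selected.add(j)
            pairs = {p: s for p, s in pairs.items() if j not in s}
        return selected

    print(greedy_feature_cover([(0, 1, 1), (0, 1, 0)], [(1, 1, 0), (0, 0, 0)]))

Unlike the exact model, such a greedy pass can return a non-minimum cover, which is precisely the interaction problem the integer programming formulation avoids.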
Feature Selection : a simplified version
To overcome the intractable dimensions of the quadratic model, we propose a
simplified version of the set covering model (Linear SC), where the number of
constraints is equal to the number of samples.
• f_i = feature i
• PA(i) = proportion of elements with f_i = 1 in class A
• PB(i) = proportion of elements with f_i = 1 in class B
• x_i = 1 iff feature i is chosen in the solution

Assuming PA(i) > PB(i), the coverage coefficients are:

d_ij = 1 if row j is in class A and f_i = 1
       0 if row j is in class A and f_i = 0
       1 if row j is in class B and f_i = 0
       0 if row j is in class B and f_i = 1

Coverage is given:

min Σ_i x_i
s.t. Σ_i d_ij x_i ≥ 1      ∀ j
     x_i ∈ {0, 1}

Coverage is maximized, given the bound δ on the number of features:

max β
s.t. Σ_i d_ij x_i ≥ β      ∀ j
     Σ_i x_i ≤ δ
     x_i ∈ {0, 1}

• The number of rows is linear in the number of examples
• Redundancy can be controlled
• May still need heuristics for a large number of features… (a solver sketch follows)
Application to Barcode Data
• Data set from the 2006 conference
• 1623 samples belonging to 150 different species
• Each sample is described by 690 nucleotides (columns)
• We search for a compact rule for each one of the 150 classes
• For each species k, we solve a 2-class learning problem:
– class A: all samples in class k
– class B: samples of all classes different from k
• We use Linear set covering for feature selection and logic
mining to determine the formulas on a training subset (80-90%
of the available data), and then test their classification
capabilities on the remaining data (a sketch of this protocol follows)
• Training and testing samples are drawn at random,
maintaining the same proportion in each class
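A sketch of this protocol (our reconstruction; the data and helper names are illustrative placeholders), using a stratified split and a one-vs-all relabeling per species:

    from sklearn.model_selection import train_test_split

    def one_vs_all(y, k):
        # class A = samples of species k, class B = all other species
        return [1 if label == k else 0 for label in y]

    # Placeholder data: X = barcode strings, y = species labels.
    X = ["ACGT", "ACGA", "ACGG", "ACGC", "TCGT", "TCGA", "TCGG", "TCGC"]
    y = ["sp1", "sp1", "sp1", "sp1", "sp2", "sp2", "sp2", "sp2"]

    # Hold out a test set, preserving class proportions (stratify=y).
    # 0.25 here only so the tiny toy split is feasible; the slides use 10-20%.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)
    yA_tr = one_vs_all(y_tr, "sp1")  # binary target for species "sp1"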
Application to Barcode Data
Pipeline: DATA SET → TRAINING SET / TEST SET → FEATURE SELECTION →
FORMULA EXTRACTION → prediction on the TEST SET
• The Integer Programming model associated with the Linear Set Covering
model is solved to optimality with the commercial solver ILOG CPLEX,
with 10, 20 and 30 as values for δ
• LSQUARE is used to separate each species from the other 149, and a
compact formula explaining each species is obtained
• The formulas are used to predict the species of each element in the
test set.
Application to Barcode Data
LSC construction
• PAjk = proportion of samples in class k with nucleotide A in column j
• PCjk = proportion of samples in class k with nucleotide C in column j
• PGjk = proportion of samples in class k with nucleotide G in column j
• PTjk = proportion of samples in class k with nucleotide T in column j
• PNAjk = proportion of samples in class ≠ k with nucleotide A in column j
• PNCjk = proportion of samples in class ≠ k with nucleotide C in column j
• PNGjk = proportion of samples in class ≠ k with nucleotide G in column j
• PNTjk = proportion of samples in class ≠ k with nucleotide T in column j

a_ij = 1 iff sample i is in class k and:
• nucleotide of sample i in column j is A and PAjk > 2·PNAjk, or
• nucleotide of sample i in column j is C and PCjk > 2·PNCjk, or
• nucleotide of sample i in column j is G and PGjk > 2·PNGjk, or
• nucleotide of sample i in column j is T and PTjk > 2·PNTjk

max β
s.t. Σ_j a_ij x_j ≥ β      ∀ i
     Σ_j x_j ≤ δ
     x_j ∈ {0, 1}

• Select the δ columns with the largest coverage β (a construction sketch follows)
• Use these columns to formulate a separation problem for each class with
Lsquare (1 vs all)
• Obtain a logic formula for each class
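A minimal sketch (our reconstruction of the construction above; variable names are ours) computing the per-class proportions and the coefficients a_ij:

    from collections import Counter

    def proportions(samples, j):
        # Fraction of each nucleotide at column j over the given samples.
        counts = Counter(s[j] for s in samples)
        return {b: counts[b] / len(samples) for b in "ACGT"}

    def build_a(samples, labels):
        n = len(samples[0])
        a = []
        for s, k in zip(samples, labels):
            in_k = [t for t, l in zip(samples, labels) if l == k]
            not_k = [t for t, l in zip(samples, labels) if l != k]
            # a_ij = 1 iff the base of sample i at column j is more than
            # twice as frequent in its class k as outside it.
            a.append([1 if proportions(in_k, j)[s[j]] >
                           2 * proportions(not_k, j)[s[j]] else 0
                      for j in range(n)])
        return a

    # The 4-sample example of the next slide: classes 1, 1, 2, 3.
    samples = ["ACGT", "AAAT", "AATA", "CGTC"]
    print(build_a(samples, labels=[1, 1, 2, 3]))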
An example

Four samples (columns) over four positions (rows); sample classes are 1, 1, 2, 3:

Position   s1 (class 1)   s2 (class 1)   s3 (class 2)   s4 (class 3)
1          A              A              A              C
2          C              A              A              G
3          G              A              T              T
4          T              T              A              C

Class 1 proportions:

Position   PA    PC    PG    PT        PNA    PNC    PNG    PNT
1          1.0   0.0   0.0   0.0       0.50   0.50   0.00   0.00
2          0.5   0.5   0.0   0.0       0.50   0.00   0.50   0.00
3          0.5   0.0   0.5   0.0       0.00   0.00   0.00   1.00
4          0.0   0.0   0.0   1.0       0.50   0.50   0.00   0.00

Class 2 proportions:

Position   PA    PC    PG    PT        PNA    PNC    PNG    PNT
1          1.0   0.0   0.0   0.0       0.67   0.33   0.00   0.00
2          1.0   0.0   0.0   0.0       0.33   0.33   0.33   0.00
3          0.0   0.0   0.0   1.0       0.33   0.00   0.33   0.33
4          1.0   0.0   0.0   0.0       0.00   0.33   0.00   0.67

Class 3 proportions:

Position   PA    PC    PG    PT        PNA    PNC    PNG    PNT
1          0.0   1.0   0.0   0.0       1.00   0.00   0.00   0.00
2          0.0   0.0   1.0   0.0       0.67   0.33   0.00   0.00
3          0.0   0.0   0.0   1.0       0.33   0.00   0.33   0.33
4          0.0   1.0   0.0   0.0       0.33   0.00   0.00   0.67

Resulting matrix a_ij (rows = samples, columns = positions) and model:

      pos1   pos2   pos3   pos4
s1    1      0      1      1
s2    1      1      1      1
s3    0      1      1      1
s4    1      1      1      1

max β
s.t. x1 + x3 + x4 ≥ β
     x1 + x2 + x3 + x4 ≥ β
     x2 + x3 + x4 ≥ β
     x1 + x2 + x3 + x4 ≥ β
     x1 + x2 + x3 + x4 ≤ δ

With δ = 1: x3 = 1, x1 = x2 = x4 = 0: not separable; x4 = 1, x1 = x2 = x3 = 0: separable
With δ = 2: x3 = x4 = 1, x1 = x2 = 0: separable

IF (X4 = T) THEN CLASS 1
IF (X4 = A) THEN CLASS 2
IF (X4 = C) THEN CLASS 3
Results
For each row of the table we solve:
• 1 set covering problem
• 150 logic classification problems
• δ ∈ {10, 20, 30}
• test % ∈ {10, 20}
• 3 random repetitions for each setting

δ      β    test %   training error   testing error
10     2    10       11.14%           11.38%
10     2    10        5.62%            8.98%
10     3    10        5.99%            7.19%
10     3    20        9.21%           10.17%
10     3    20        7.11%            9.75%
10     2    20        7.81%            8.05%
average               7.81%            9.25%
20     6    10        2.06%            2.40%
20     6    10        0.84%            1.20%
20     6    10        0.28%            0.60%
20     6    20        1.90%            2.12%
20     6    20        0.30%            1.27%
20     6    20        0.30%            1.27%
average               0.95%            1.48%
30     9    10        0.37%            0.60%
30     9    10        0.28%            0.60%
30     9    10        0.37%            0.60%
30     9    20        0.30%            1.69%
30     9    20        0.30%            0.85%
average               0.33%            0.87%
overall               3.03%            3.87%
Results
Site 580 appears 54 times in the formulas that discriminate each class from
the rest (approx. 1/3 of them).

[Charts: average error on test for different values of δ; mode of optimal β
for different values of δ]

Selected site   Occurrences in formulas
580             54
490             49
346             41
469             41
544             41
637             41
331             39
445             34
154             30
85              29
307             29
379             29
31              25
211             25
28              22
40              22
10              21
16              15
34              15
4               14
Results
SPECIES   DIM   CLAUSE(S)
0         1     274 A ∧ 499 T ∧ 580 C
1         1     19 T ∧ 172 T
2         1     340 G ∧ 343 A ∧ 445 C
3         1     445 T ∧ 499 T ∧ 580 G
4         1     172 C ∧ 445 T ∧ 493 G ∧ 499 C ∧ 580 A
5         1     58 G ∧ 430 T
6         1     268 C ∧ 289 A ∧ 334 C ∧ 430 T
7         1     136 A ∧ 277 C ∧ 445 T
8         2     58 A ∧ 121 T ∧ 172 C ; 277 T ∧ 499 G
9         1     163 T ∧ 274 A ∧ 334 G
10        1     19 T ∧ 331 C ∧ 652 T
11        1     4 C ∧ 10 T ∧ 274 C ∧ 289 A ∧ 445 A
12        1     340 G ∧ 445 G
13        1     277 A ∧ 340 A ∧ 430 T ∧ 445 C
14        1     121 C ∧ 331 T
15        1     58 A ∧ 283 C ∧ 331 T
Results
IF BASE IN POSITION 19 IS T AND
BASE IN POSITION 172 IS T THEN SPECIES IS … 1
IF BASE IN POSITION 340 IS G AND
BASE IN POSITION 343 IS A AND
BASE IN POSITION 445 IS C THEN SPECIES IS … 2
IF (BASE IN POSITION 58 IS A AND
BASE IN POSITION 121 IS T AND
BASE IN POSITION 172 IS C) OR
(BASE IN POSITION 277 IS T AND
BASE IN POSITION 499 IS G) THEN SPECIES IS … 8
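A sketch (hypothetical helper, not part of Lsquare) of how such formulas classify a new sequence; each rule is a set of clauses, and each clause a set of (1-based position, base) requirements:

    rules = {
        1: [{19: "T", 172: "T"}],
        2: [{340: "G", 343: "A", 445: "C"}],
        8: [{58: "A", 121: "T", 172: "C"}, {277: "T", 499: "G"}],
    }

    def classify(seq, rules):
        for species, clauses in rules.items():
            if any(all(seq[pos - 1] == base for pos, base in clause.items())
                   for clause in clauses):
                return species
        return None  # undecided: no formula fires

For example, classify(sample, rules) returns 8 when the sample carries T at position 277 and G at position 499.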
Related work
• Haplotype Inference by Parsimony: a new, very efficient heuristic
• Tag SNP and SNP reconstruction problems
• Phylogenetic trees in polyploid organisms
Conclusions and Future work
• The logic technique is very powerful for identifying
small non-contiguous subsequences of the barcode
• We are testing the technique on a very large set of
data from Lepidoptera
• We will compare our technique with the NJ method [2]
• CBOL and the DAWG have asked us to implement
our technique as a web service
• We are designing a software platform that implements
algorithms for all the above problems
References
[1] M. Hajibabaei, G. A. C. Singer, E. L. Clare, P. D. N. Hebert, Design and
applicability of DNA arrays and DNA barcodes in biodiversity monitoring,
BMC Biology, 2007.
[2] N. Saitou, M. Nei, The neighbor-joining method: a new method for
reconstructing phylogenetic trees, Mol Biol Evol, 1987.
[3] P. Bertolazzi, G. Felici, P. Festa, G. Lancia, Logic Classification and
Feature Selection for Biomedical Data, Computers & Mathematics with
Applications, online version (2007).
[4] G. Felici, K. Truemper, A MINSAT Approach for Learning in Logic Domains,
INFORMS Journal on Computing, 2002.
[5] K. Truemper, Design of Logic-Based Intelligent Systems, Wiley-Interscience,
2004.
[6] G. Felici, K. Truemper, The Lsquare System for Mining Logic Data,
Encyclopedia of Data Warehousing and Mining, 2005.
[7] P. Bertolazzi, G. Felici, Species Classification with Optimized Logic
Formulas, poster, EMBO Conference, Rome, May 2007.
[8] P. Bertolazzi, G. Felici, Species Classification with Optimized Logic
Formulas, invited talk, Second BOL Conference, Taipei, Sept. 2007.