Ming Li Talk about Bioinformatics

Download Report

Transcript Ming Li Talk about Bioinformatics

Information Distance
More Applications
1. Information Distance from a
Question to an Answer
Question & Answer

Practical concerns:





Partial matching does not satisfy triangle
inequality.
When x is very popular, and y is not, x contains a
lot of irrelevant information w.r.t. y, then C(x|y) <<
C(y|x), and d(x,y) prefers y.
dmax does not satisfy universality.
Neighborhood density -- there are answers that
are much more popular than others.
Nothing to compress: a question and an answer.
Partial matching
Triangle inequality does not hold:
d(man,horse) ≥ d(man, centaur) + d(centaur, horse)
Separate Irrelevant Information

In max theory, we wanted smallest p, converting x,y:
x

p
Now let’s remove redundant information from p:
t
x

y
s
q
y
We now wish to minimize q+s+t.
The Min Theory
(Li, Int’l J. TCS, 2007, Zhang et al, KDD’2007)


Emin (x,y) = smallest program p needed to convert between x and
y, but keeping irrelevant information out from p.
Formally:
Emin (x,y) =min{|p| : U(x,p,r)=y, U(y,p,q)=x, |p|+|q|+|r| ≤ E(x,y) }
Fundamental Theorem II:
Emin (x,y) = min { C(x|y), C(y|x) }

All other development similar to E(x,y). Define:
dmin
min {C(x|y), C(y|x) }
(x,y) = ----------------------min {C(x),C(y)}
Other properties
Theorem 1. dmin(x,y) ≤ dmax(x,y)
Theorem 2. dmin(x,y)




is universal,
does not satisfy triangle inequality
is symmetric
has required density properties: good guys
have more neighbors.
How to approximate dmax(x,y), dmin(x,y)

Each term C(x|y) may be approximated by
one of the following:
1.
2.
3.
Compression.
Shannon-Fano code (Cilibrasi, Vitanyi): an object
with probability p may be encoded by –logp + 1
bits.
Mixed usage of (1) and (2) – in question and
answer application. This is especially useful for
Q&A systems.
Shannon-Fano Code



Consider n symbols 1,2, …, N, with decreasing
probabilities: p1 ≥ p2 ≥, … ≥ pn. Let Pr=∑i=1..rpi.
The binary code E(r) for r is obtained by
truncating the binary expansion of Pr at length
|E(r)| such that
- log pr ≤ |E(r)| < -log pr +1
Highly probably symbols are mapped to shorter
codes, and
2-|E(r)| ≤ pr < 2-|E(r)|+1
Near optimal: Let H = -∑rprlogpr --- the average
number of bits needed to encode 1…N. Then we
have
- ∑rprlogpr ≤ H < ∑r (-log pr +1)pr = 1 - ∑rprlogpr
Query-Answer System
X. Zhang, Y. Hao, X. Zhu, M. Li, KDD’2007
Adding conditions to normalized information
distance, we built a Query-Answer system.
 The information distance naturally measures




Good pattern matches – via compression
Frequently occurring items – via Shannon-Fano code
Mixed usage of the above two.
Some comparisons
MSN
Mean RAR
Top 1 Hit Rate
Yahoo
Google
LCC
Ask
BrainBoost
ASU
Start
DFKI
NSIR
AskED
QUANTA
0%
10%
20%
30%
40%
50%
60%
70%
80%
1. Benchmark is based on a common factoid QA test set of 109 questions
that are provided by Text Retrieval Conference (TREC) sponsored by
National Institute of Standard and Technology (NIST).
2. MRAR = (∑C(i)/i)/N = (C1 + C2*0.5 + C3*0.33 + C4*0.25 +
C5*0.2)/109
2. Parameter-Free Data Mining
(Keogh, Lonadi, Ratanamahatana, KDD’04)

Most data mining algorithms require setting
many input parameters. Parameter-laden
methods have 2 dangers:




Wrong parameter setting leads to errors.
Over fitting causes more problems.
Data mining algorithms should have ideally no
parameters – thus, no prejudices, expectations,
or presumptions.
Compared (a variant of) information distance
with every time serious distance (51 of them)
appeared in SIGKDD, SIGMOD, ICDM, ICDE,
VLDB, ICML, SSDB, PKDD, PADDD during the
previous decade.
Experiment on
18 pairs of time series
(length 1000 each):
Q = correct /total
• Inf. Dist. Q=1
• ¾ of measures, Q=0
• Best of them, HMM,
Q=0.33
• Data cover: finance,
science, medicine, industry
• Data: all in SAX format
Anomaly detection (KLR, KDD’04) – algorithm: use
divide and conquer to find a region that have large inf. dist. from other parts.
* All other methods produced wrong results – different from cardiologists.
Anomaly Detection
(KLR KDD’04)
3. Identifying Multiword Expressions
(Fan Bu, Tsinghua University)
Multiword expressions (MWEs) appear
frequently and ungrammatically in English.
 An MWE is a sequence of neighboring
words whose meaning cannot be derived
from the meaning of its components.
Example: Kolmogorov complexity
 Automatically identifying MWEs is a major
challenge in computational linguistics.

Distance from an n-gram to its semantics
Given an n-gram, let me define the
“semantics” of the n-gram to be the set of
all web pages containing all the words in
the n-gram.
 The (plain) information distance simplifies
to:
Dmax(n-gram g, its semantics) =
log (#pages(g)/#pages(g’s semantics))

Experiment on
1529 idioms, 1798 compositional phrases
Complex name entity extraction:
4. Texture classification (Campana & Keogh, 2010)
Why are we interested in this
Their method

Used, as information distance:
d(x,y) = [K(x|y) + K(y|x) ] / K(xy)

MPEG-1 format for images x, y. To do
K(x|y), they created a movie of a pair of
frames x and y. They used MPEG video
compression (lossy).
Note that the
algorithm has no
access to color or
shape information,
this clustering is
based only on
texture.
Dictionnaire D'Histoire Naturelle
by Charles Orbigny. 1849
The algorithm can
handle very subtle
differences.
Ornaments from the Hand-Press period. The particularity of this period is the use of block of wood to print the
ornaments on the books. The specialist historians want to record the individual instances of ornament occurrence
and to identify the individual blocks. This identification work is very useful to date the books and to authenticate
outputs from some printing-houses and authors. Indeed, numerous editions published in the past centuries do not
reveal on the title page their true origin. The blocks could be re-used to print several books, be exchanged
between the printing-houses or duplicated in the case of damage. Mathieu Delalandre
Egyptian
Clovis
5. Gene Expression Dynamics (Nykter et al,
PNAS, 2008)
Macrophage (white blood cells)
 94 microarrays, 9941 differentially
expressed genes.
 Computed Normalized information
distance between two time-point
measurements.
 Observed the underlying dynamical
network of the macrophage exhibits
criticality.
 Somebody figure out what this is about –
and present in class.

6. Tree from metabolic network (Nykter
et al, Phy. Rev. Lett. 2008)
Metabolic networks of 107 organisms from
KEGG.
 Normalized information distance is
computed between each pair and a tree
was generated.

Red – Bacteria
Blue – archaea
Green -- eukaryotes
Phylogenetic Compression (Ane-Sanderson, Syst.
Biol. 2005)

Several interesting questions from
bioinformatics:



Is the parsimony tree the best tree?
DNA sequence compression by sequence only
can be optimal?
The authors propose a scheme to encode
the tree and data simultaneously to
minimize the descriptive complexity of
tree(s) plus data.


Better compression
More economical than the parsimony tree.