The Analysis of Patterns

Download Report

Transcript The Analysis of Patterns

The Analysis of Patterns
Nello Cristianini
The Value of Patterns
• Patterns are
everywhere, and
people have always
been fascinated by
them.
• Detecting patterns
confers an advantage
to an organism
Temperature and Rainfall in Lake Shasta over 5 years
Patterns Help Us in Many Ways…
e.g., compress, predict, remove errors
Benefits of Detecting Patterns
Patterns and Intelligence
• We care so much
about pattern finding
skills, that we even
use them (partly) to
quantify intelligence…
•
The Instinct for Patterns
• We see patterns everywhere
• Even where there are no patterns:
Patterns and Randomness
• We are poorly equipped to deal with
randomness:
• 3.14159265358979323846264338327
9502884197169399375105820974944
5923078164062862089986280348253
421170679...
• In first million digits we see Erice’s ZIP
code 11 times…
• Does it mean anything?
Patterns and Randomness
•
•
•
•
•
•
•
•
•
ABRAHAM LINCOLN WAS ELECTED TO
CONGRESS IN 1846.
JOHN F. KENNEDY WAS ELECTED TO
CONGRESS IN 1946.
•
ABRAHAM LINCOLN WAS ELECTED PRESIDENT
IN 1860.
JOHN F. KENNEDY WAS ELECTED PRESIDENT
IN 1960.
•
THE NAME LINCOLN AND KENNEDY EACH
CONTAIN SEVEN LETTERS.
BOTH WIVES LOST CHILDREN WHILE LIVING IN
THE WHITE HOUSE.
•
•
•
•
BOTH PRESIDENTS WERE SHOT ON FRIDAY.
BOTH WERE SHOT IN THE HEAD.
•
BOTH SUCCESSORS WERE NAMED JOHNSON.
•
•
•
•
•
ANDREW JOHNSON, WHO SUCCEEDED
LINCOLN, WAS BORN IN 1808.
LYNDON JOHNSON, WHO SUCCEEDED
KENNEDY, WAS BORN IN 1908.
JOHN WILKES BOOTH, REPORTEDLY
ASSASSINATED LINCOLN.
LEE HARVEY OSWALD, REPORTEDLY
ASSASSINATED KENNEDY.
BOTH ASSASSINS WERE KNOWN BY THREE
NAMES.
BOTH NAMES CONTAINED FIFTEEN LETTERS.
BOOTH AND OSWALD WERE ASSASSINATED
BEFORE THEIR TRIALS.
Coincidences ?
Visualizing Patterns
• We are naturally equipped to detect CERTAIN
types of patterns, not others…
5.9400
24.1700
19.1100
5.8900
23.4400
20.5000
9.1700
22.6700
8.6100 12.2800 11.6100 20.2800 23.8300 25.8300 27.3900
19.0600 11.1700 8.8900 8.3300 7.8900 12.1100 15.3900
24.6100 28.1100 25.7800 23.1100 16.8400 13.1700 8.4400
10.8300 12.1100 15.7800 18.8300 26.5600 27.5600 25.0000
15.5600 10.7200 7.1700 7.8300 11.1700 9.7800 14.9400
23.3300 27.8300 29.2200 25.1100 20.6700 12.8900 11.8900
9.8300 14.2800 18.5000 19.0000 26.3900 29.6100 26.7200
20.3900 13.8900 8.8900
Finding Patterns
• We are naturally interested in finding relations in
data.
• We are naturally ill-equipped in dealing with
randomness.
• We have developed sophisticated technology to
do this for us.
• In last decade we have made one more step…
• As a society we rely on pattern discovery
technology, in many ways
Computational Pattern Finding
•
•
•
•
•
We want to find relations
They need to be reliable
They need to be explored efficiently
We want to do it automatically
On MASSIVE amounts of data
The Analysis of Patterns
• Data driven approach to
• Science
• Business
• Technology
• Modern society relies on our capability to
automatically detect reliable patterns in
vast sets of data
The Analysis of Patterns
• Science:
– The Genome Project
– Surveys of the Universe
• Business:
– Amazon automatically exploiting trends and relations
in transactions database
– Fraud Detection in Credit Card Companies
• Technology:
– Voice recognition
– Handwriting recognition
A Scientific Gold Rush
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
1
61
121
181
241
301
361
421
481
541
601
661
721
781
841
901
961
1021
1081
1141
GATCACAGGT
CGTCTGGGGG
GCAGTATCTG
ACAGGCGAAC
ACAATTGAAT
AACCCCCCCC
AAACAAAGAA
ACTTTTAACA
ATCTCATCAA
TACCCCGAAC
AAGCAATACA
TCCTAGCCTT
GTTCACCCTC
TCAAAACGCT
AAACGAAAGT
GCGGTCACAC
CCTCCCCAAT
ACTACGAAAG
GATACCCCAC
AACACTACGA
CTATCACCCT
GTGTGCACGC
TCTTTGATTC
ATACCTACTA
GTCTGCACAG
TCCCCCCGCT
CCCTAACACC
GTCACCCCCC
TACAACCCCC
CAACCAAACC
CTGAAAATGT
TCTATTAGCT
TAAATCACCA
TAGCCTAGCC
TTAACTAAGC
GATTAACCCA
AAAGCTAAAA
TGGCTTTAAC
TATGCTTAGC
GCCACAGCTT
ATTAACCACT
GATAGCATTG
CTGCCTCATT
AAGTGTGTTA
CCGCTTTCCA
TCTGGCCACA
AGCCTAACCA
AACTAACACA
GCCCATCCTA
CCAAAGACAC
TTAGACGGGC
CTTAGTAAGA
CGATCAAAAG
ACACCCCCAC
TATACTAACC
AGTCAATAGA
CTCACCTGAG
ATATCTGAAC
CCTAAACCTC
AAAACTCAAA
CACGGGAGCT
CGAGACGCTG
CTATTATTTA
ATTAATTAAT
CACAGACATC
GCACTTAAAC
GATTTCAAAT
TTATTTTCCC
CCCAGCACAC
CCCCCACAGT
TCACATCACC
TTACACATGC
GGACAAGCAT
GGGAAACAGC
CCAGGGTTGG
AGCCGGCGTA
TTGTAAAAAA
ACACAATAGC
AACAGTTAAA
GGACCTGGCG
CTCCATGCAT
GAGCCGGAGC
TCGCACCTAC
GCTTGTAGGA
ATAACAAAAA
ACATCTCTGC
TTTATCTTTA
CTCCCACTCC
ACACACCGCT
TTATGTAGCT
CCATAAACAA
AAGCATCCCC
CAAGCACGCA
AGTGATTAAC
TCAATTTCGT
AAGAGTGTTT
CTCCAGTTGA
TAAGACCCAA
TCAACAAAAC
GTGCTTCATA
TTGGTATTTT
ACCCTATGTC
GTTCAATATT
CATAATAATA
ATTTCCACCA
CAAACCCCAA
GGCGGTATGC
CATACTACTA
GCTAACCCCA
TACCTCCTCA
ATAGGTTTGG
GTTCCAGTGA
GCAATGCAGC
CTTTAGCAAT
GCCAGCCACC
TAGATCACCC
CACAAAATAG
ACTGGGATTA
TGCTCGCCAG
TCCCTCTAGA
Yeast protein interaction map (Barabasi)
Another Gold Rush
• The world wide web contains billion of
pages, with text, images, data…
• Semantic web, XML-based, provides high
quality annotated information…
• Soon all books ever written will be in
digital form
• Are we ready?
2001
2004
2005
The Analysis of Patterns
• Traditionally, the role of analyzing data
belongs to Statistics.
• Or: does it ?
• Data analysis performed by physicists,
biologists, engineers… each with their own
set of tools.
• Even the task of making or validating
these tools is not just part of statistics.
The Analysis of Patterns
•
•
•
•
•
•
•
Signal processing
Data mining
Information retrieval
Pattern recognition (*)
Pattern matching
Machine Learning
…
The Analysis of Patterns
• Pattern Recognition
– Syntactical / Structural
– Statistical
– Visual
• Pattern Discovery vs Pattern Matching
– In sequences
– In graphs
– In images
The Analysis of Patterns
• Grammatical Inference
• Mining for Association Rules
• Patterns in Vector Data
(classical multivariate statistics;
neural networks; machine learning; etc)
• Etc, Etc
The Analysis of Patterns
• Many communities working almost
independently
• Occasionally re-discovering the same
things
• A small and fairly stable set of ideas
– Efficient search for patterns in data
– Statistical validation issues
– Pattern visualization
• Often same tools and concepts
Searching for Patterns
• The search problem can be framed within
Operational Research / Optimization.
– (e.g., Integer Programming, Convex Programming,
etc…)
• Many key ideas from exact optimization have
revolutionized this field in recent years
• Where exact solution are theoretically
impractical (and only then!) we can use
approximations, then heuristic approaches.
• Again: same heuristics appear in many fields
Statistics
• How do we know that a relation found in a
finite set of data is reliable, or significant,
or even interesting?
• Many issues of hypothesis testing
• Classical statistics vs statistical learning
theory
What Are Patterns?
• This is a rather difficult question to answer.
I hope we will have an answer by the end
of this meeting.
• I encourage all speakers and participants
to suggest some definitions.
Gregory Chaitin: "Patterns,
Randomness and Information"
• Information, Complexity, Patterns, Randomness
and Compression.
• What are regularities in data? How can they be
defined? And quantified?
• Predictability and Compressibility are connected.
• Randomness can be defined in algorithmic
ways.
Gregory Chaitin
• Chaitin will explain what it means that a
sequence “has no pattern”, and some far
reaching consequences
• ideas can be traced back through
Hermann Weyl to Leibniz in 1686,
• connect with Godel & Turing
• the question of how math compares &
contrasts with physics and with biology
Patterns in Sets of Points (Vectors)
• Probably the most
developed part of
pattern analysis
• Includes much
multivariate stats,
much statistical
pattern recognition
(e.g., Duda and Hart)
and machine learning
Tijl De Bie:
Patterns in Sets of Points
• Patterns in sets of points: an overview
–
–
–
–
the role of optimization
examples of patterns
Dimensionality reduction, classification, clustering
Emphasize linear patterns (connect to later kernels talk)
• Patterns in sets of points: the myriad virtues of eigenproblems

– the eigenvalue problem.
– principal component analysis, canonical correlation analysis, Fisher's
discriminant, partial least squares, and spectral clustering.
– More from thiss area will be covered in Kernel Methods’ talk
Patterns in Sequences
• After vectors,
probably the most
important type of
data:
• DNA
• Text (web)
• How to find patterns
within and among
sequences?
• What data structures?
What statistical
models?
Esko Ukkonen
Suffix Tree and Hidden Markov
techniques for pattern analysis
• Efficient Pattern Discovery in sequences
requires appropriate data structures
– Suffix tree construction.
– linear time array constructions
– using suffix trees for finding motifs with gaps
– finding cis-regulatory motifs by comparative
genomics
– Hidden Markov techniques for haplotyping
Dan Gusfield
Trees, Arrays, Networks and Optimization
for Finding Patterns in Biological Sequences
a) The use of suffix trees and integer
programming for finding optimal virus
signatures.
b) A current treatment of suffix-arrays and their
uses.
c) Algorithms for finding signatures (patterns) of
historical recombination and gene-conversion
in SNP (binary) sequences.
Raffaele Giancarlo
Patterns and Compression
• Patterns are not just necessary for prediction,
they are also needed for data compression.
• Many relations between PA and Data
Compression.
• Raffaele Giancarlo (University of Palermo) - On
Indexing and Compression: Two Sides of the
Same Coin
Conceptual Foundations
• Alberto Apostolico (University of Padova
and Georgia Tech) - "Algorithmic and
Combinatorial Foundations of Pattern
Discovery"
• Will discuss various aspects of the
interplay between algorithmics and
statistics, as well as the notion itself of
pattern.
Kernel Methods
• An idea: if we are so good at finding
(linear) patterns in sets of points…
• Why not transforming all other problems
into a points problem?
• Good idea…
• Kernels Methods (from machine learning)
can do this “automatically” …
Bernhard Schoelkopf
Kernel Methods
• Kernel methods: combine ideas from
statistics and optimization
• State of the art machine learning systems
• Operate on general types of data
• Work by embedding data into a euclidean
space
• The structure of the space determined by
choosing a special “kernel” function…
• KMs connect various aspects of PA…
Patterns in Sets
• The most classic textbook example of data
mining:
You shop at the supermarket, and the market-basket
contents are recorded by the computer system at
check-out…
Discover when some items are associated, when it is
possible to predict your next purchase, etc…
• (this is what Amazon does automatically…)
Heikki Mannila:
Finding frequent patterns
•
•
•
Part I: Finding frequent patterns from data
Discovery of frequent patterns = finding positive conjunctions that are true
for a given fraction of the observations
this basic idea can be instantiated in many ways:
– finding frequent sets from 0/1 data (association mining)
– finding frequent episodes in sequences
– finding frequent subgraphs in graphs etc.
•
•
efficient algorithms exist -- the levelwise approach
theoretical analysis of the algorithms is not trivial (leads to connections to
hypergraph transversals etc.)
•
•
•
•
•
Part II: how can the patterns be used?
sometimes interesting in themselves
can be used to approximate the joint distribution
maximum entropy approaches
combining information from several patterns - ordering patterns
When can we trust the patterns we
found?
• Statistical issues:
– Patterns can be the result of chance
– Multiple testing increases this risk
– Small samples, interest in weak patterns, etc… are
other factors
• Statistical learning theory and Classical statistics
have developed tools to deal with this
• These criteria can also guide the search
algorithms towards more reliable patterns…
John Shawe-Taylor
Statistical Aspects of Pattern Analysis
• We want significant / reliable patterns
• Reliable: give us predictive power
• Significant: cannot be explained by chance
• Factors affecting pattern reliability:
– Pattern magnitude (how strong is the relation)
– Sample size (how large is the support from data?)
– Multiple testing (how many other patterns have been tested at
the same time)
• This translates into classical machine learning and
statistics themes
Nicolo' Cesa-Bianchi
On-line linear learning algorithms
• Machine learning has various ways to model the
pattern discovery process.
• An approach completely different from classical
statistics: on-line learning.
–
–
–
–
–
–
Prediction with expert advice.
Learning with linear experts.
The Perceptron algorithm and its extensions.
On-line learning with kernels.
Mistake bounds.
From mistake bounds to risk bounds
Grammatical Inference
• Very classical theme in pattern
recognition, based on Chomsky’s theory of
formal languages and grammars:
• Given a finite sample from a language,
infer the grammar that generates it (with
various constraints).
• A child’s game…
Colin de la Higuera : "Grammatical
Inference, a Tutorial"
• The lectures will introduce the key ideas of
grammatical inference and concentrate specially
on the algorithmic aspects.
• Some algorithms that will be described are:
– The "State merging" family : Gold, Rpni, Edsm...
– The "Window" languages : Local and k-testable
– Learning with queries.
•
• This class of approaches often goes under the
name “Syntactical Pattern Recognition”
Patterns in Graphs
• Edwin Hancock (University of York, UK) ``Pattern Analysis with Graphs and Trees'‘
–
–
–
–
–
Spectral representations of graphs,
Pattern spaces from graph spectra,
Spectral approaches to matching,
Heat kernel methods
Probabilistic and spectral methods for graph matching
and clustering.
– Applications in computer vision.