talkGradsx - University of California, Riverside


Eamonn Keogh
Data Mining
What is data mining?
• Generally, data mining (sometimes called
data or knowledge discovery) is the
process of analyzing data from different
perspectives and summarizing it into
useful information.
• In my lab, we tend to look at data and
problems that no one else looks at.
Data Mining People
• Eamonn Keogh
• Vagelis Hristidis
• Vassilis Tsotras
• Chinya Ravishankar
• Michael Pazzani
• Christian Shelton (AI)
• Stefano Lonardi (Bioinformatics)
My PhD Students
• Jessica Lin (Ph.D. 2005, George Mason University)
• Chotirat (Ann) Ratanamahatana (Ph.D. 2005, Chulalongkorn University)
• Li Wei (Ph.D. 2006, Google)
• Xiaopeng Xi (Ph.D. 2007, Yahoo)
• Dragomir Yankov (Ph.D. 2008, Yahoo)
• Lexiang Ye (Ph.D. 2010, Google)
• Xiaoyue (Elaine) Wang (Ph.D. 2010, Nokia)
• Jin-Wien Shieh (Ph.D. 2010, Microsoft)
• Qiang Zhu (Ph.D. 2011, stumbleupon.com)
• Abdullah Mueen (Ph.D. 2012, Microsoft)
• Bilson Campana (Ph.D., going to Google at Xmas)
• Thanawin (Art) Rakthanmanon (Ph.D. ongoing)
• Bing Hu (Ph.D. ongoing)
• Yuan Hao (Ph.D. ongoing)
• Jesin Zakaria (Ph.D. ongoing)
• Yipeng Chen (Ph.D. ongoing)
[Figure: leaf outlines labeled stinging nettles and false nettles, with a Shapelet Dictionary (shapelet I, threshold 5.1) and a Leaf Decision Tree with a single node I, yes/no branches, and leaves 0 and 1]
Of course, this is a decision tree;
we eventually want to do
clustering.
However, in general, features that
are good for classification are
also good for clustering.
Decision Tree for Arrowheads
[Figure: arrowhead outlines labeled Clovis, Avonlea, and Mix; training data (subset)]
To do:
On a small labeled subset of data,
learn a dictionary of shapelets.
Code the large unlabeled dataset
with reference to that dictionary.
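The second step of the to-do above can be sketched as follows. This is a minimal illustration, not the lab's implementation: it assumes each object has already been converted to a one-dimensional series, that a shapelet is compared to a series by the smallest plain Euclidean distance over all same-length windows (no normalization), and that "coding" a series means recording its distance to every shapelet in the dictionary. The function names and the toy data are hypothetical.

```python
import math

def subsequence_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any
    same-length window of the series (assumed distance measure)."""
    m = len(shapelet)
    if m > len(series):
        raise ValueError("shapelet is longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, shapelet)))
        best = min(best, dist)
    return best

def encode_dataset(dataset, dictionary):
    """Code each series as its vector of distances to the shapelets."""
    return [[subsequence_distance(series, shapelet) for shapelet in dictionary]
            for series in dataset]

# Toy data (hypothetical): one shapelet and two unlabeled series.
dictionary = [[1, 2, 1]]
unlabeled = [[0, 0, 1, 2, 1, 0, 0], [0, 0, 0, 0]]
print(encode_dataset(unlabeled, dictionary))
```

The first toy series contains the shapelet exactly, so its code is a distance of zero; the second does not, so its code is larger. The resulting vectors are what a clustering algorithm would then consume.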
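[Figure: Shapelet Dictionary for arrowheads, with shapelet I (Clovis, threshold 11.24) and shapelet II (Avonlea, threshold 85.47) plotted over positions 0 to 400; Arrowhead Decision Tree with nodes I and II and leaves 0, 1, and 2]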
The shapelet decision tree classifier achieves an
accuracy of 80.0%, whereas the rotation-invariant
one-nearest-neighbor classifier achieves 68.0%.
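For reference, the rotation-invariant one-nearest-neighbor baseline can be sketched like this. It is only an illustration under stated assumptions: each shape is assumed to have been converted to a one-dimensional series in which rotating the shape corresponds to circularly shifting the series, and the distance is taken as the minimum Euclidean distance over all circular shifts. The function names and the toy training set are hypothetical.

```python
import math

def rotation_invariant_distance(a, b):
    """Minimum Euclidean distance between a and every circular shift of b."""
    a, b = list(a), list(b)
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    best = math.inf
    for shift in range(len(b)):
        rotated = b[shift:] + b[:shift]
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, rotated)))
        best = min(best, dist)
    return best

def nearest_neighbor_label(query, training):
    """Return the label of the training series closest to the query.
    training is a list of (series, label) pairs."""
    _, label = min(training,
                   key=lambda item: rotation_invariant_distance(query, item[0]))
    return label

# Toy training set (hypothetical values, real class names from the slide).
training = [([0, 1, 2, 3], "Clovis"), ([5, 5, 5, 5], "Avonlea")]
print(nearest_neighbor_label([2, 3, 0, 1], training))
```

Trying every shift costs time quadratic in the series length per comparison, which is one practical reason to prefer a small shapelet dictionary when the dataset is large.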
There now exist perhaps
tens of millions of digitized
pages of historical
manuscripts, dating back to
the 12th century, that feature
one or more heraldic shields.
The images are often
stained, faded or torn
Wouldn’t it be great if we could automatically
hyperlink all similar shields to each other?
For example, here we
could link two
occurrences of the Von
Sax family shield.
To do this, we need to
consider shape, color,
and texture. Let's just
consider shape for
now…
Manesse Codex
an illuminated manuscript
in codex form, copied and
illustrated between 1304 and 1340
in Zurich
Indexing and Mining Rock Art
Rock art is found on every continent
except Antarctica.
Australia may
have 100
million
examples
To date, computer science has had little
impact on analysis of rock art.
A decade ago, Walt et al. summed up the state
of petroglyph research by noting, “Complete-site
and cross-site research thus remains impossible,
incomplete, or impressionistic”
If we assume that we have
high-quality binary images of
rock art, then we can do
clustering, classification,
indexing, and motif discovery.
[Figure: example petroglyphs labeled Atlatls, Anthropomorphs, and Bighorn Sheep]
One challenge is designing
distance measures.
For example, we would like
to find two petroglyphs [images omitted]
similar,
even though one is solid and
one is hollow.
*Zhu, Wang, Keogh, Lee (2009). Augmenting the Generalized Hough Transform to Enable the Mining of Petroglyphs. SIGKDD 2009
Why Insects Matter I
Because they eat/destroy $40 billion+ worth of food each year
One Example Crop/Insect
Apple Maggot
Rhagoletis pomonella
Apple maggots cause two types of injury: dimpling and
tunneling. Dimpling occurs around the site where eggs are
laid, causing the flesh to stop growing, resulting in a
sunken, misshapen, dimpled area. Tunneling, done by the
larvae (maggots) eating in the fruit, causes the pulp to break
down, discolor, and start to rot. The tunnels are often
enlarged by bacterial decay. Damaged fruit eventually
becomes soft and rotten and cannot be used.
Surround WP Crop Protectant against
insects. Derived from kaolin clay, a
natural mineral, it forms a barrier that acts
to control insect pests.
Effective & safe, but very expensive
Carbaryl is an insecticide that
is widely used agriculturally.
Effective, but likely a human
carcinogen, and it kills honey
bees and other pollinators [1].
[1] http://npic.orst.edu/factsheets/carbgen.pdf
[2] http://www.maine.gov/agriculture/pesticides/gotpests/bugs/factsheets/apple-maggot-cornell.pdf
Why Insects Matter II
Because they kill over one million people each year
Our Sensor
One second of audio from our sensor.
The Common Eastern Bumble Bee
(Bombus impatiens) takes about one tenth
of a second to pass the laser.
[Figure: sensor signal, amplitude from -0.2 to 0.2 against sample index from 0 to 4.5 × 10^4, annotated: background noise; bee begins to cross laser; bee has passed through the laser]
[Figure: spectrum of the signal, Frequency (Hz) from 0 to 1000, with a peak at 705 Hz]
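A peak such as the one at 705 Hz can be located by scanning the spectrum for its largest magnitude. The sketch below is a minimal standard-library illustration, not the processing used with the actual sensor: the sample rate, the synthetic 705 Hz tone standing in for a recording, and the function name are all assumptions. In practice an FFT routine would replace the direct sum.

```python
import cmath
import math

def dominant_frequency(samples, sample_rate, candidates):
    """Return the candidate frequency (Hz) with the largest spectral magnitude,
    computed by direct correlation with a complex exponential."""
    def magnitude(freq):
        step = -2j * math.pi * freq / sample_rate
        return abs(sum(x * cmath.exp(step * n) for n, x in enumerate(samples)))
    return max(candidates, key=magnitude)

# Synthetic stand-in for the recording (assumed values): a 705 Hz tone
# lasting one tenth of a second, sampled at 8000 Hz.
sample_rate = 8000
n_samples = 800
signal = [0.1 * math.sin(2 * math.pi * 705 * n / sample_rate)
          for n in range(n_samples)]

# Scan 5 Hz to 1000 Hz in 5 Hz steps, matching the range of the plot.
peak = dominant_frequency(signal, sample_rate, range(5, 1001, 5))
print(peak)
```

A one-tenth-second window gives a frequency resolution of roughly 10 Hz, so a short insect passage limits how finely nearby frequencies can be separated.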
[Figure: Culex quinquefasciatus, Aedes aegypti, and Bombus impatiens plotted against Frequency (Hz) from 100 to 800; annotation: "Almost certainly an Aedes aegypti"]