Transcript shape

Using Shapes of Trends in
Active Data Mining
Duy Lam
Norris Boothe
Shape Querying and Active Data
Mining



Historical time sequences make up a large
portion of data stored in computers
Mining trends in histories useful
Many applications, including observing
trends in stock prices, online bids, and
rule mining
Overview



Overview of SDL
SDL language
Applications to data mining
A (Very) Simple History
Shape Definition Language



SDL is a shape definition language used to
query the “shapes” of histories
Small, powerful language that allows
“blurry” matching
Designed to make it easy and natural to
query


Easily implementable
Little non-determinism
Alphabet


SDL allows you to specify an “alphabet” defining transitions
Example:
Symbol
Description
up
Slightly increasing transition
Up
Highly increasing transition
down
Slightly decreasing transition
Down
Highly decreasing transition
appears
Transition from zero to non-zero
disappears
Transition from non-zero to zero
stable
The final value nearly equal to initial value
zero
Both initial and final value are zero
Describing a shape

So with this alphabet we can describe a
shape
(shape name(parameters) descriptor)

Use such a description to query a history
to produce all subsequences that match
the shape
(shape spike()
(concat Up up down Down))
Derived Shapes

any

allows a shape to have multiple values
(any up Up)

concat

shapes can be concatenated together
contiguously
(concat down up down up)
Multiple Occurrence Operators

Shapes made of multiple contiguous
occurrences of the same shape
(exact 5 (any up Up))
(atleast 3 stable)
(atmost 2 (concat disappear appear))

Resulting subsequences are such that they
are neither preceded nor followed by a
subsequence that matches P
Bounded Occurrence Operators

in

permits “blurry” matching by allowing users to state an overall
shape without specific details
(in 7 (nomore 5 up))

within the specified time period length, we can have a
specified number of occurrences of a shape
(precisely n P)
(noless n P)
(nomore n P)

can have arbitrary gaps and can have overlap
Bounded Occurrence Operators

inorder

specifies shapes that must appear in a specific
order
(inorder P1 P2 ... Pn)
Shape Definition Examples
(in 5 (and (noless 2 (any up Up))
(nomore 1 (any down Down))))
Shape Definition Examples
(in 7 (inorder (atleast 2 (any up Up))
(in 4 (noless 3 (any down Down))))))
Parameterized Shapes

Can parameterize shape definitions
instead of using concrete values
(shape spike(upcnt dncnt)
(concat (exact upcnt (any up Up))
(exact dncnt (any down Down))))
(shape doublepeak(width ht1 ht2)
(in width (inorder spike(ht1 ht1)
spike(ht2 ht2))))
Advantages of SDL




natural and powerful language for
expressing shape queries
capability of blurry matching
reduction of output clutter
efficient implementation
SDL’s Expressive Power


SDL is equivalent to regular expressions
for regular matching
several features enchance its
effectivesness, however

greedy matching and “lookahead” capabilities
help reduce output clutter
SDL’s Expressive Power


“blurry” matching enables a much more natural
and compact specification of certain shapes
For example, if we wanted precisely one
occurrence of each ai in any order


in SDL: (and (precisely 1 a1)
(precisely 1 a2)
...
(precisely 1 an))
regular expressions requires at least exponential size
to specify!
SDL Summary



SDL is a small, powerful language for
naturally and intuitively expressing shapes
found in histories
Equivalent in power to regular
expressions, but much more effective
Permits “blurry” matching
Using SDL in
Active Data Mining
Static Data Mining

Discovery of rules for





Associations
Sequences
Classification
Entire data set is mined
Inherent weakness: Rules are not static
Active Data Mining




Partition into time periods
Run data mining algorithm on each period
Gather rules into a ‘rulebase’
Create triggers to discover


Trends in rules
Associations between rules
Active Data Mining Process
Large
Data
Base
Period 1
Rules
Period 2
Rules
Rule
ID
History (support,
confidence, etc)
1
1
2
…
Period 3
Rules
2
…
n
Active Data Mining Process
(cont).
Rule
ID
History (support,
confidence, etc)
1
2
…
n
1
Selected
Rules
2
…
Trigger
Definition
Language
Active
Data
Mining
Shape
Definition
Language
Active Data Mining Components

Shape definitions (SDL)




(shape name(parameters) descriptor)
Ex:
(shape spike(upcnt dncnt)
(concat (atleast upcnt (any up Up))
(atleast dncnt (any down Down))))
Queries
Triggers
Queries


For rule selection
Syntax:




(query (shape (history-name start-time end-time)))
‘start’ and ‘end’ specify the end points of history
Result: rules that match the desired shape
Ex:
(shape ramp() (concat Up Up))
(query (ramp() (confidence start end)))
Larger Query Example
(shape upramp(len cnt)
(in len (noless cnt (any up Up))))
(shape dnramp(len cnt)
(in len (noless cnt (any down Down))))
(query (and
(upramp(5 3) (support start 10))
(dnramp(5 3) (confidence start 10))))
Results: rules where support is increasing but
confidence is decreasing
Triggers

Datastream type functionality
ECA (Event Condition Action) model used

Syntax:

(Chakravarthy et al. 1989)


(trigger trigger-name
(events events-spec)
(condition (shape history-spec))
(actions action-spec))
Events:


Rule creation
History updates
Wave Execution Semantics

Stratified execution of triggers – similar to
Datalog
Set of
Events
Triggers
for those
Events
Queries
for those
Triggers
Set of
Actions/
Events
Trigger Example

Identifying rules where support is
increasing, but confidence is decreasing
(trigger detect_up
(events updatehistory)
(condition (upramp 5 4) (support (- end 5) end)))
(actions upward))
(trigger detect_dn
(events upward)
(condition (dnramp 5 4) (confidence (- end 5) end)))
(actions notify))
Implementation

Implemented on AIX system


Part of IBM’s Quest project
Successfully tested:


Large set (5 years) of mail order data (2.9
million records)
Large set (3 years) of POS (point-of-sale)
transactions (6.8 million records)
Future Work

At time of paper…



Integrate constructs into a SQL relational system
Improve incremental computations using partial
results of current trigger queries
Since then…



Integrated into the Quest Data Mining System
Subsumed into IBM’s data mining products, including
Intelligent Miner
Referenced for work in Active Data Mining and
“blurry” pattern matching
References




“Querying Shapes of Histories”, by Rakesh Agrawal,
Giuseppe Psaila, Edward L. Wimmers, and Mohamed
Zait of the IBM Almden Research Center, 1995
“Active Data Mining”, by Rakesh Agrawal and Giuseppe
Psaila of the IBM Almden Research Center, 1995
“The Quest Data Mining System”, by Rakesh Agrawal,
Manish Mehta, John Shafer, and Ramakrishnan Srikant
of the IBM Almden Research Center in coordination with
Andreas Arning and Toni Bollinger of the IBM German
Software Laboratory, 1996
IBM Almden Research Center Website:
http://www.almaden.ibm.com/software/quest/