Database Issues in Smart Homes

Pervasive Intelligent
Environments
Spring 2004
March 2, 2004
CRESCENT
TCU Dept. of Computer Science
Topics: Lecture 3
• Preparing for prediction & decision
making: Data Mining/KDD
• An example of some of the issues
we’ve discussed
– “Towards Sensor Database Systems”,
Bonnet, Gehrke, Seshadri
Data mining material taken from
Elmasri & Navathe, 4th
edition
Data Warehouses
(1 more thing)
• Repositories for data mining activities
– Aggregates/summaries of data help efficiency
• Optimized for decision-support, not
transaction processing
• Definition (Elmasri, page 900)
– “A subject-oriented, integrated, non-volatile,
time-variant collection of data in support of
management’s decisions”
• Replace “management” with “smart home agents”
Data Mining Definition
• Discovery of new information in terms of patterns
or rules from vast amounts of data
• Extracts patterns that can’t readily be found by
asking the right questions (queries)
– TOO MUCH DATA FOR HUMANS
• Emerged from
– Artificial Intelligence: machine learning, neural nets,
genetic algorithms
– Statistics
– Operations Research
6 STEPS TO DM:
some may be done as part of warehouse creation
• Data selection -- pick the data needed
• Data cleansing
– Fix bad data (e.g., spelling, zip codes)
– Hard to deal with missing, erroneous, conflicting,
redundant data
• Enrichment
– Add data (e.g., age, gender, income)
• Data transformation
– Aggregate (e.g., zip codes → regions)
• Data mining
• Reporting on discovered knowledge (K)
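The six steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the lookup tables, and every function name are hypothetical, and the "mining" step is just a count per region and item.

```python
from collections import Counter

# Hypothetical raw records; every field and value is made up for illustration.
RAW = [
    {"name": "Ann", "zip": "76129", "item": "beer", "note": "x"},
    {"name": "Bob", "zip": "7612", "item": "diapers", "note": "y"},  # malformed zip
    {"name": "Cy", "zip": "10001", "item": "beer", "note": "z"},
    {"name": "Dee", "zip": "76109", "item": "beer", "note": "w"},
]
# Assumed lookup tables for the enrichment and transformation steps.
AGE = {"Ann": 34, "Cy": 51, "Dee": 28}
REGION_BY_ZIP_PREFIX = {"76": "Texas", "10": "New York"}

def select(rows):
    # 1. Data selection: keep only the fields needed.
    return [{k: r[k] for k in ("name", "zip", "item")} for r in rows]

def cleanse(rows):
    # 2. Data cleansing: here, simply drop rows whose zip code is not 5 digits.
    return [r for r in rows if len(r["zip"]) == 5 and r["zip"].isdigit()]

def enrich(rows):
    # 3. Enrichment: add an age attribute from an outside source.
    return [{**r, "age": AGE.get(r["name"])} for r in rows]

def transform(rows):
    # 4. Data transformation: aggregate zip codes into regions.
    return [{**r, "region": REGION_BY_ZIP_PREFIX.get(r["zip"][:2], "other")} for r in rows]

def mine(rows):
    # 5. Data mining: a stand-in that counts purchases per (region, item).
    return Counter((r["region"], r["item"]) for r in rows)

def report(counts):
    # 6. Reporting on the discovered knowledge.
    return [f"{region}: {item} x{n}" for (region, item), n in sorted(counts.items())]

def run_pipeline(rows):
    return report(mine(transform(enrich(cleanse(select(rows))))))
```

Calling `run_pipeline(RAW)` drops the record with the bad zip code before anything is counted, which is why cleansing sits ahead of mining.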
Types of results
• Association rules
– Buy diapers → buy lots of beer
• Sequential patterns
– Buy house → buy furniture within months
• Classification trees
– Types of buyers (upscale, bargain-conscious, …)
• Why do it?
– Make more money
– Science & medicine
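An association rule such as the diapers and beer example is usually scored by support and confidence. The slides do not define these for association rules, so the sketch below assumes the standard textbook definitions; the baskets are invented.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, lhs, rhs):
    """Of the baskets containing `lhs`, the fraction that also contain `rhs`."""
    lhs_support = support(baskets, lhs)
    if lhs_support == 0:
        return 0.0
    return support(baskets, set(lhs) | set(rhs)) / lhs_support

# Hypothetical market baskets, one set of items per purchase.
BASKETS = [
    {"diapers", "beer"},
    {"diapers", "beer", "chips"},
    {"diapers", "milk"},
    {"beer"},
]

rule_support = support(BASKETS, {"diapers", "beer"})
rule_confidence = confidence(BASKETS, {"diapers"}, {"beer"})
```

With these four baskets the rule "diapers → beer" holds in half of all baskets and in two of the three baskets that contain diapers.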
DM/KDD Goals
• Find patterns to predict future
events
• Find major groupings
– Groupings of buyers, stars, diseases …
• Find which group something belongs
to
– creditworthiness
What are we learning?
• Association rules
• Classification hierarchies
• Clustering
• Sequential patterns
• Patterns within time series
• Type of result, inputs & algorithms vary
• Often interested in some combination of
these types of knowledge (K)
Clustering
– Unsupervised learning techniques
• Training samples are unclassified
• Vs. supervised learning (classification)
– Drug categories for depression
– Categories of TV viewers
– Categories of buyers (likely, unlikely)
– Categories of households?
• Single male, mother/children, conventional
(M/D/kids), DINKs.
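As one concrete instance of unsupervised clustering, here is a minimal one-dimensional k-means sketch. The slides do not name an algorithm, so k-means is an assumption made for illustration; the data points and starting centers are made up.

```python
def kmeans_1d(points, centers, iterations=10):
    """Group unlabeled numbers around `len(centers)` cluster centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical unlabeled measurements, e.g. some per-household usage figure.
POINTS = [1, 2, 3, 10, 11, 12]
final_centers, final_groups = kmeans_1d(POINTS, centers=[0, 5])
```

No labels are supplied: the two groups emerge from the data alone, which is the contrast with supervised classification drawn above.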
Sequential patterns
• Detecting associations among events
with certain temporal relationships
• Example:
– Cardiac bypass for blocked arteries
– AND within 18 months, high blood urea
– THEN kidney failure likely in next 18
months
• Particularly important in smart homes
Sequential Pattern Discovery
• Sequence of itemsets
– Grocery store purchases by 1 person
(3 itemsets)
• {soy milk, bread, chocolate}, {bananas,
chocolate}, {lettuce, tomato, chocolate}
• 2 subsequences
– {soy milk, bread, chocolate}, {bananas, chocolate}
– {bananas, chocolate}, {lettuce, tomato, chocolate}
Sequential pattern discovery
• The support for a sequence S is the % of the given
set U of sequences of which S is a subsequence.
– That is: in what fraction of the sequences does S show up?
• Find all subsequences from the given sequence
sets that have a user-defined minimum support.
• The sequence S1, S2, … Sn is a predictor of the “fact”
that a customer who buys itemset S1 is likely to
buy itemset S2, then S3, …
• Prediction support based on frequency of this
sequence in the past
• Many research issues to create good algos
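The support definition above can be sketched in a few lines. One assumption is made here: S counts as a subsequence of a customer's sequence if its itemsets can be matched, in order, to itemsets that contain them (gaps allowed). Only the first customer sequence comes from the grocery example; the others are invented.

```python
def is_subsequence(s, t):
    """True if each itemset of s is contained in some itemset of t, in order."""
    pos = 0
    for itemset in s:
        # Advance to the next itemset of t that contains this itemset of s.
        while pos < len(t) and not set(itemset) <= set(t[pos]):
            pos += 1
        if pos == len(t):
            return False
        pos += 1
    return True

def sequence_support(s, sequences):
    """Fraction of the sequences in U of which s is a subsequence."""
    return sum(is_subsequence(s, t) for t in sequences) / len(sequences)

# U: one purchase sequence per customer.
U = [
    [{"soy milk", "bread", "chocolate"}, {"bananas", "chocolate"},
     {"lettuce", "tomato", "chocolate"}],                 # from the slide
    [{"bread"}, {"bananas", "chocolate"}],                # hypothetical
    [{"bananas"}, {"bread"}],                             # hypothetical
    [{"bread", "chocolate"}, {"milk"}, {"bananas"}],      # hypothetical
]

S = [{"bread"}, {"bananas"}]
s_support = sequence_support(S, U)
is_frequent = s_support >= 0.5   # 0.5 is a user-defined minimum support
```

The third customer buys the same items in the opposite order, so that sequence does not count toward the support of S.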
Patterns within time series
• Finding 2 patterns that occur over
time
– 2003 stock prices of Choice Homes and
Home Depot
– 2 products show same sales pattern in
summer but different one in winter
– Solar magnetic wind patterns may
predict earth atmospheric changes
Time series pattern discovery
• Time series are sequences of events
– Event could be a transaction (closing
daily stock price)
– Look at sequences over n days, or
– Longest period in which change is no
greater than 1%
• Comparing
– Must define similarity measures
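Two of the ideas above can be sketched briefly. The first function reads "longest period in which change is no greater than 1%" as a limit on each day-over-day change, which is an interpretation rather than something the slide states. The second uses Euclidean distance as one possible similarity measure. The prices are made up.

```python
import math

def longest_stable_run(prices, max_change=0.01):
    """Number of consecutive days in the longest run where every
    day-over-day relative change stays within max_change."""
    if not prices:
        return 0
    best = current = 1
    for prev, cur in zip(prices, prices[1:]):
        if abs(cur - prev) / prev <= max_change:
            current += 1
        else:
            current = 1          # the run is broken; start counting again
        best = max(best, current)
    return best

def euclidean_distance(a, b):
    """One possible similarity measure: smaller distance means more similar."""
    if len(a) != len(b):
        raise ValueError("series must cover the same number of days")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical daily closing prices.
PRICES = [100.0, 100.5, 101.0, 105.0, 105.2]
stable_days = longest_stable_run(PRICES)
```

Other measures (for example, comparing normalized series) would rank the same pairs differently, which is why the similarity measure has to be defined up front.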
Other approaches in DM/KDD
• Neural nets
– Infer a function from a set of examples
• Non-parametric curve-fitting
• Interpolates to solve new problems
– Supervised & unsupervised algorithms
– → classification
– → time-series
– → can’t see what it learned (not
declarative)
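To make "infer a function from a set of examples" concrete, here is a sketch of the smallest possible case: a single artificial neuron trained with the perceptron rule. The choice of the perceptron and of logical AND as the target function are assumptions made for illustration only.

```python
# Training examples for a target function the learner is never told directly
# (logical AND, chosen only for illustration).
EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def predict(weights, bias, x):
    """Fire (1) if the weighted sum of the inputs plus the bias is positive."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train(examples, epochs=20, rate=1):
    """Perceptron rule: nudge the weights whenever a prediction is wrong."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for x, target in examples:
            error = target - predict(weights, bias, x)
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias

weights, bias = train(EXAMPLES)
```

What comes out of training is a handful of numbers, not a readable rule, which illustrates the "not declarative" point above.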
Other approaches in DM/KDD
• Genetic algorithms
– Set up
• Representation (strings over an alphabet)
• Evaluation (fitness) function
• Parameters: # of generations, cross-over
rate, mutation rate, etc.
– Randomized (probabilistic operators),
parallel search over search space
– Used for problem solving and clustering
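The set-up items above map onto code fairly directly. In this sketch the representation is a bit string, the fitness function counts 1 bits, and the parameters are keyword arguments. Tournament selection and keeping the best individual each generation (elitism) are design choices added here, not taken from the slide.

```python
import random

def fitness(bits):
    """Evaluation function: number of 1 bits (a toy objective)."""
    return sum(bits)

def evolve(length=12, pop_size=20, generations=30,
           crossover_rate=0.7, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)   # seeded so that runs are repeatable
    # Representation: strings over the alphabet {0, 1}.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        next_pop = [max(pop, key=fitness)[:]]   # elitism (an added assumption)
        while len(next_pop) < pop_size:
            # Tournament selection: the fitter of two random individuals.
            a = max(rng.sample(pop, 2), key=fitness)
            b = max(rng.sample(pop, 2), key=fitness)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, length)   # one-point crossover
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

best = evolve()
```

Because the best individual is always carried forward, the best fitness can never drop from one generation to the next in this variant.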
Sensor DB Article
• Design
– Distributed vs warehouse approach
– Sensor data
• Measurement uncertainty, communications failures
• Data representation
• Data model
– Relational +
• Sensor descriptions, including location
– Special rep for sensor sequences
• ADT attribute represents sensor data as output of
ADT functions
Sensor DB Article: Queries
• Sample queries/characteristics (2nd page)
and sample extended SQL (3.1)
• Long running (continuous) queries
– An incremental query retrieves all data over a
t-second interval, repeats every t seconds, and
takes the union of the results
– WHERE $every() in SQL
• Aggregates over time windows
• Virtual joins for ADT (slow) functions
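The incremental-query and time-window ideas above can be simulated over an in-memory list. This is not the article's extended SQL: the tuple layout, function names, and readings are all hypothetical, and a real system would evaluate each interval as the data arrives rather than looping over stored readings.

```python
def incremental_query(readings, t, predicate, end_time):
    """Evaluate the query once per t-second interval and union the results."""
    results = []
    start = 0
    while start < end_time:
        # Readings that arrived in [start, start + t) and satisfy the query.
        results.extend(r for r in readings
                       if start <= r[0] < start + t and predicate(r))
        start += t
    return results

def windowed_average(readings, t, end_time):
    """Aggregate over time windows: mean value per t-second window."""
    averages = []
    start = 0
    while start < end_time:
        values = [value for ts, _, value in readings if start <= ts < start + t]
        averages.append(sum(values) / len(values) if values else None)
        start += t
    return averages

# Hypothetical readings: (timestamp in seconds, sensor id, temperature).
READINGS = [(0, "s1", 20.0), (3, "s1", 22.0), (5, "s2", 30.0),
            (7, "s1", 24.0), (12, "s2", 26.0)]

warm = incremental_query(READINGS, t=5, predicate=lambda r: r[2] > 21, end_time=15)
per_window = windowed_average(READINGS, t=5, end_time=15)
```

Each pass touches only one interval, so the union grows incrementally instead of being recomputed from scratch.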