Managing and mining smart meter data – at scale
CSE Project Showcase
9 July 2013
Twitter: @cse_bristol #SmartMeterData
Introduction
Contents
- Introduction to the project, the data, and its applications
- Managing SM data at scale
- Getting valuable knowledge out of SM data
- Demo: Smart Meter Analytics, Scaled by Hadoop (SMASH)
- Where next?
- Discussion
Introduction
Project Background
“Generating Value from Smart Electricity Meter Data”
18-month TSB-supported collaboration between
CSE, University of Bristol, SSE and Western Power Distribution
Three themes:
• Managing the data at scale
• Extracting useful knowledge
• Integrating the above in a user-facing application
Introduction
The data
A half-hourly timeseries for each smart meter / register
Content: date, time, consumption in the half hour.
For a single register: 17,520 records per year.
This is what 18 months look like:
Introduction
The data
EDRP:
• 18 months
• 16,250 smart metered households
• 16,250 smart electricity meters
• 9,364 smart gas meters
• 670m half-hourly records (E: 420m, G: 250m)
• 40GB of raw csv file data
Post rollout, per year, domestic only:
• 25m smart metered households
• 25m smart electricity meters
• 20m smart gas meters
• 800 billion half-hourly records (E: 450Bn, G: 350Bn)
• 50TB of raw csv file data
EDRP ~ 0.1% of a year’s domestic data
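The record counts above can be roughly reproduced from the half-hourly resolution alone. The sketch below assumes every meter reports all 48 half hours on all 365 days, so it gives upper bounds; the function name `records` is illustrative and not from the project.

```python
HALF_HOURS_PER_DAY = 48
DAYS_PER_YEAR = 365

def records(meters: int, months: int = 12) -> int:
    """Half-hourly records produced by `meters` registers over `months` months."""
    return meters * HALF_HOURS_PER_DAY * DAYS_PER_YEAR * months // 12

per_register = records(1)                    # 17,520 records per register per year

# EDRP: 18 months of electricity and gas data
edrp_elec = records(16_250, months=18)       # ~427m (slide: 420m)
edrp_gas = records(9_364, months=18)         # ~246m (slide: 250m)
edrp_total = edrp_elec + edrp_gas            # ~673m (slide: 670m)

# Post rollout, domestic only, one year
national_elec = records(25_000_000)          # 438bn (slide: 450Bn)
national_gas = records(20_000_000)           # ~350bn (slide: 350Bn)
national_total = national_elec + national_gas

# EDRP as a share of one year of national data, in percent
edrp_share = 100 * edrp_total / national_total

print(f"{per_register:,} records per register per year")
print(f"EDRP: {edrp_total:,} records")
print(f"National: {national_total:,} records per year")
print(f"EDRP share: {edrp_share:.2f}%")
```

The computed share comes out a little under 0.1%, consistent with the rounded figure on the slide.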
Introduction
What might we use it for?
Improve existing processes
• Settlement
• Billing, reconciliation, audit
• Demand profiling
• Customer profiling & segmentation
New processes not possible without HH data at scale
• Localised prediction
• Distribution network planning and modelling
• Automated DSM – prediction and verification
• System state detection
• Individualised consumer energy services
Introduction
What are the essential processes?
Ingestion – getting the data into the system
Storage – keeping it there securely
Analysis and reporting
• Ad-hoc queries
• Transaction reports
• Descriptives and summaries (e.g. OLAP)
• Mining and modelling
• Visualisation
Data management & processing
More fundamentally
Moving data between storage, memory and CPU
Transforming it in the CPU into desired forms
There are physical constraints on the speed of this.
(These are relevant at the scale of smart meter datasets).
Data management & processing
Single machine RDBMS
CPU (~ 2.5GHz)
  ↕ ~ 1000 MB/s
MEMORY (~10s of GB per machine)
  ↕ ~ 100 MB/s
STORAGE (~ 1TB per disk)
Using SQL Server to sum half-hourly consumption:
• 4 bn records: ~ 1 hour
• 40 bn records: ~ 10 hours
• 1 year’s worth: ~ 200 hours
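The three timings above are consistent with a roughly constant scan rate. A minimal sketch of that extrapolation, assuming runtime grows linearly with record count and taking one year of national data as the 800 billion records quoted earlier:

```python
SECONDS_PER_HOUR = 3600

# Observed on the slide: 4 billion records summed in about 1 hour
rows_per_second = 4_000_000_000 / SECONDS_PER_HOUR   # about 1.1 million rows/s

def hours_to_sum(rows: float, rate: float = rows_per_second) -> float:
    """Estimated runtime in hours, assuming a constant rows-per-second rate."""
    return rows / rate / SECONDS_PER_HOUR

hours_40bn = hours_to_sum(40e9)     # about 10 hours
hours_year = hours_to_sum(800e9)    # about 200 hours

print(f"{rows_per_second:,.0f} rows/s")
print(f"40 bn records: {hours_40bn:.0f} hours")
print(f"1 year (800 bn records): {hours_year:.0f} hours")
```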
Data management & processing
Single machine RDBMS
CPU (~ 2.5GHz)
  ↕ ~ 1000 MB/s
MEMORY (~10s of GB per machine)
  ↕ ~ 100 MB/s
STORAGE (~ 1TB per disk)
Problem: the throughput of a single machine has not kept up with the growth in the size of datasets.
Solution: harness multiple individual machines (‘horizontal scaling’).
Problem: this is difficult and expensive using traditional relational database applications.
Data management & processing
Solution
Move away from traditional databases and use a purpose-designed (‘big data’) framework to get horizontal scaling:
                      1 machine    10 node cluster    100 node cluster
Cost                  ~£10k        ~£50k              ~£300k
CPU                   2.5GHz       25GHz              250GHz
Memory throughput     1 GB/s       10 GB/s            100 GB/s
Storage throughput    100MB/s      1 GB/s             10 GB/s
Processing time       ~ a week     ~ a day            ~ an hour
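The week / day / hour figures are roughly what you get by dividing a year of raw data by the aggregate storage throughput. The sketch below makes that arithmetic explicit. It assumes the 50TB per year estimate from the introduction, decimal units, storage reads as the only bottleneck, and perfectly linear scaling across nodes; real clusters will fall somewhat short of this.

```python
BYTES_PER_YEAR = 50e12   # 50TB of raw csv per year (domestic, post rollout)

# Aggregate storage throughput in bytes per second, taken from the slide
STORAGE_THROUGHPUT = {
    "1 machine": 100e6,          # 100 MB/s
    "10 node cluster": 1e9,      # 1 GB/s
    "100 node cluster": 10e9,    # 10 GB/s
}

def scan_hours(total_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to read total_bytes once at the given throughput."""
    return total_bytes / bytes_per_second / 3600

scan_times = {
    name: scan_hours(BYTES_PER_YEAR, rate)
    for name, rate in STORAGE_THROUGHPUT.items()
}

for name, hours in scan_times.items():
    print(f"{name}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

One full pass takes about 139 hours (nearly six days) on a single machine, about 14 hours on 10 nodes and about 1.4 hours on 100 nodes, which matches the order of magnitude shown on the slide.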
Data management & processing
Hadoop
Designed to solve the problem of exponentially growing data volumes (originally, Google’s searchable copy of the web).
Harnesses a large number of commodity machines and low-cost networking and storage.
Software takes a job (query, calculation, whatever) and ‘maps’ it out across the cluster.
In parallel, each node locally processes a subset of the problem, before the results are ‘reduced’ back to a single dataset.
(Hence ‘Map/Reduce’)
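The map and reduce steps can be illustrated in a few lines of plain Python. This is a single-process sketch of the idea only, not Hadoop code: the records, the meter names and the helper functions `split`, `map_partition` and `merge` are all hypothetical, and the partitions stand in for blocks of data held on different nodes.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical records: (meter_id, half_hour_index, consumption_kwh)
records = [
    ("meter_a", 0, 0.25), ("meter_b", 0, 0.5),
    ("meter_a", 1, 0.75), ("meter_b", 1, 1.0),
    ("meter_a", 2, 0.5),  ("meter_c", 0, 2.0),
]

def split(data, n_nodes):
    """Deal the records out into n_nodes partitions, one per (pretend) node."""
    return [data[i::n_nodes] for i in range(n_nodes)]

def map_partition(partition):
    """'Map' step: sum consumption per meter using only this node's records."""
    local = defaultdict(float)
    for meter_id, _half_hour, kwh in partition:
        local[meter_id] += kwh
    return dict(local)

def merge(left, right):
    """'Reduce' step: combine two partial results into one."""
    merged = dict(left)
    for meter_id, kwh in right.items():
        merged[meter_id] = merged.get(meter_id, 0.0) + kwh
    return merged

# Each partition is processed independently, then the partial sums are merged.
partials = [map_partition(p) for p in split(records, n_nodes=3)]
totals = reduce(merge, partials, {})

print(totals)
```

In Hadoop itself the framework also handles distributing the data, grouping intermediate results by key and recovering from node failures; the sketch leaves all of that out.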
Data management & processing
Experiments: SQL Server
Single high performance machine: bottlenecked by the speed of the hard drive
[Figure: Aggregation query performance versus dataset size. Plotted: runtime in seconds and SQL rows/second, for datasets up to ~6,000,000,000 records (~400GB).]
Data management & processing
Experiments: Hadoop
11 node physical cluster (~£50k hardware cost)
[Figure: Aggregation query performance versus dataset size. Plotted: runtime in seconds and SMASH rows per second, for datasets up to 40,000,000,000 records (~2,500GB).]
Data management & processing
Experiments compared
Not straightforward to get SQL Server to run over ~ 10Bn records.
[Figure: Aggregation query performance versus dataset size. Plotted: runtime in seconds, SMASH rows per second and SQL rows/second, for datasets up to 40,000,000,000 records (~2,500GB).]
Data management & processing
Experiments: growing the cluster
Fixed dataset size of 500m records
[Figure: Aggregation query performance versus cluster size. Plotted: SMASH speed in records per second (0 to 1,200,000) against cluster size (1 to 11 nodes), with a fitted trend line, R² = 0.9148.]
Data management & processing
Hadoop
Pros
• Open source software – free and customisable
• Adjustable data redundancy (data is replicated over the cluster)
• Incrementally scalable – on both performance and cost measures: just add machines, system adapts automatically.
• Responsive and cooperative developer community
Cons
• Not the last word in user-friendliness (but this is changing)
• Sledgehammer to crack a nut below a certain scale
• Less mature (but rapidly developing) software ecosystem
• Algorithms must fit the framework
Conclusion: low cost option for smart meter data processing
Data mining and visualisation
Finding value in the data
Collaborative approach with industry partners to identify business needs
Focus on:
(1) Datamining for subgroup discovery – classifying end users
(2) Cluster analysis on demand data – finding profiles
(3) Innovative visualisation of consumption data and datamining results
Data mining and visualisation
Subgroup discovery
“Pattern features”: 14 variables describing each household
• Income, geography, access to gas, size of house, value of house etc.
“Target features”: describe the behaviour of interest
• Profile error: how different is usage from the assigned profile?
Outputs:
• Groups of households with significantly different profile errors
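To make the idea concrete, the sketch below computes a percentage profile error for a handful of made-up households and compares its average across one pattern feature. Everything here is an assumption for illustration: the slides do not define the error measure, so it is taken as the summed absolute difference between actual usage and the assigned profile scaled to the same total; the data and the `has_gas` feature are invented. Real subgroup discovery searches over combinations of features and tests for significance, which this does not attempt.

```python
from statistics import mean

def percent_profile_error(actual, profile_shape):
    """Assumed measure: total absolute deviation from the scaled profile, as % of usage."""
    total = sum(actual)
    shape_total = sum(profile_shape)
    expected = [total * share / shape_total for share in profile_shape]
    abs_diff = sum(abs(a - e) for a, e in zip(actual, expected))
    return 100.0 * abs_diff / total

# Hypothetical assigned profile shape (four periods instead of 17,520 half hours)
profile = [1.0, 1.0, 2.0, 4.0]

# Hypothetical households: one pattern feature plus their actual consumption
households = [
    {"has_gas": True,  "actual": [1.0, 1.0, 2.0, 4.0]},
    {"has_gas": False, "actual": [2.0, 2.0, 2.0, 2.0]},
    {"has_gas": True,  "actual": [2.0, 2.0, 4.0, 8.0]},
    {"has_gas": False, "actual": [4.0, 4.0, 4.0, 4.0]},
]

# Group the target feature (profile error) by the pattern feature
errors_by_group = {}
for household in households:
    error = percent_profile_error(household["actual"], profile)
    errors_by_group.setdefault(household["has_gas"], []).append(error)

mean_error = {group: mean(errors) for group, errors in errors_by_group.items()}
print(mean_error)
```

In this toy data the households without gas have flat usage that fits the assigned profile poorly, so they would emerge as a subgroup with a distinctly higher profile error.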
Data mining and visualisation
Subgroup discovery
Looking at % annual profile error against sociodemographics
Data mining and visualisation
Clustering
Can we use demand data to create better profiles?
Define target features: the properties of the waveform that are of interest
Two examples: using imposed and emergent properties.
Each using 3 clusters.
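As an illustration of grouping households into 3 clusters, here is a minimal k-means sketch in plain Python. The slides do not say which clustering algorithm was used, so k-means is an assumption, and the two-dimensional feature vectors (for example morning and evening consumption) are invented.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) on equal-length feature vectors."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical features per household: (morning kWh, evening kWh)
points = [
    [0.2, 0.3], [0.3, 0.2],   # low use throughout
    [1.0, 0.2], [1.1, 0.3],   # morning peak
    [0.2, 1.2], [0.3, 1.1],   # evening peak
]

# Start from one point of each apparent group so the run is deterministic
initial = [points[0], points[2], points[4]]
centroids, clusters = kmeans(points, initial)

for centroid, members in zip(centroids, clusters):
    print([round(c, 2) for c in centroid], len(members))
```

Each resulting centroid is a candidate demand profile: the average feature vector of the households assigned to it.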
Data mining and visualisation
Clustering
E.g. 1: the average weekday as 5 pairs of numbers
[Figure: consumption (not to scale) against time of day (half hours from midnight)]
Data mining and visualisation
Clustering
E.g. 2: Frequency spectrum of the demand timeseries
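A frequency spectrum can be obtained from a half-hourly series with a discrete Fourier transform. The sketch below uses a direct (slow) DFT from the standard library on a synthetic series with a single daily cycle; the series, its length and the function name `magnitude_spectrum` are assumptions for illustration, and a real analysis would use an FFT library on actual meter data.

```python
import cmath
import math

def magnitude_spectrum(series):
    """Magnitudes of the DFT coefficients 0..n/2, scaled by the series length."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series))) / n
        for k in range(n // 2 + 1)
    ]

# Synthetic demand: 4 days of half-hourly readings with one cycle per day
DAYS = 4
series = [1.0 + 0.5 * math.sin(2 * math.pi * t / 48) for t in range(48 * DAYS)]

spectrum = magnitude_spectrum(series)

# Index k counts cycles over the whole window, so a daily cycle shows up at k = DAYS.
# Index 0 is the mean level and is skipped when looking for the dominant cycle.
dominant = max(range(1, len(spectrum)), key=lambda k: spectrum[k])

print(f"mean level: {spectrum[0]:.2f}")
print(f"dominant component: {dominant} cycles in {DAYS} days")
```

The first few spectrum magnitudes could then serve as the feature vector for each household in the clustering step, capturing emergent periodic behaviour rather than imposed time-of-day bands.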
Data mining and visualisation
Cluster analysis
Project competition results (the University won)
[Figure: average % difference from the cluster centroid; vertical axis from 0.25 to 0.35]
Data mining and visualisation
Conclusions from datamining
Subgroup discovery results suggest the approach is useful as long as you have metadata on the households.
Cluster analysis work suggests it is possible to improve on the standard profile classes using SM data.
Further work needs to be carried out on more representative datasets.
There are many other potential applications!
The SMASH application
Web application
Installation of Hadoop on UoB and CSE clusters
• 11 node physical cluster at the university (£50k)
• 8 node virtual cluster at CSE (£15k)
Integration of a range of Hadoop-friendly data management components
Development of a proof-of-concept web application for user interaction, job management, visualisation etc.
Deployment on both clusters
The SMASH application
Web application
Currently running on the CSE virtual Hadoop cluster
Generating Value from SM Data
Where next?
We have a proof-of-concept system developed with TSB R&D funding support.
We have mastered the underlying technologies and established that this approach has the potential to be a low-cost solution to a number of industry data challenges.
On a technical level the next steps are to
• Further develop the web application
• Refine the datamining algorithms (with more data)
• Implement selected DM algorithms directly on the cluster
On a policy/programme level we want to ensure this knowledge is incorporated into SM rollout infrastructure decision making.
Questions and discussion
@cse_bristol
#SmartMeterData
Contacts:
Simon Roberts [email protected]
Joshua Thumim [email protected]
Web:
www.cse.org.uk
Sign up to our monthly e-news through our website
Follow us on Twitter @cse_bristol