Slide 1 - yzBigData

YZStack: Provisioning Customizable
Solution for Big Data
Sai Wu, Chun Chen, Gang Chen,
Lidan Shou, Ke Chen
Zhejiang University
Hui Cao,
He Bai,
yzBigData Co. Ltd.
City Cloud Technology
3H Problem in Deploying the Big
Data System
• How can I build and deploy a big data
system without background knowledge?
• How can I migrate existing applications to
the big data system?
• How can I use my big data system to do
the analysis job?
Too Many Choices
• Virtualization:
– OpenStack
– CloudStack
– VMware
• Cloud storage:
– key-value store (HBase, Cassandra, Redis, …)
– relational service (AWS, Spanner, …)
• Processing engine:
– MapReduce/Hadoop
– Dryad
– Pregel, GraphLab
– Spark
– epiC
• Application service:
– Mahout
– Hive
– SpatialHadoop
Can I Deploy a Big Data System
Like Installing Windows Software?
• Treat the installation as a customization
process
• The installer copies the binaries to all
servers and configures them automatically
• A browser-based management system
starts/stops the services and monitors their status
YZStack: the Architecture
• Layers are loosely connected
• Each layer includes many selectable modules
• Modules of different layers are linked via the common interfaces
• Optimizations are implemented as special plugins
[Figure: YZStack architecture; diagram labels, listed from top to bottom]
Applications: Hangzhou E-card, Smart Traffic, Analyzer for Power Grid, Green Hangzhou
SaaS: Plugins, Data Mining, OLAP Processing, OLTP Processing, Stream Processing, Visualization, Security Module
PaaS: Relational Transactional Engine, Relational Analytical Engine, Graph Engine, System Monitor, YZepiC, Data Importer/ETL tools
DaaS: Data Integration Module, Relational Data Service, Object Based Service, Key-Value Store, Distributed File System
IaaS: Cloud Virtual Server, Cloud Network, Optimization Modules, Cloud Storage
Features of YZStack
• Adaptive Image
– Based on OpenStack; each large image is partitioned
into small chunks
– Different images can share the same chunks
• Optimization Plugins
– Column-oriented plugin
– Index plugin
– Query optimization plugin
– Iterative job plugin
• Visualization Tool
– Zoom in/out for different dimensions
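The chunk sharing described under Adaptive Image can be illustrated with a small sketch. It assumes chunks are identified by a content hash, which the slides do not state; `ChunkStore`, `CHUNK_SIZE`, and the byte strings standing in for images are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the demo is easy to follow


class ChunkStore:
    """Stores each distinct chunk once, keyed by its SHA-256 digest (assumed scheme)."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes

    def put_image(self, data):
        """Split an image into chunks and return its manifest (list of digests)."""
        manifest = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # identical chunks are stored once
            manifest.append(digest)
        return manifest

    def get_image(self, manifest):
        """Rebuild an image from its manifest."""
        return b"".join(self.chunks[d] for d in manifest)


store = ChunkStore()
hadoop_image = b"BASEOS__HADOOP__"  # stand-in for a VM image
spark_image = b"BASEOS__SPARK___"   # shares the first two chunks with the image above
m1 = store.put_image(hadoop_image)
m2 = store.put_image(spark_image)
print(len(m1) + len(m2), "chunk references,", len(store.chunks), "chunks stored")
```

Both images reference eight chunks in total, but only six are stored because the common prefix is shared.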
Optimization Plugin
[Figure: two layers (Layer 1 and Layer 2), each with a Common Interface of Layer over Module A ... Module K, a Default Implementation, and hooks; Optimization Plugin 1 and Optimization Plugin 2 supply Customized Functions]
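One way to read the hook mechanism is sketched below: a module ships a default implementation, and a plugin replaces it with a customized function without changing the module's interface. All class, method, and hook names here are hypothetical and are not taken from YZStack.

```python
import bisect


class Module:
    """A module whose operations can be overridden through named hooks."""

    def __init__(self, name):
        self.name = name
        self.hooks = {}  # hook name -> customized function

    def run(self, hook, default, *args):
        # Use the customized function if a plugin installed one, else the default.
        return self.hooks.get(hook, default)(*args)


class KeyValueStore(Module):
    def get_range(self, data, lo, hi):
        return self.run("get_range", self._default_get_range, data, lo, hi)

    @staticmethod
    def _default_get_range(data, lo, hi):
        # Default implementation: full scan over all keys.
        return sorted((k, v) for k, v in data.items() if lo <= k <= hi)


class IndexPlugin:
    """Hypothetical index plugin: answers range queries by binary search."""

    def install(self, module):
        module.hooks["get_range"] = self.get_range

    def get_range(self, data, lo, hi):
        keys = sorted(data)  # a real index would be maintained incrementally
        start = bisect.bisect_left(keys, lo)
        stop = bisect.bisect_right(keys, hi)
        return [(k, data[k]) for k in keys[start:stop]]


store = KeyValueStore("kv")
data = {3: "c", 1: "a", 2: "b", 5: "e"}
before = store.get_range(data, 2, 4)  # default implementation
IndexPlugin().install(store)
after = store.get_range(data, 2, 4)   # customized function via the hook
print(before == after)
```

The caller keeps using `get_range` in both cases; only the code behind the hook changes.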
Use Case: the Smart Financial System
• Built for the Zhejiang Provincial Department of
Finance (ZPDF)
[Figure: architecture of the smart financial system; diagram labels, listed from top to bottom]
Visualization Tool; OLAP Module; Data Mining Module
Relational Analytical Engine: SQL Query Parser, Query Optimizer, Query Engine
Monitor Plugin; YZepiC; Security Plugin; Index Plugin
Relational Data Service: Schema, Metadata, Table File, Data Statistics, Tablet File ... Tablet File
Data Importer
Distributed File System
Virtual Server, Virtual Server, ... Virtual Server
Data sources: Traffic, Human, Electronic, Tax, Energy, Environment
Economic Prediction
• In collaboration with researchers from the College of Economics,
Zhejiang University
• Step 1:
– Use the OLAP module to provide a basic view for each
registered company
Economic Prediction (cont.)
• Step 2:
– Healthy Model: Based on the historical data, the
healthy model discovers risks and predicts the prospects
of an industry
– Energy Consumption Model: We link the financial
data with the electronic, water, and environment data
to rank each industry based on its energy
consumption per unit of output value.
– Economic Impact Model: By connecting the financial
data to the human resource data, we study how many
workers are employed in an industry and their
average salary
– Combine all three models to rank all industries
accordingly
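A minimal sketch of the ranking in Step 2 follows. The figures are invented, the health and impact scores are placeholders for the outputs of the other two models, and combining the models by summing per-model ranks is an assumption; the slides do not say how the three models are combined.

```python
# All figures below are made up for illustration.
energy_use = {"textile": 500.0, "software": 50.0, "chemicals": 800.0}
output_value = {"textile": 1000.0, "software": 1000.0, "chemicals": 1000.0}
health_score = {"textile": 0.6, "software": 0.9, "chemicals": 0.4}  # placeholder model output
impact_score = {"textile": 0.8, "software": 0.5, "chemicals": 0.3}  # placeholder model output


def rank(scores, higher_is_better):
    """Map each industry to its rank (1 = best) under one model."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Energy Consumption Model: energy consumption per unit of output value.
energy_per_output = {name: energy_use[name] / output_value[name] for name in energy_use}

energy_rank = rank(energy_per_output, higher_is_better=False)
health_rank = rank(health_score, higher_is_better=True)
impact_rank = rank(impact_score, higher_is_better=True)

# Assumed combination rule: the smallest rank sum wins.
rank_sum = {name: energy_rank[name] + health_rank[name] + impact_rank[name]
            for name in energy_use}
overall = sorted(rank_sum, key=rank_sum.get)
print(overall)
```

With these numbers the rank sums are 4, 5, and 9, so the overall order is software, textile, chemicals.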
Economic Prediction (cont.)
• Step 3: Economic Index (ongoing work)
– Predict the status of the whole Zhejiang
Province using statistics generated by
the previous two steps
– Involves multiple complex economic models
– Our economic researchers are using the
visualization tools to build and study their
models
Detection of Improper Payment
• What is an improper payment?
– A person is classified as low-income and
buys a house reserved for low- and medium-wage
earners. However, he is actually employed by an IT
company.
– A company may submit different registration files to
different government departments (e.g., it registers as
a high-tech company with the Department of Science,
but as a labor-intensive one with the Department of
Labor) to receive multiple allowances from the
government.
Why ZPDF?
• A hub of financial data in Zhejiang
Province
– Electronic department
– Traffic department
– Tax department
– …
• It is well motivated
– Expected to save more than 1 billion CNY
Improper Payment
• Step 1 (Consistency Problem):
– To detect improper payments across two databases, D0
and D1, we first generate two star-join queries, Q0 and Q1,
which selectively merge the fact tables with the
dimension tables.
– The trick is that the entities returned by Q0 should not
exist in the results of Q1.
– E.g., Q0 returns the high-income persons, while Q1
returns the users who own a house reserved for low- and medium-wage earners.
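The Q0/Q1 check can be sketched with two in-memory SQLite databases. The schemas, rows, and income cutoff are invented for illustration, and matching entities by exact name is a simplification made only for this sketch.

```python
import sqlite3

# D0: hypothetical tax database (schema and rows are invented).
d0 = sqlite3.connect(":memory:")
d0.executescript("""
    CREATE TABLE dim_person   (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_employer (employer_id INTEGER PRIMARY KEY, sector TEXT);
    CREATE TABLE fact_income  (person_id INTEGER, employer_id INTEGER, annual_income REAL);
    INSERT INTO dim_person   VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO dim_employer VALUES (10, 'IT'), (11, 'Retail');
    INSERT INTO fact_income  VALUES (1, 10, 300000), (2, 11, 40000);
""")

# D1: hypothetical housing database (schema and rows are invented).
d1 = sqlite3.connect(":memory:")
d1.executescript("""
    CREATE TABLE dim_owner      (owner_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_house      (house_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_ownership (owner_id INTEGER, house_id INTEGER);
    INSERT INTO dim_owner      VALUES (7, 'Alice'), (8, 'Bob'), (9, 'Carol');
    INSERT INTO dim_house      VALUES (100, 'subsidized'), (101, 'commercial'), (102, 'subsidized');
    INSERT INTO fact_ownership VALUES (7, 100), (8, 102), (9, 101);
""")

# Q0: star join over D0 returning high-income persons (the cutoff is an assumption).
Q0 = """
    SELECT p.name, e.sector
    FROM fact_income AS f
    JOIN dim_person   AS p ON p.person_id   = f.person_id
    JOIN dim_employer AS e ON e.employer_id = f.employer_id
    WHERE f.annual_income > 200000
"""

# Q1: star join over D1 returning owners of subsidized houses.
Q1 = """
    SELECT o.name
    FROM fact_ownership AS f
    JOIN dim_owner AS o ON o.owner_id = f.owner_id
    JOIN dim_house AS h ON h.house_id = f.house_id
    WHERE h.category = 'subsidized'
"""

high_income = {row[0] for row in d0.execute(Q0)}
subsidized_owners = {row[0] for row in d1.execute(Q1)}

# Entities returned by Q0 should not appear in Q1; any overlap is suspicious.
suspicious = high_income & subsidized_owners
print(suspicious)
```

Here the overlap is the single entity that is both high-income and an owner of a subsidized house.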
Consistency Problem
• We apply LSH (Locality-Sensitive Hashing) to generate k hash
values for each tuple from T0 and T1.
• Tuples sharing the same hash value are considered a
candidate group.
• We define a similarity function sim(ti, tj) to evaluate the probability that
two tuples represent the same entity. If sim(ti, tj) is greater than a
predefined threshold, the pair is forwarded to the verification module,
where a human-aided algorithm is applied to filter out the false
positives.
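The candidate-generation step can be sketched as follows. The slides do not name the LSH family or the similarity function, so this sketch assumes MinHash over character trigrams and Jaccard similarity; the records, `k`, and the threshold are hypothetical.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Character n-grams of a whitespace-stripped, lower-cased record string."""
    s = "".join(text.lower().split())
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}


def _hash(seed, token):
    digest = hashlib.blake2b(f"{seed}:{token}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def lsh_values(text, k):
    """k MinHash values for one tuple (one per seeded hash function)."""
    tokens = shingles(text)
    return [min(_hash(seed, t) for t in tokens) for seed in range(k)]


def sim(ti, tj):
    """Jaccard similarity of the shingle sets (assumed similarity function)."""
    a, b = shingles(ti), shingles(tj)
    return len(a & b) / len(a | b)


def candidate_pairs(t0, t1, k=8):
    """Pairs (i, j) whose tuples share a hash value in the same position."""
    buckets = defaultdict(lambda: (set(), set()))
    for side, tuples in enumerate((t0, t1)):
        for idx, text in enumerate(tuples):
            for pos, value in enumerate(lsh_values(text, k)):
                buckets[(pos, value)][side].add(idx)
    pairs = set()
    for left, right in buckets.values():
        pairs.update((i, j) for i in left for j in right)
    return pairs


def to_verify(t0, t1, threshold=0.6, k=8):
    """Candidate pairs above the threshold, to be passed on for verification."""
    return sorted((i, j) for i, j in candidate_pairs(t0, t1, k)
                  if sim(t0[i], t1[j]) > threshold)


T0 = ["Zhang Wei 1980-02-11 Hangzhou"]
T1 = ["Zhang Wei 1980-02-11 Hangzhou", "Li Na 1975-09-30 Ningbo"]
print(to_verify(T0, T1))
```

Only pairs that land in a common bucket are compared with `sim`, which avoids comparing every tuple of T0 with every tuple of T1.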
[Figure: two star schemas, each a Fact Table with its Dimension Tables, produce Candidate Groups, which are passed to Verification]
Conclusion
• YZStack is tailored for users who have little or no
experience in deploying and maintaining a cloud
system.
• It reduces the development of a new big data
application to a process of module selection and
customization.
• To show the flexibility and usability of YZStack, we
demonstrate how we built a smart financial system for
the Zhejiang Provincial Department of Finance using
YZStack.