
An Energy-Aware File Relocation Strategy
Based on File-Access Frequency and Correlations
Cheng Hu, Yuhui Deng
Department of Computer Science, Jinan University
Data Storage and Cluster Computing Lab
http://dsc.jnu.edu.cn
ICA3PP 2015: The 15th International Conference on Algorithms and Architectures for Parallel
Processing. Zhangjiajie, China, November 18-20, 2015
Agenda
1. Motivation
2. Related work
3. Our strategy
4. System design
5. Evaluation
6. Conclusion
Motivation
The Explosive Growth of Data → Large Data Center
IDC reports:
• 161 exabytes of data per year (2006)
• 1,800 exabytes of data per year (2011)
In 1998, Jim Gray predicted that the global amount of information would double every 18 months.
Most of this data is stored in data centers.
(Image: a Google data center)
Motivation
Large Data Center → Energy Consumption
• Annual electricity consumption is growing at a rate of 18% per year.
• An average data center in the UK consumes more power in a year than the city of Leicester.
• The US Environmental Protection Agency (EPA) reported that 61 billion kWh, 1.5% of US electricity consumption, is used for data centers.
• The annual electricity cost could reach $7.4 billion by 2011.
Motivation
Low Resource Utilization
• Bursty behavior: workloads vary significantly through the day and through the week.
• Fat provisioning: in order to guarantee QoS, resources have to be provisioned for the peak workloads.
(Figures: request rate for the World Cup site, May-June 1998; request rate for the www.ibm.com site, February 5-11, 2001.)
• The resource utilization ratios of IT systems in large enterprises are only around 35%; in some enterprises, only 15%.
• Google reported that servers are rarely completely idle and seldom operate near their maximum utilization; resource utilization is mostly between 10% and 50%.
Related Work
Traditional Approaches
• Dynamically turn cluster nodes on/off.
• Switch cluster nodes to a low-power state on demand.
• Aggregate hot data onto a few hot storage nodes (which are kept in the active state), and switch the nodes that store cold data to a low-power state.
Problem: waking a node from a low-power state incurs a large time penalty plus an energy penalty.
Related Work
Correlated File Accesses
• Correlated file access is a typical data access pattern.
• If an access to a file A was followed by an access to another file (e.g. file B) in the past, the probability that the next access to A will again be followed by an access to B is very high.
For example: A → B
• Correlated cold files stored on different nodes result in a time and energy penalty for every involved node (all three nodes in the illustrated example)!
Our strategy
Skew File Relocation (SFR) strategy
• Aggregate the frequently accessed files (hot data) onto the hot nodes.
• Furthermore, aggregate the correlated cold files onto the same cold storage node.
Time and energy penalties are reduced: from three nodes down to a single node (a sketch of the idea follows below).
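A minimal Python sketch of the SFR idea; the names and the hot_threshold value are illustrative assumptions, not code from the paper:

# Illustrative sketch of Skew File Relocation (SFR); names and the
# hot_threshold value are assumptions, not the paper's implementation.

def sfr_relocate(files, access_count, correlation, hot_nodes, cold_nodes,
                 hot_threshold=100):
    """Map each file to a node: hot files onto hot nodes, and each cold
    file onto the cold node holding its most correlated companions."""
    placement = {}
    hot_files = [f for f in files if access_count[f] >= hot_threshold]
    cold_files = [f for f in files if access_count[f] < hot_threshold]

    # Hot files are spread over the always-active hot nodes.
    for i, f in enumerate(hot_files):
        placement[f] = hot_nodes[i % len(hot_nodes)]

    # Each cold file joins the cold node whose resident files it is most
    # correlated with, so one wake-up serves the whole correlated group.
    for f in cold_files:
        placement[f] = max(
            cold_nodes,
            key=lambda n: sum(correlation.get((f, g), 0.0)
                              for g, node in placement.items() if node == n))
    return placement

Grouping correlated cold files this way is what shrinks the wake-up penalty from three nodes to one.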
System design
System architecture
• Storage nodes are divided into hot nodes and cold nodes.
• Hot files are relocated onto the hot nodes; cold files are relocated onto the cold nodes together with their correlated files.
• Cold nodes are switched to the standby state after their requests are finished.
System design
System process flowchart
1. First, a client sends a request to the metadata server.
2. If the request is a file-access request, the metadata server checks the information of that file.
3. If the file exists, the metadata server finds the node on which the file is stored and delivers the request to that node.
4. Finally, the node receives the request, and a data transmission is set up between the storage node and the client.
Note: the Frequency and Correlations Mining (FCM) method is used to mine file-access frequency and correlations. A simplified sketch of this flow is given below.
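A self-contained Python sketch of steps 1-4; the class and method names are assumptions for illustration:

# Illustrative request flow (steps 1-4); all names are assumptions.
class StorageNode:
    def __init__(self, name):
        self.name = name
        self.standby = False          # cold nodes go to standby when idle

    def serve(self, filename):
        if self.standby:              # waking a standby node costs time and energy
            self.standby = False
        return f"{self.name} transfers {filename} to the client"

class MetadataServer:
    def __init__(self, file_to_node):
        self.file_to_node = file_to_node        # file name -> StorageNode

    def handle(self, filename):
        node = self.file_to_node.get(filename)  # steps 2-3: check metadata
        if node is None:
            return "file not found"
        return node.serve(filename)             # step 4: node serves client

mds = MetadataServer({"a.dat": StorageNode("cold-3")})
print(mds.handle("a.dat"))   # -> "cold-3 transfers a.dat to the client"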
System design
Mining File-Access Frequency and Correlations
• Measuring the correlation of two files: two metrics.
• Support: the probability that the two files appear together in a transaction.
• Confidence: the probability that the file FB appears in a transaction given that the file FA appears.
Take files b and c as an example. Let T_i denote the i-th transaction and n the total number of transactions:

  Support(b, c) = |{ T_i : b ∈ T_i and c ∈ T_i }| / n

  Confidence(b → c) = |{ T_i : b ∈ T_i and c ∈ T_i }| / |{ T_i : b ∈ T_i }|

For Support, the numerator counts the transactions T_i, T_j, ..., T_k in which the two files appear together, and the denominator is the total number of transactions. For Confidence, the denominator is the number of transactions in which the first file appears. A worked example in code is given below.
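As a worked example, a short Python sketch of the two metrics over a list of transactions (the function names are illustrative, not the paper's):

# Support and confidence of a file pair over access transactions.
def support(a, b, transactions):
    n = len(transactions)
    both = sum(1 for t in transactions if a in t and b in t)
    return both / n

def confidence(a, b, transactions):
    has_a = sum(1 for t in transactions if a in t)
    both = sum(1 for t in transactions if a in t and b in t)
    return both / has_a if has_a else 0.0

# Worked example: b and c appear together in 2 of 4 transactions,
# and c appears in 2 of the 3 transactions that contain b.
ts = [{"b", "c"}, {"b"}, {"a", "b", "c"}, {"a"}]
print(support("b", "c", ts))      # 2/4 = 0.5
print(confidence("b", "c", ts))   # 2/3 ≈ 0.67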
Evaluation
File relocation: select the best cold node by comparing the degree of correlation.
Notice:
• The file-access correlation of two files whose support or confidence is less than the threshold value is considered to be 0.
• In other words, if the file-access correlation of two files is very weak, we treat it as a coincidence and assume there is no correlation between them.
A sketch of this selection rule is given below.
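A hedged Python sketch of this selection rule; the thresholds match the values used later in the evaluation (0.1 for support, 0.4 for confidence), while the helper names and the use of confidence as the correlation degree are assumptions:

# Choose the cold node whose resident files are most correlated with f.
# Pairs below either threshold contribute 0 (treated as coincidence).
SUPPORT_MIN = 0.1       # support threshold used in the evaluation
CONFIDENCE_MIN = 0.4    # confidence threshold used in the evaluation

def effective_correlation(sup, conf):
    return 0.0 if sup < SUPPORT_MIN or conf < CONFIDENCE_MIN else conf

def best_cold_node(f, cold_nodes, files_on, sup, conf):
    """files_on(node) lists files stored on a node; sup/conf score a pair."""
    def degree(node):
        return sum(effective_correlation(sup(f, g), conf(f, g))
                   for g in files_on(node))
    return max(cold_nodes, key=degree)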
Evaluation
Simulation Setup
• The experiment models a clustered storage system.
• 16 storage nodes, 1 metadata server.
• Among the storage nodes, 2 are used as hot nodes and the other 14 as cold nodes.
Evaluation
Simulation Setup: three traditional strategies for comparison
• High-Performance (HP) strategy: storage nodes are never switched to a standby state, and no files are relocated.
• File Relocate Once (FRO): FRO identifies hot and cold files after a learning stage. Storage nodes are divided into hot ones and cold ones. Then FRO executes the relocation only once, after the learning stage: hot files are relocated onto hot nodes and cold files onto cold nodes. FRO does not leverage the file correlations when relocating cold files.
• Equal File Relocate (EFR): EFR relocates files whenever a mismatch occurs. If a hot file resides on a cold node, EFR relocates it onto a hot node in round-robin fashion; similarly, if a cold file resides on a hot node, EFR relocates it onto a cold node in round-robin fashion. A sketch of EFR follows below.
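For contrast with SFR, a minimal Python sketch of EFR's round-robin relocation (the helper names are assumptions):

from itertools import cycle

# EFR: on a temperature mismatch, move the file to the next node of the
# matching temperature in round-robin order, ignoring correlations.
def efr_relocate(files, is_hot_file, placement, hot_nodes, cold_nodes):
    next_hot, next_cold = cycle(hot_nodes), cycle(cold_nodes)
    for f in files:
        on_hot_node = placement[f] in hot_nodes
        if is_hot_file(f) and not on_hot_node:
            placement[f] = next(next_hot)     # hot file stuck on a cold node
        elif not is_hot_file(f) and on_hot_node:
            placement[f] = next(next_cold)    # cold file stuck on a hot node
    return placement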

Evaluation
Platform Environment
(Tables: characteristics of the simulated storage node; simulation platform specifications.)
Evaluation
Platform Environment
(Table: characteristics of the network file system traces.)
Detailed information on these traces is available at: http://www.eecs.harvard.edu/sos/traces.html.
1. The 8:00-18:00 traces of a weekday are taken from these network file system traces as the workload.
2. The 8:00-10:00 portion is used as the learning sample provided to FCM.
3. All of the file relocation strategies mentioned above are adopted.
4. We set 0.1 as the threshold of support and 0.4 as the threshold of confidence.
Evaluation
Experiment results: Energy Consumption
• HP consumes the most energy because all storage nodes work as hot nodes.
• FRO does not consider file-access correlations, yet it still saved more than 11% energy compared with HP.
• EFR relocates files in a round-robin fashion.
• SFR leverages both the file-access frequency and the correlations, and saved the most energy compared with the other strategies.
Evaluation
Experiment results: Response Time
(Figures: average response time; variance of response time.)
Note that the measured value for HP is very small, very close to 0.
• Due to the bursty nature of storage workloads, meeting traditional response time guarantees requires significant over-provisioning of resource capacity.
• It is increasingly recognized that having the maximum compute power instantly available is not required, as long as the delivered Quality of Service (QoS) satisfies a predetermined standard.
• Power consumption is now treated as a metric as important as performance.
• It is reasonable to expect that user satisfaction will decrease as system response time increases.
• Hoxmeier and DiCesare reported that, for browser-based applications, the highest level of satisfaction occurred when the system response time was 3 s.
• However, satisfaction stayed high and fairly steady as the response time varied from 3 to 9 s.
• When the response time reached 12 s, there was a noticeable drop in satisfaction.
Conclusion
1. SFR leverages both the file-access frequency and the file correlations.
2. SFR can significantly reduce the energy consumption while maintaining the system performance at an acceptable level.
http://dsc.jnu.edu.cn
ICA3PP 2015: The 15th International Conference on Algorithms and Architectures for Parallel Processing. Zhangjiajie, China