Transcript RAID 2
Disaster Recovery Planning
(DRP)
DRP is the process of regaining access to the data,
hardware and software necessary to resume critical
business operations after a natural or human-induced
disaster.
A disaster recovery plan (DRP) should also include
plans for coping with the unexpected or sudden loss
of key personnel, although this is not covered in this
article, the focus of which is data protection.
DRP is part of a larger process known as business
continuity planning (BCP).
What is the difference between DRP
and BCP (1/2)
Disaster recovery is the process by which you
resume business after a disruptive event.
The event might be
something huge, like an earthquake or the terrorist
attacks on the World Trade Center, or
something small, like malfunctioning software caused by
a computer virus.
Given the human tendency to look on the bright
side, many business executives are prone to
ignoring "disaster recovery" because disaster seems
an unlikely event.
What is the difference between DRP
and BCP (2/2)
"Business continuity planning" suggests a more
comprehensive approach to making sure you can
keep making money.
Often, the two terms are married under the acronym
BC/DR.
At any rate, DR and/or BC determines how a
company will keep functioning after a disruptive
event until its normal facilities are restored.
What do these plans include
(1/2)
All BC/DR plans need to encompass
how employees will communicate
where they will go
how they will keep doing their jobs.
The details can vary greatly, depending on the
size and scope of a company and the way it
does business.
What do these plans include
(2/2)
For example, the plan at one global
manufacturing company is to
restore critical mainframes with vital data at a
backup site within four to six days of a disruptive
event,
obtain a mobile PBX unit with 3000 telephones
within two days
recover the company's 1000-plus LANs in order of
business need
set up a temporary call center for 100 agents at a
nearby training facility.
Events that necessitate
disaster recovery
Natural disasters
Fire
Power failure
Terrorist attacks
Organized or deliberate disruptions
Theft
System and/or equipment failures
Human error
Computer viruses
Testing
Prevention against data loss
(1/2)
Backups sent off-site at regular intervals
Create an insurance copy on Microfilm or
similar and store the records off-site.
Includes software as well as all data information,
to facilitate recovery
Use a Remote backup facility if possible to
minimize data loss
Storage Area Networks (SANs) over multiple
sites make data immediately available without
the need to recover or synchronize it
Prevention against data loss
(2/2)
Surge Protectors — to minimize the effect of
power surges on delicate electronic equipment
Uninterruptible Power Supply (UPS) and/or
Backup Generator
Fire Preventions — more alarms, accessible
extinguishers
Anti-virus software and other security
measures
Techniques and technology
Mirroring
RAID: RAID 0 to 6 and combinations
On-site data storage
Disk mirroring : Redundant arrays of inexpensive disks 1
(RAID1)
Server mirroring: web / ftp /email
Back up - Tape / optical disk
Off-site data storage (backup-site)
Cold sites
Warm sites
Hot site
Mirroring
Mirroring can occur locally or remotely.
Locally means that a server has a second hard drive that
stores data.
A remote mirror means that a remote server contains an
exact duplicate of the data. The second drive is called a
mirrored drive.
Data is written to the original drive when a write
request is issued. Data is then copied to the
mirrored drive, providing a mirror image of the
primary drive.
If one of the hard drives fails, all data is protected
from loss.
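The write path described above can be sketched in a few lines. This is a toy model only: the class name `MirroredVolume`, the in-memory lists standing in for drives, and the `fail` helper are all assumptions made for illustration, not part of any real mirroring product.

```python
class MirroredVolume:
    """Toy mirror: every write goes to the primary, then to the mirror."""

    def __init__(self, num_blocks):
        # Two in-memory lists stand in for physical drives (an assumption).
        self.drives = [[None] * num_blocks, [None] * num_blocks]
        self.failed = [False, False]

    def write(self, block, data):
        # Drive 0 is the primary and is written first; drive 1 is the mirror.
        for i, drive in enumerate(self.drives):
            if not self.failed[i]:
                drive[block] = data

    def fail(self, drive_index):
        # Mark a drive as dead so it is skipped from now on.
        self.failed[drive_index] = True

    def read(self, block):
        # Serve the read from the first drive that is still healthy.
        for i, drive in enumerate(self.drives):
            if not self.failed[i]:
                return drive[block]
        raise IOError("both drives have failed")


volume = MirroredVolume(num_blocks=4)
volume.write(0, b"payroll")
volume.fail(0)                 # the primary drive dies
surviving = volume.read(0)     # the data is still served from the mirror
```

Losing one drive leaves the data readable; losing both raises an error, which is why mirroring complements off-site backups rather than replacing them.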
Disk mirroring (RAID1)
The replication of logical
disk volumes onto separate
physical hard disks in real
time to ensure continuous
availability, currency and
accuracy.
A mirrored volume is a
complete logical
representation of separate
volume copies
Server mirroring
Mirror sites are most commonly used to provide multiple
sources of the same information, and are of particular value
as a way of providing reliable access to large downloads.
Mirroring is a type of file synchronization
Web server
To preserve a website or page, especially when it is closed or is about
to be closed.
To counteract censorship and promote freedom of information
Email server
To protect against loss of email information
ftp server
To allow faster downloads for users at a specific geographical location
Load balancing
Redundant arrays of
inexpensive disks (RAID)
The organization distributes the data across multiple
smaller disks, offering protection from a crash that
could wipe out all data on a single, shared disk.
Benefits of RAID include the following
Increased storage capacity per logical disk volume
High data transfer or I/O rates that improve information
throughput
Lower cost per megabyte of storage
Improved use of data center floor space
RAID0
RAID Level 0 (aka a stripe set or
striped volume) splits data evenly
across two or more disks (striped)
with no parity information for
redundancy.
It is important to note that RAID 0
provides zero data redundancy.
RAID 0 is normally used to increase
performance
A RAID 0 can be created with disks
of differing sizes, but the storage
space added to the array by each
disk is limited to the size of the
smallest disk
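As a rough illustration of striping and of the smallest-disk rule, the sketch below maps a logical block to a disk and computes usable capacity. The function names and the round-robin, one-block stripe unit are assumptions; real controllers use configurable stripe sizes.

```python
def raid0_locate(logical_block, num_disks):
    """Return (disk index, block offset on that disk) for a logical block."""
    return logical_block % num_disks, logical_block // num_disks


def raid0_capacity(disk_sizes):
    """Usable capacity when each disk contributes only the smallest size."""
    return min(disk_sizes) * len(disk_sizes)


# Logical block 5 on a two-disk stripe set lands on disk 1, offset 2.
location = raid0_locate(5, 2)

# A 100 GB disk striped with a 120 GB disk yields 200 GB, not 220 GB.
usable = raid0_capacity([100, 120])
```

Because consecutive blocks sit on different disks, large transfers can use all disks at once, which is where the performance gain comes from.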
RAID1
A RAID 1 creates an exact
copy (or mirror) of a set of
data on two or more disks.
This is useful when read
performance or reliability are
more important than data
storage capacity.
Such an array can only be as
big as the smallest member
disk.
A classic RAID 1 mirrored pair
contains two disks (see
diagram), which increases
reliability
RAID2
A RAID 2 stripes data at the bit (rather than block) level, and uses a
Hamming code for error correction.
Extremely high data transfer rates are possible.
RAID 2 is the only standard RAID level which can automatically recover
accurate data from single-bit corruption in data.
At the moment, there are no commercial implementations of RAID-2
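To show how a Hamming code recovers from single-bit corruption, here is a minimal sketch using the textbook Hamming(7,4) code. The code width and function names are assumptions for illustration; the slides do not say which Hamming code a RAID 2 array would use.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code_bits):
    """Fix at most one flipped bit and return the 4 data bits."""
    c = list(code_bits)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the bad bit, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                      # simulate one corrupted bit
recovered = hamming74_correct(stored)
```

In a bit-striped array each of the seven bits would live on a different disk, so one bad bit from one disk is corrected on the fly.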
RAID3
RAID Level 3 uses byte-level
striping with a dedicated parity
disk.
RAID 3 is very rare in practice.
One of the side-effects of RAID
3 is that it generally cannot
service multiple requests
simultaneously.
This comes about because any
single block of data will, by
definition, be spread across all
members of the set and will
reside in the same location.
So, any I/O operation requires
activity on every disk.
RAID4
RAID Level 4 uses block-level striping
with a dedicated parity disk.
This allows each member of the set to
act independently when only a single
block is requested.
RAID 4 looks similar to RAID 3 except
that it stripes at the block level, rather
than the byte level.
In the example, a read request for
block "A1" would be serviced by disk 0.
A simultaneous read request for block
B1 would have to wait, but a read
request for B2 could be serviced
concurrently by disk 1.
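The example can be checked with a small sketch. It assumes a four-disk layout (three data disks plus one parity disk) and labels where the letter names the stripe and the digit names the block within it; both are assumptions, since the diagram the slide refers to is not part of this transcript.

```python
def raid4_disk_for(label, num_disks=4):
    """Map a label such as "B2" to the data disk that stores it.

    The letter names the stripe and the digit the block within it; the
    last disk (num_disks - 1) is reserved for parity.
    """
    block_in_stripe = int(label[1:]) - 1
    if not 0 <= block_in_stripe < num_disks - 1:
        raise ValueError(f"{label} does not fit on {num_disks - 1} data disks")
    return block_in_stripe


def reads_can_overlap(label_a, label_b, num_disks=4):
    """Two single-block reads overlap only if they hit different disks."""
    return raid4_disk_for(label_a, num_disks) != raid4_disk_for(label_b, num_disks)


a1_and_b1 = reads_can_overlap("A1", "B1")   # both live on disk 0
a1_and_b2 = reads_can_overlap("A1", "B2")   # disk 0 and disk 1
```

Writes are a different matter: every write must also update the single parity disk, which therefore becomes the bottleneck.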
RAID5
A RAID 5 uses block-level striping with
parity data distributed across all
member disks.
RAID 5 has achieved popularity due to
its low cost of redundancy.
A minimum of 3 disks is generally
required for a complete RAID 5
configuration.
In the example, a read request for
block "A1" would be serviced by disk 0.
A simultaneous read request for block
B1 would have to wait, but a read
request for B2 could be serviced
concurrently by disk 1
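The slides do not say how the parity is computed; a byte-wise XOR is the usual choice and is assumed in this sketch. The block contents and the helper name `xor_blocks` are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks of one stripe
parity = xor_blocks(data)            # stored on the fourth disk for this stripe

# Simulate losing the disk that held data[1] and rebuild it from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
```

XOR-ing the surviving blocks with the parity block returns the missing block, so the array tolerates the loss of any one disk at the cost of one disk's worth of capacity.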
RAID6
A RAID 6 extends RAID 5 by
adding an additional parity
block, thus it uses block-level
striping with two parity blocks
distributed across all member
disks.
This improves reliability.
Like RAID 5, the parity is
distributed in stripes, with the
parity blocks in a different place
in each stripe.
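One way to picture "a different place in each stripe" is a rotating layout such as the one sketched below. The rotation rule and function names are assumptions chosen for illustration; real implementations use several different layouts.

```python
def raid6_parity_disks(stripe, num_disks):
    """Return the two disks holding parity for a stripe (rotating layout)."""
    if num_disks < 4:
        raise ValueError("RAID 6 needs at least four disks")
    first = (num_disks - 1 - stripe) % num_disks
    second = (first + 1) % num_disks
    return first, second


def raid6_usable_capacity(num_disks, disk_size):
    """Two disks' worth of space is consumed by the two parity blocks."""
    if num_disks < 4:
        raise ValueError("RAID 6 needs at least four disks")
    return (num_disks - 2) * disk_size


layout = [raid6_parity_disks(stripe, 5) for stripe in range(5)]
usable = raid6_usable_capacity(num_disks=6, disk_size=4)   # e.g. six 4 TB disks
```

Rotating the parity position spreads the parity update load over all disks instead of concentrating it on dedicated parity disks.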
Nested RAID
Storage Model
Storage Area Network
The Storage Network Industry Association (SNIA)
defines the SAN as a network whose primary
purpose is the transfer of data between computer
systems and storage elements.
A SAN consists of a communication infrastructure,
which provides physical connections; and a
management layer, which organizes the
connections, storage elements, and computer
systems so that data transfer is secure and robust.
SAN's definition
Put in simple terms, a SAN is a specialized,
high-speed network attaching servers and
storage devices
It is sometimes referred to as “the network
behind the servers.”
A SAN introduces the flexibility of networking
to enable one server or many heterogeneous
servers to share a common storage utility,
which may comprise many storage devices,
including disk, tape, and optical storage.
SAN Components
SAN Connectivity
The connectivity of storage and server components,
typically using Fibre Channel (FC).
SAN Storage
Tape / RAID / ESS (Enterprise Storage System)
/ JBOD (Just a Bunch Of Disks) / SSA (Serial Storage
Architecture)
SAN Server
Windows / Unix / Linux, etc.
Switched Fabric
An infrastructure specially designed to handle
storage communications called a fabric.
A typical Fibre Channel SAN fabric is made up
of a number of Fibre Channel switches.
Today, all major SAN equipment vendors also
offer some form of Fibre Channel routing
solution, and these bring substantial scalability
benefits to the SAN architecture by allowing
data to cross between different fabrics without
merging them.
Fibre Channel protocol
Fibre Channel is a layered protocol. It consists of 5 layers,
namely:
FC0 The physical layer, which includes cables, fiber optics,
connectors, pinouts etc.
FC1 The data link layer, which implements the 8b/10b
encoding and decoding of signals.
FC2 The network layer, defined by the FC-FS (Framing and
Signaling) standard, consists of the core of Fibre Channel,
and defines the main protocols.
FC3 The common services layer, a thin layer that could
eventually implement functions like encryption or RAID.
FC4 The Protocol Mapping layer. Layer in which other
protocols, such as SCSI, are encapsulated into an information
unit for delivery to FC2.
IP Storage Networking
FCIP (Fibre Channel over IP)
It is a method for allowing the transmission of Fibre
Channel information to be tunneled through the IP
network.
iFCP (Internet Fibre Channel Protocol)
It is a mechanism for transmitting data to and from
Fibre Channel storage devices in a SAN, or on the
Internet using TCP/IP
Internet SCSI (iSCSI)
It is a transport protocol that carries SCSI
commands from an initiator to a target.
FCIP (Fibre Channel over IP)
FCIP encapsulates FC frames within TCP/IP, allowing
islands of FC SANs to be interconnected over an IP-based network
TCP/IP is used as the underlying transport to provide
congestion control and in-order delivery of FC frames
All classes of FC frames are treated the same as
datagrams
End-station addressing, address resolution, message
routing, and other elements of the FC network
architecture remain unchanged
iFCP
iFCP is a gateway-to-gateway protocol for
implementing a Fibre Channel fabric over a TCP/IP network
Traffic between Fibre Channel devices is routed and
switched by the TCP/IP network
The iFCP layer maps Fibre Channel frames to a
predetermined TCP connection for transport
FC messaging and routing services are terminated at
the gateways so the fabrics are not merged with one
another
iSCSI
iSCSI is a SCSI transport protocol for mapping of
block-oriented storage data over TCP/IP networks
The iSCSI protocol enables universal access to
storage devices and Storage Area Networks (SANs)
over standard TCP/IP networks
Backup site
A backup site is a location where a business can
easily relocate following a disaster, such as fire,
flood, or terrorist threat. This is an integral part of
the disaster recovery plan of a business.
A backup site can be another location operated by
the business, or contracted via a company that
specializes in disaster recovery services.
In some cases, a business will have an agreement
with a second business to operate a joint disaster
recovery facility.
Cold Sites
A cold site is the most inexpensive type of backup
site for a business to operate.
It provides office space from which to operate.
It does not include backed up copies of data and
information from the original location of the business,
nor does it include hardware already set up.
The lack of hardware contributes to the minimal
startup costs of the cold site, but requires additional
time following the disaster to have the operation
running at a capacity close to that prior to the
disaster.
Warm Sites
A warm site is a location where the business
can relocate to after the disaster that is
already stocked with computer hardware
similar to that of the original site, but does
not contain backed up copies of data and
information.
Hot Sites
A hot site is a duplicate of the original site of the
business, with full computer systems as well as near-complete backups of user data.
Ideally, a hot site will be up and running within a
matter of hours. This type of backup site is the most
expensive to operate.
Hot sites are popular with stock exchanges and other
financial institutions who may need to evacuate due
to potential bomb threats and must resume normal
operations as soon as possible.
How to choose
Choosing the type is mainly decided by a
company's cost vs. benefit strategy.
Hot sites are traditionally more expensive than
cold sites since much of the equipment the
company needs has already been purchased
and thus the operational costs are higher.
However if the same company loses a
substantial amount of revenue for each day
they are inactive then it may be worth the
cost.
The advantage of a cold site is simple: cost.
It requires far fewer resources to operate a
cold site because no equipment has been
bought prior to the disaster.
The downside with a cold site is the potential
cost that must be incurred in order to make
the cold site effective.
The costs of purchasing equipment on very
short notice may be higher and the disaster
may make the equipment difficult to obtain.
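The cost versus benefit trade-off can be made concrete with a back-of-the-envelope calculation. Every figure below is invented for illustration, and the simple model (yearly site cost plus expected lost revenue) is an assumption, not a method given in the slides.

```python
def expected_annual_cost(site_cost, downtime_days, daily_revenue_loss,
                         disasters_per_year):
    """Yearly site cost plus the expected revenue lost while recovering."""
    return site_cost + disasters_per_year * downtime_days * daily_revenue_loss


# Hypothetical company losing 200,000 per day of downtime, with a
# disaster expected roughly once every ten years.
hot = expected_annual_cost(site_cost=500_000, downtime_days=0.5,
                           daily_revenue_loss=200_000, disasters_per_year=0.1)
cold = expected_annual_cost(site_cost=50_000, downtime_days=30,
                            daily_revenue_loss=200_000, disasters_per_year=0.1)
cheaper = "hot" if hot < cold else "cold"
```

With these made-up numbers the hot site costs about 510,000 per year against about 650,000 for the cold site, so the more expensive facility is the cheaper choice overall. A company with a small daily loss would reach the opposite conclusion.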
Disaster Recovery Planning steps (1/3)
Assess business impact and risk.
This should include an assessment of the business unit's
function and, preferably, a business impact analysis
(BIA).
The purpose of the assessment is to determine the
business unit's relative contribution to the larger
organization (monetary and functional).
The greater the potential impact, the more money a
company should spend to restore a system or process
quickly.
For instance, a stock trading company may decide to pay
for completely redundant IT systems that would allow it
to immediately start processing trades at another location.
Disaster Recovery Planning steps (2/3)
Develop a Disaster Recovery framework.
Data should be categorized by importance. Two
measures of importance are used, RTO and RPO.
Recovery Time Objective (RTO) is the acceptable
amount of time between the disaster and the post-disaster resumption of function (how long can we
wait to restore data?).
Recovery Point Objective (RPO) is the acceptable
data roll-back (how current does the data have to
be?).
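A quick way to use the two measures is to compare them against what the current backup setup can deliver. The sketch below assumes that the worst-case data loss equals the interval between backups and that a restore-time estimate is available; the function names and figures are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval, rpo):
    """Worst-case data loss is the time since the last backup."""
    return backup_interval <= rpo


def meets_rto(estimated_restore_time, rto):
    """The restore must finish within the acceptable downtime."""
    return estimated_restore_time <= rto


# Hypothetical order database: at most 4 hours of lost data and
# 8 hours of downtime are acceptable.
rpo = timedelta(hours=4)
rto = timedelta(hours=8)

nightly_ok = meets_rpo(timedelta(hours=24), rpo)   # nightly tape backup
hourly_ok = meets_rpo(timedelta(hours=1), rpo)     # hourly remote backup
restore_ok = meets_rto(timedelta(hours=6), rto)    # estimated restore time
```

Here a nightly backup fails the RPO while an hourly one meets it, which is the kind of gap that categorizing data by importance is meant to expose.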
Disaster Recovery Planning steps (3/3)
Develop a recovery strategy and then a
written Disaster Recovery Plan.
That written plan should address, at a minimum, the
detailed tasks for response, recovery, and resumption
of services.
Adjust information systems to make Disaster
Recovery easier.
This includes consolidating servers and data,
perhaps with a Storage Area Network or other
archival storage method.
Important factors (1/3)
Communication
Personnel — notify all key personnel of the
problem and assign them tasks focused toward the
recovery plan.
Customers — notifying clients about the problem
minimizes panic.
Recall backups
If backup tapes are taken offsite, these need to be
recalled. If using remote backup services, a
network connection to the remote backup location
(or the Internet) will be required.
Important factors (2/3)
Facilities
having backup hot sites or cold sites for larger
companies. Mobile recovery facilities are also
available from many suppliers.
Prepare your employees
during a disaster, employees are required to work
longer, more stressful hours, and a support
system should be in place to alleviate some of
the stress. Prepare them ahead of time to ensure
that work runs smoothly.
Important factors (3/3)
Business information
backups should be stored in a completely separate
location from the company
Testing the plan
provisions, directions, frequency for testing the
plan should be stipulated.
Things to do in DRP (1/4)
Here are 10 absolute basics your plan should cover:
1. Develop and practice a contingency plan that
includes a succession plan for your CEO.
2. Train backup employees to perform emergency
tasks. The employees you count on to lead in an
emergency will not always be available.
3. Determine offsite crisis meeting places for top
executives.
Things to do in DRP (2/4)
4. Make sure that all employees, as well as
executives, are involved in the exercises so that they
get practice in responding to an emergency.
5. Make exercises realistic enough to tap into
employees' emotions so that you can see how they'll
react when the situation gets stressful.
6. Practice crisis communication with employees,
customers and the outside world.
Things to do in DRP (3/4)
7. Invest in an alternate means of communication in
case the phone networks go down.
8. Form partnerships with local emergency response
groups (firefighters, police) to establish a good
working relationship. Let them become familiar with
your company and site.
Things to do in DRP (4/4)
9. Evaluate your company's performance during
each test, and work toward constant
improvement. Continuity exercises should
reveal weaknesses.
10. Test your continuity plan regularly to reveal
and accommodate changes. Technology,
personnel and facilities are in a constant state
of flux at any company.
Top mistakes in disaster
recovery (1/3)
1. Inadequate planning:
Have you identified all critical systems, and
do you have detailed plans to recover them to the current
day?
Everybody thinks they know what they have on their
networks, but most people don't really know how many
servers they have,
how they're configured, or what applications reside on
them: what services were running and
what version of software or operating systems they were
using.
Top mistakes in disaster
recovery (2/3)
2. Failure to bring the business into the planning and
testing of your recovery efforts.
3. Failure to gain support from senior-level managers.
The largest problems here are:
Not demonstrating the level of effort required for full
recovery.
Not conducting a business impact analysis and addressing all
gaps in your recovery model.
Top mistakes in disaster
recovery (3/3)
Not building adequate recovery plans that outline
your recovery time objective, critical systems and
applications, vital documents needed by the
business, and business functions by building
plans for operational activities to be continued
after a disaster.
Not having proper funding that will allow for a
minimum of semiannual testing.